Extract Metadata
This section demonstrates how to extract metadata from an HTML page and add it to a document field.
Consider the following HTML:
<h1>This is a title</h1>
<h2>This is a sub-title</h2>
<p class="important">This is <strong>important</strong> text</p>
From this HTML you could extract all of the headings and add them to a metadata field named heading. You could also extract the important text and add that to a separate document field.
The following configuration would extract the information described above:
[WkoopHtmlExtractionTask]
...
MetadataFieldSections0=ExtractHeadings
MetadataFieldSections1=ExtractImportant

[ExtractHeadings]
MetadataSelector=h1,h2,h3
MetadataFieldName=heading

[ExtractImportant]
MetadataSelector=p.important
MetadataFieldName=important_paragraph
MetadataExtractPlainText=TRUE
In each section, the MetadataSelector parameter selects the information to extract, and the MetadataFieldName parameter provides the name of the destination document field.
The MetadataExtractPlainText parameter specifies whether to extract the selected content as plain text (for example, by removing HTML markup).
The configuration above would produce the following metadata fields:
#DREFIELD heading="This is a title"
#DREFIELD heading="This is a sub-title"
#DREFIELD important_paragraph="This is important text"
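
To make the mapping from selectors to fields concrete, the following Python sketch reproduces the example outside the product. It is an illustration only, not how the extraction task is implemented. It assumes that selectors are limited to the simple tag and tag.class forms used above, and it always returns plain text, as if MetadataExtractPlainText were TRUE for every section. The names RULES, matches, FieldExtractor, and extract are hypothetical and exist only in this sketch.

```python
from html.parser import HTMLParser

HTML = ('<h1>This is a title</h1> <h2>This is a sub-title</h2> '
        '<p class="important">This is <strong>important</strong> text</p>')

# Hypothetical mirror of the two configuration sections:
# (MetadataSelector, MetadataFieldName).
RULES = [
    ("h1,h2,h3", "heading"),
    ("p.important", "important_paragraph"),
]


def matches(selector, tag, classes):
    """Return True if the tag matches any comma-separated alternative.

    Only 'tag' and 'tag.class' alternatives are understood (an assumption
    made for this sketch, not a statement about the real selector syntax).
    """
    for alternative in selector.split(","):
        name, _, cls = alternative.strip().partition(".")
        if name == tag and (not cls or cls in classes):
            return True
    return False


class FieldExtractor(HTMLParser):
    """Collect the text of every element that matches a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.fields = []    # (field name, text) pairs in document order
        self.active = None  # the element currently being captured, if any

    def handle_starttag(self, tag, attrs):
        if self.active is not None:
            # Track nested elements with the same name so that the
            # capture ends at the correct closing tag.
            if tag == self.active["tag"]:
                self.active["depth"] += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for selector, field in self.rules:
            if matches(selector, tag, classes):
                self.active = {"tag": tag, "field": field,
                               "parts": [], "depth": 1}
                return

    def handle_data(self, data):
        # Text inside nested markup such as <strong> is kept; the tags
        # themselves are dropped, which gives the plain-text value.
        if self.active is not None:
            self.active["parts"].append(data)

    def handle_endtag(self, tag):
        if self.active is not None and tag == self.active["tag"]:
            self.active["depth"] -= 1
            if self.active["depth"] == 0:
                text = "".join(self.active["parts"]).strip()
                self.fields.append((self.active["field"], text))
                self.active = None


def extract(html, rules):
    parser = FieldExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.fields


for name, value in extract(HTML, RULES):
    print(f'#DREFIELD {name}="{value}"')
```

Run against the sample HTML, the sketch prints the same three field lines shown above. Because both headings match the same rule, the heading field appears twice, once for each matching element.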