Extract Metadata

This section demonstrates how to extract metadata from an HTML page and add it to a document field.

Consider the following HTML:

<h1>This is a title</h1>
<h2>This is a sub-title</h2>
<p class="important">This is <strong>important</strong> text</p>

From this HTML you could extract all of the headings and add them to a document field named heading. You could also extract the important text and add it to a separate document field named important_paragraph.

The following configuration would extract the information described above:

[WkoopHtmlExtractionTask]
...
MetadataFieldSections0=ExtractHeadings
MetadataFieldSections1=ExtractImportant

[ExtractHeadings]
MetadataSelector=h1,h2,h3
MetadataFieldName=heading

[ExtractImportant]
MetadataSelector=p.important
MetadataFieldName=important_paragraph
MetadataExtractPlainText=TRUE

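Each numbered MetadataFieldSections parameter names a configuration section that defines one extraction rule. In each of those sections, MetadataSelector selects the elements to extract (for example, h1,h2,h3 matches any of the three heading levels, and p.important matches paragraphs with the class important), and MetadataFieldName provides the name of the destination document field.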

The parameter MetadataExtractPlainText specifies whether to extract the selected content as plain text, for example by removing HTML markup such as the <strong> tags in the paragraph above.

For the example HTML, the configuration above would produce the following metadata fields:

#DREFIELD heading="This is a title"
#DREFIELD heading="This is a sub-title"
#DREFIELD important_paragraph="This is important text"
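
To make the mapping from selectors to fields concrete, the following Python sketch imitates the behavior described above using only the standard library. It is an illustration, not the task's implementation: the names RULES, matches, FieldExtractor, and extract_fields are hypothetical, the selector handling covers only the tag and tag.class forms used in this example, and it always returns text content, so it does not model MetadataExtractPlainText being set to FALSE.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the [ExtractHeadings] and [ExtractImportant]
# sections; the real task reads these values from its configuration file.
RULES = [
    {"selector": "h1,h2,h3", "field": "heading"},
    {"selector": "p.important", "field": "important_paragraph"},
]

HTML = (
    '<h1>This is a title</h1>\n'
    '<h2>This is a sub-title</h2>\n'
    '<p class="important">This is <strong>important</strong> text</p>\n'
)


def matches(selector, tag, classes):
    """Return True if the element matches any comma-separated alternative.

    Only the 'tag' and 'tag.class' forms are handled in this sketch.
    """
    for alternative in selector.split(","):
        name, _, cls = alternative.strip().partition(".")
        if name == tag and (not cls or cls in classes):
            return True
    return False


class FieldExtractor(HTMLParser):
    """Collect the text content of every element that matches a rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.open_matches = []  # matched elements that are not yet closed
        self.fields = []        # (field name, value), in closing order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Track nesting so an inner <p> does not close an outer matched <p>.
        for match in self.open_matches:
            if match["tag"] == tag:
                match["depth"] += 1
        for rule in self.rules:
            if matches(rule["selector"], tag, classes):
                self.open_matches.append(
                    {"field": rule["field"], "tag": tag, "depth": 1, "parts": []}
                )

    def handle_data(self, data):
        # Text inside nested markup (such as <strong>) still belongs to
        # every enclosing matched element; the tags themselves are dropped.
        for match in self.open_matches:
            match["parts"].append(data)

    def handle_endtag(self, tag):
        for match in list(self.open_matches):
            if match["tag"] == tag:
                match["depth"] -= 1
                if match["depth"] == 0:
                    self.open_matches.remove(match)
                    value = "".join(match["parts"]).strip()
                    self.fields.append((match["field"], value))


def extract_fields(html, rules):
    parser = FieldExtractor(rules)
    parser.feed(html)
    parser.close()
    return parser.fields


for name, value in extract_fields(HTML, RULES):
    print(f'#DREFIELD {name}="{value}"')
```

Run against the example HTML, the sketch prints the same three #DREFIELD lines shown above. Because both headings match the same rule, the heading field receives two values, one for each matching element.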