Extract Metadata

This section demonstrates how to extract metadata from an HTML page and add it to a document field.

Consider the following HTML:

<h1>This is a title</h1>
<h2>This is a sub-title</h2>
<p class="important">This is <strong>important</strong> text</p>

From this HTML you could extract all of the headings and add them to a metadata field named heading. You could also extract the important text and add that to a separate document field.

The MetadataSelector and MetadataFieldName configuration parameters select the information to extract and name the destination document field. Set these parameters in numbered pairs, so that each MetadataSelector parameter has a MetadataFieldName parameter with the same number. The MetadataSelector parameter accepts standard CSS2 selectors.
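The pairing rule can be checked before a configuration is deployed. The following Python sketch is only an illustration: it assumes the settings have already been read into a dictionary, and the function name pair_metadata_params is hypothetical rather than part of any product API.

```python
import re

# Matches only the numbered parameters, for example MetadataSelector0,
# and ignores other settings such as MetadataSelectorExtractPlainText.
_NUMBERED = re.compile(r"^(MetadataSelector|MetadataFieldName)(\d+)$")


def pair_metadata_params(config):
    """Return (selector, field_name) tuples ordered by parameter number.

    Raises ValueError if a number has a selector without a field name,
    or a field name without a selector.
    """
    pairs = {}
    for key, value in config.items():
        match = _NUMBERED.match(key)
        if match:
            name, index = match.group(1), int(match.group(2))
            pairs.setdefault(index, {})[name] = value

    unpaired = [i for i in sorted(pairs) if len(pairs[i]) != 2]
    if unpaired:
        raise ValueError(f"unpaired metadata parameters for number(s): {unpaired}")

    return [
        (pairs[i]["MetadataSelector"], pairs[i]["MetadataFieldName"])
        for i in sorted(pairs)
    ]


# Hypothetical usage with a single complete pair.
settings = {"MetadataSelector0": "h1,h2,h3", "MetadataFieldName0": "heading"}
print(pair_metadata_params(settings))  # [('h1,h2,h3', 'heading')]
```

A selector with no matching field name (or the reverse) raises an error instead of being silently ignored, which makes a mistyped parameter number easy to spot.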

The following configuration extracts the headings and the important paragraph described above:

MetadataSelector0=h1,h2,h3
MetadataFieldName0=heading
MetadataSelector1=p.important
MetadataFieldName1=important_paragraph
MetadataSelectorExtractPlainText=TRUE

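The MetadataSelectorExtractPlainText parameter specifies whether to extract the selected content as plain text, with HTML markup removed. In this example it is set to TRUE, so the <strong> tags inside the important paragraph are not included in the extracted value.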

The configuration above would produce the following metadata fields:

#DREFIELD heading="This is a title"
#DREFIELD heading="This is a sub-title"
#DREFIELD important_paragraph="This is important text"
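
To experiment with selectors outside the product, the behavior shown above can be approximated with the Python standard library. This is a rough simulation, not the product's implementation: the names SelectorExtractor and extract_metadata are hypothetical, only the simple selector forms "tag" and "tag.class" (optionally comma-separated) are supported, and the text is always extracted as plain text, as with MetadataSelectorExtractPlainText=TRUE.

```python
from html.parser import HTMLParser


class SelectorExtractor(HTMLParser):
    """Collect plain text from elements matching 'tag' or 'tag.class' selectors."""

    def __init__(self, rules):
        super().__init__()
        # rules: list of (selector_group, field_name) tuples,
        # for example ("h1,h2,h3", "heading").
        self.rules = [
            (field, [self._parse(s) for s in group.split(",")])
            for group, field in rules
        ]
        self.stack = []    # open elements: (tag, list of active text buffers)
        self.matches = []  # (field_name, text_buffer) in document order

    @staticmethod
    def _parse(selector):
        tag, _, css_class = selector.strip().partition(".")
        return tag.lower(), css_class or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        buffers = []
        for field, selectors in self.rules:
            if any(t == tag and (c is None or c in classes) for t, c in selectors):
                parts = []
                self.matches.append((field, parts))
                buffers.append(parts)
        self.stack.append((tag, buffers))

    def handle_endtag(self, tag):
        # Pop back to the matching open element; ignore stray end tags.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        # Text belongs to every matched element that is still open, so
        # nested markup such as <strong> contributes text but no tags.
        for _, buffers in self.stack:
            for parts in buffers:
                parts.append(data)


def extract_metadata(html, rules):
    """Return '#DREFIELD name="value"' lines for each matched element."""
    parser = SelectorExtractor(rules)
    parser.feed(html)
    parser.close()
    lines = []
    for field, parts in parser.matches:
        text = "".join(parts).strip()
        lines.append(f'#DREFIELD {field}="{text}"')
    return lines


page = (
    "<h1>This is a title</h1>\n"
    "<h2>This is a sub-title</h2>\n"
    '<p class="important">This is <strong>important</strong> text</p>\n'
)
rules = [("h1,h2,h3", "heading"), ("p.important", "important_paragraph")]
for line in extract_metadata(page, rules):
    print(line)
```

Run against the example HTML, the sketch prints the same three #DREFIELD lines listed above. It does not handle descendant, attribute, or pseudo-class selectors.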