Extract Metadata
This section demonstrates how to extract metadata from an HTML page and add it to a document field.
Consider the following HTML:
<h1>This is a title</h1>
<h2>This is a sub-title</h2>
<p class="important">This is <strong>important</strong> text</p>
From this HTML you could extract all of the headings and add them to a metadata field named heading. You could also extract the important text and add that to a separate document field.
The configuration parameters MetadataSelector and MetadataFieldName select the information to extract and provide the name of the destination document field. These parameters must be set in numbered pairs (so that each MetadataSelector parameter has a matching MetadataFieldName). The MetadataSelector parameter accepts standard CSS2 selectors.
The following configuration would extract the information described above:
MetadataSelector0=h1,h2,h3
MetadataFieldName0=heading
MetadataSelector1=p.important
MetadataFieldName1=important_paragraph
MetadataSelectorExtractPlainText=TRUE
The parameter MetadataSelectorExtractPlainText specifies whether to extract the selected content as plain text (removing HTML markup, for example).
The configuration above would produce the following metadata fields:
#DREFIELD heading="This is a title"
#DREFIELD heading="This is a sub-title"
#DREFIELD important_paragraph="This is important text"
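To make the mapping from selectors to fields concrete, the following Python sketch emulates the example using only the standard library. It is an illustration under stated assumptions, not the connector's implementation: the RULES table, the FieldExtractor class, and the extract_fields function are hypothetical names, and the matching handles only the two selector forms used in this section (a list of tag names, and a tag name with a class). Markup inside a matched element is dropped, which mirrors the plain-text setting.

```python
from html.parser import HTMLParser

HTML = (
    '<h1>This is a title</h1> '
    '<h2>This is a sub-title</h2> '
    '<p class="important">This is <strong>important</strong> text</p>'
)

# Hypothetical stand-in for the numbered selector/field-name pairs.
# Each rule is (tag names, required class or None, destination field).
RULES = [
    ({"h1", "h2", "h3"}, None, "heading"),
    ({"p"}, "important", "important_paragraph"),
]


class FieldExtractor(HTMLParser):
    """Collects the plain text of each element that matches a rule."""

    def __init__(self):
        super().__init__()
        self.fields = []     # (field name, text) tuples in document order
        self._field = None   # destination field of the element being captured
        self._tag = None     # tag name of the element being captured
        self._depth = 0      # nesting depth of same-named tags
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            # Already capturing: only track nesting of the same tag name.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for tags, cls, field in RULES:
            if tag in tags and (cls is None or cls in classes):
                self._field, self._tag = field, tag
                self._depth, self._text = 1, []
                break

    def handle_data(self, data):
        # Text is kept; tags inside the matched element are discarded.
        if self._field is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.fields.append((self._field, "".join(self._text)))
                self._field = None


def extract_fields(html):
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


for name, value in extract_fields(HTML):
    print(f'#DREFIELD {name}="{value}"')
```

Run against the sample HTML, the sketch prints the same three fields listed above: two heading values and one important_paragraph value.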