You might want to split Web pages into multiple documents. For example, if you ingest pages from a discussion board, you might want to ingest one document for each message on the page.
The HPE Web Connector can create documents for sections of a Web page identified using CSS selectors. For each Web page, the connector creates a parent document and an associated file that contains the full page source. It then creates a child document for each section of the page. Each child document has an associated file that contains the <head> element from the original page, and a <body> element containing the content identified by your CSS selector. The parent document includes metadata fields (named CHILD_DOCUMENT) that refer to the child documents, and each child document has a metadata field (PAGE_REFERENCE) that refers back to the parent document.
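As a rough illustration of the splitting described above, the following Python sketch builds one child file per matched section. It is not the connector's implementation. It assumes well-formed markup, because it uses the standard-library ElementTree parser, and it uses the XPath expression `.//div[@class='message']` as a stand-in for the CSS selector `div.message`. The function name `split_page` and the sample page are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse loose HTML).
PAGE = """<html>
<head><title>Example Page</title></head>
<body>
<div class="message"><h1>Message 1</h1><p>some content</p></div>
<div class="message"><h1>Message 2</h1><p>some content</p></div>
</body>
</html>"""


def split_page(page_source, section_xpath):
    """Return one child file (as a string) per section matched by section_xpath.

    Each child file holds a copy of the original <head> and a <body>
    that contains only the matched section.
    """
    root = ET.fromstring(page_source)
    head = root.find("head")
    child_files = []
    for section in root.iterfind(section_xpath):
        child_root = ET.Element("html")
        if head is not None:
            child_root.append(copy.deepcopy(head))
        body = ET.SubElement(child_root, "body")
        body.append(copy.deepcopy(section))
        child_files.append(ET.tostring(child_root, encoding="unicode"))
    return child_files


# XPath stand-in for the CSS selector "div.message".
children = split_page(PAGE, ".//div[@class='message']")
print(len(children))  # one child file per message
```

The parent file in this picture is simply the unmodified `PAGE` string; only the child files are assembled.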
To split Web pages into multiple documents
Stop the connector and open the configuration file.
Modify your fetch task by adding the following parameters:
| Parameter | Description |
| --- | --- |
| ChildDocumentSection | (Optional) A comma-separated list of sections in the HPE Web Connector configuration file that contain settings for creating child documents. If you do not set this parameter, the connector uses the settings in the TaskName section. |
| ChildDocumentUrlRegex | (Optional) A Perl-compatible regular expression to identify the pages to generate child documents from. The connector does not attempt to generate child documents from a page unless the full URL of the page matches the regular expression. |
| ChildDocumentSelector | A CSS2 selector that identifies the root element of each child document in the page source. |
| ChildReferenceSelector | (Optional) An element in the child document that contains a value to use as the document reference. The value you extract should be unique for each child document, because it is used as part of the DREREFERENCE field in the child document. If you do not set this parameter, the connector uses a GUID. Specify the element using a CSS2 selector, relative to the element identified by ChildDocumentSelector. |
| ChildMetadataFieldName | (Optional) The names to use for document fields (in child documents) that contain information extracted using the parameter ChildMetadataSelector. This parameter must have the same number of values as ChildMetadataSelector. |
| ChildMetadataSelector | (Optional) A list of elements in the child document that contain metadata. The metadata in these elements is extracted and added to the metadata fields of child documents. To specify the names of the document fields that contain the extracted information, set the configuration parameter ChildMetadataFieldName. |
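The constraints in this table can be checked before you restart the connector. The following Python sketch is one possible way to do that, under several assumptions that do not come from the product documentation: that the task section can be read with Python's `configparser`, that multiple values are written with numeric suffixes (as in `ChildMetadataFieldName0`), that the whole URL must match `ChildDocumentUrlRegex`, and that Python's `re` module behaves like a Perl-compatible engine for the simple pattern used here. The URL pattern and all function names are hypothetical.

```python
import configparser
import re

# Hypothetical task section; the URL pattern is only an illustration.
CONFIG_TEXT = r"""
[MyTask]
ChildDocumentUrlRegex=.*/forum/.*
ChildDocumentSelector=div.message
ChildReferenceSelector=h1
ChildMetadataFieldName0=my_metadata
ChildMetadataSelector0=p.meta
"""


def load_task(config_text, task_name):
    """Parse the configuration text and return one task section as a dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(config_text)
    return dict(parser[task_name])


def numbered_values(task, prefix):
    """Collect values of parameters named prefix, prefix0, prefix1, ..."""
    pattern = re.compile(re.escape(prefix) + r"\d*")
    return [value for name, value in task.items() if pattern.fullmatch(name)]


def check_task(task):
    """Return a list of problems found in the child document settings."""
    problems = []
    # ChildDocumentSelector is the only parameter not marked as optional.
    if "ChildDocumentSelector" not in task:
        problems.append("ChildDocumentSelector is not set")
    names = numbered_values(task, "ChildMetadataFieldName")
    selectors = numbered_values(task, "ChildMetadataSelector")
    if len(names) != len(selectors):
        problems.append("ChildMetadataFieldName and ChildMetadataSelector "
                        "must have the same number of values")
    return problems


def should_split(task, url):
    """Decide whether child documents would be generated for this URL."""
    url_regex = task.get("ChildDocumentUrlRegex")
    if url_regex is None:
        return True  # assumption: no regex means every page qualifies
    # Assumption: the whole URL has to match, so fullmatch is used.
    return re.fullmatch(url_regex, url) is not None


task = load_task(CONFIG_TEXT, "MyTask")
print(check_task(task))
print(should_split(task, "http://www.example.com/forum/page1.htm"))
```

Returning a list of problems, rather than raising on the first one, lets you report every mismatch in a task section at once.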
Consider the following example page, which represents messages on a page of a discussion board:
<html>
<head>
    <title>Example Page</title>
    <meta charset="utf-8">
</head>
<body>
    <div>
        <h1>Example Page</h1>
        <div class="content">
            <p>content</p>
        </div>
        <div class="message">
            <h1>Message 1</h1>
            <p class="meta">some metadata</p>
            <p>some content</p>
        </div>
        <div class="message">
            <h1>Message 2</h1>
            <p class="meta">some metadata</p>
            <p>some content</p>
        </div>
        ...
    </div>
</body>
</html>
To create separate documents for the messages contained on this page, you could use the following configuration:
[MyTask]
...
ChildDocumentSelector=div.message
ChildReferenceSelector=h1
ChildMetadataFieldName0=my_metadata
ChildMetadataSelector0=p.meta
This example would produce the following child document:
#DREREFERENCE [child][Message 1]http://www.hp.com/example.htm
#DREFIELD my_metadata="some metadata"
#DREFIELD PAGE_REFERENCE="http://www.hp.com/example.htm"
The PAGE_REFERENCE field is added automatically by HPE Web Connector and contains the reference of the parent document.
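To make the mapping from the configuration to this output concrete, the following Python sketch applies stand-ins for the selectors to one message and formats the result in the same style. It is an illustration only. The XPath expressions `h1` and `p[@class='meta']` approximate the CSS selectors `h1` and `p.meta`, the reference layout `[child][value]URL` is inferred from the example output rather than from a specification, and the function name `child_idx` is hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE_URL = "http://www.hp.com/example.htm"

# One message from the example page, as well-formed markup.
MESSAGE = """<div class="message">
  <h1>Message 1</h1>
  <p class="meta">some metadata</p>
  <p>some content</p>
</div>"""


def child_idx(section, page_url):
    """Format a child document in the style of the example output."""
    # Stand-in for ChildReferenceSelector=h1
    reference_value = section.findtext("h1")
    # Stand-in for ChildMetadataSelector0=p.meta
    metadata_value = section.findtext("p[@class='meta']")
    lines = [
        f"#DREREFERENCE [child][{reference_value}]{page_url}",
        f'#DREFIELD my_metadata="{metadata_value}"',
        f'#DREFIELD PAGE_REFERENCE="{page_url}"',
    ]
    return "\n".join(lines)


print(child_idx(ET.fromstring(MESSAGE), PAGE_URL))
```

Both lookups are relative to the message element, which mirrors how ChildReferenceSelector is evaluated relative to the element identified by ChildDocumentSelector.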
The following file is associated with the child document. The HPE Web Connector includes the <head> element from the Web page and adds the content identified by ChildDocumentSelector to the <body> element. If you send your documents to CFS, this file is filtered by KeyView and the information it contains is used as the document content.
<html>
<head>
    <title>Example Page</title>
    <meta charset="utf-8">
</head>
<body>
    <div class="message">
        <h1>Message 1</h1>
        <p class="meta">some metadata</p>
        <p>some content</p>
    </div>
</body>
</html>
A similar document and file would be ingested for the second message.
The parent document contains the full page content. The HPE Web Connector automatically adds fields named CHILD_DOCUMENT, containing the references of associated child documents:
#DREREFERENCE http://www.hp.com/example.htm
...
#DREFIELD CHILD_DOCUMENT="[child][Message 1]http://www.hp.com/example.htm"
#DREFIELD CHILD_DOCUMENT="[child][Message 2]http://www.hp.com/example.htm"
...
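Because parent and child documents refer to each other, you can check ingested output for broken links. The following Python sketch shows one way this might be done. It assumes that documents are available as text in the form shown in this section, and it parses only `#DREREFERENCE` lines and `#DREFIELD name="value"` lines, not the full IDX format. The function names are hypothetical.

```python
import re

PARENT = """#DREREFERENCE http://www.hp.com/example.htm
#DREFIELD CHILD_DOCUMENT="[child][Message 1]http://www.hp.com/example.htm"
#DREFIELD CHILD_DOCUMENT="[child][Message 2]http://www.hp.com/example.htm"
"""

CHILDREN = [
    """#DREREFERENCE [child][Message 1]http://www.hp.com/example.htm
#DREFIELD PAGE_REFERENCE="http://www.hp.com/example.htm"
""",
    """#DREREFERENCE [child][Message 2]http://www.hp.com/example.htm
#DREFIELD PAGE_REFERENCE="http://www.hp.com/example.htm"
""",
]

FIELD = re.compile(r'#DREFIELD\s+(\w+)="(.*)"')


def parse_document(text):
    """Return (reference, fields), where fields maps a name to a list of values."""
    reference = None
    fields = {}
    for line in text.splitlines():
        if line.startswith("#DREREFERENCE "):
            reference = line[len("#DREREFERENCE "):].strip()
            continue
        match = FIELD.fullmatch(line.strip())
        if match:
            fields.setdefault(match.group(1), []).append(match.group(2))
    return reference, fields


def unresolved_links(parent_text, child_texts):
    """List child references that the parent and children do not agree on."""
    parent_ref, parent_fields = parse_document(parent_text)
    listed = set(parent_fields.get("CHILD_DOCUMENT", []))
    linked_back = set()
    for text in child_texts:
        child_ref, child_fields = parse_document(text)
        if parent_ref in child_fields.get("PAGE_REFERENCE", []):
            linked_back.add(child_ref)
    # Symmetric difference: listed but not linked back, or the reverse.
    return sorted(listed ^ linked_back)


print(unresolved_links(PARENT, CHILDREN))
```

An empty result means every CHILD_DOCUMENT value in the parent has a child whose PAGE_REFERENCE points back to it, and no child points to the parent without being listed.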