Split Web Pages into Multiple Documents
You might want to split pages into multiple documents. For example, if you ingest pages from a discussion board, you might want to ingest one document for each message on the page.
Connector Framework Server (CFS) can create documents for sections of a Web page identified using CSS selectors. CFS creates a child document for each section of the page that is identified. Metadata fields (named CHILD_DOCUMENT) are added to the parent document, to refer to the child documents.
To split pages into multiple documents, add the following parameters to your WKOOPHtmlExtraction task:
ChildDocumentSelector | A CSS2 selector that identifies the root element of each child document in the page source.
ChildReferenceSelector | (Optional) An element in the child document that contains a value to use as the document reference. The value you extract should be unique for each child document, because it is used as part of the DREREFERENCE field in the child document. If you do not set this parameter, the connector uses a GUID. Specify the element using a CSS2 selector, relative to the element identified by ChildDocumentSelector.
For example, consider the following page, which represents messages on a discussion board:
<html>
  <head>
    <title>Example Page</title>
    <meta charset="utf-8">
  </head>
  <body>
    <div>
      <h1>Example Page</h1>
      <div class="content">
        <p>content</p>
      </div>
      <div class="message">
        <h1>Message 1</h1>
        <p class="meta">some metadata</p>
        <p>some content</p>
      </div>
      <div class="message">
        <h1>Message 2</h1>
        <p class="meta">some metadata</p>
        <p>some content</p>
      </div>
      ...
    </div>
  </body>
</html>
To create separate documents for the messages contained on this page, you could use the following configuration:
[MyTask]
...
ChildDocumentSelector=div.message
ChildReferenceSelector=h1
This example would produce the following child document (and a similar document for the second message):
#DREREFERENCE <current_document_reference>:<child_reference>
...
#DRECONTENT
Message 1
some metadata
some content
...
The value of the DREREFERENCE field is constructed from the reference of the original document and the value of the element identified by the ChildReferenceSelector configuration parameter. If you don't set this configuration parameter or the element is not found, CFS uses a GUID instead.
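To make the splitting and reference rules concrete, the following Python sketch imitates them on a small page. It is an illustration only, not how CFS is implemented: the names MessageSplitter and split_page and the URL http://example.com/board are hypothetical, and the two selectors (div.message and h1) are hard-coded instead of being evaluated by a real CSS2 engine. The second message deliberately has no h1 element, to show the GUID fallback.

```python
import uuid
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID_TAGS = {"meta", "br", "img", "hr", "input", "link"}


class MessageSplitter(HTMLParser):
    """Collects the text of each <div class="message"> and of its first <h1>."""

    def __init__(self):
        super().__init__()
        self.children = []    # finished child sections
        self._current = None  # section being collected, or None
        self._depth = 0       # open elements inside the current section
        self._in_h1 = False   # True while inside the reference element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            if tag == "h1" and self._current["reference"] is None:
                self._in_h1 = True
                self._current["reference"] = ""
        elif tag == "div" and "message" in (dict(attrs).get("class") or "").split():
            self._current = {"reference": None, "text": []}
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        if tag == "h1":
            self._in_h1 = False
        self._depth -= 1
        if self._depth == 0:  # the section's root element has closed
            self.children.append(self._current)
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is None or not text:
            return
        self._current["text"].append(text)
        if self._in_h1:
            self._current["reference"] += text


def split_page(html, parent_reference):
    """Return one dict per child section, with a reference and content."""
    parser = MessageSplitter()
    parser.feed(html)
    documents = []
    for child in parser.children:
        # Fall back to a GUID when no reference element was found.
        child_reference = child["reference"] or str(uuid.uuid4())
        documents.append({
            "DREREFERENCE": f"{parent_reference}:{child_reference}",
            "DRECONTENT": " ".join(child["text"]),
        })
    return documents


PAGE = """
<html><head><title>Example Page</title><meta charset="utf-8"></head>
<body><div>
  <h1>Example Page</h1>
  <div class="content"><p>content</p></div>
  <div class="message"><h1>Message 1</h1><p class="meta">some metadata</p><p>some content</p></div>
  <div class="message"><p>no heading here</p></div>
</div></body></html>
"""

docs = split_page(PAGE, "http://example.com/board")
for doc in docs:
    print(doc["DREREFERENCE"], "|", doc["DRECONTENT"])
```

In this sketch the first child's reference ends in ":Message 1", while the second ends in a randomly generated GUID, so that reference changes on every run.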
CFS adds the reference of the original document to the fields DREPARENTREFERENCE and DREROOTPARENTREFERENCE. It also adds an HTML_PROCESSING metadata field that contains any metadata and links that are extracted from the child document.
The DRECONTENT field is populated with text extracted from the HTML elements that you identified as belonging to the child document.
Connector Framework Server automatically adds fields to the parent document, named CHILD_DOCUMENT, that contain the references of associated child documents.
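The links between parent and child documents can be pictured with a short Python sketch. It only models the field names described in this section as (name, value) pairs. The function name link_documents and the example references are hypothetical, and the way CFS actually stores these fields may differ.

```python
def link_documents(parent_reference, child_references):
    """Model the linking fields on a parent document and its children."""
    # The parent gets one CHILD_DOCUMENT field per child reference.
    parent_fields = [("DREREFERENCE", parent_reference)]
    parent_fields += [("CHILD_DOCUMENT", ref) for ref in child_references]

    # Each child points back to the original document in both fields.
    children = [
        [
            ("DREREFERENCE", ref),
            ("DREPARENTREFERENCE", parent_reference),
            ("DREROOTPARENTREFERENCE", parent_reference),
        ]
        for ref in child_references
    ]
    return parent_fields, children


parent, children = link_documents(
    "http://example.com/board",
    ["http://example.com/board:Message 1", "http://example.com/board:Message 2"],
)
for name, value in parent:
    print(name, value)
```

A list of pairs is used instead of a dictionary because the CHILD_DOCUMENT field name repeats, once for each child.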