Split Files into Multiple Documents

You might retrieve files from a repository that you would prefer to ingest as multiple documents.

You can use the TextToDocs task to split a file that contains text (for example, an HTML or XML file) into multiple documents. To divide a file, you specify regular expressions that match the relevant parts of the document. The task creates a main document and one or more child documents, each of which can have metadata and content.

When you run TextToDocs on a document, the original document is discarded. The documents that TextToDocs creates are metadata-only documents, which means that they do not have an associated file and are not filtered by KeyView.

Configure the TextToDocs task as a Pre task. Specify the parameters for the task in a named section of the configuration file. For example:

[ImportTasks]
Pre0=TextToDocs:MyTextToDocs

[MyTextToDocs]
...

For information about the parameters that you can use to configure this task, refer to the Connector Framework Server Reference.

The TextToDocs task expects documents to use UTF-8 character encoding. If your documents are not encoded in UTF-8, you can set the SourceEncoding configuration parameter to the character set encoding of the source documents, so that they can be converted to UTF-8. If conversion fails, the original encoding is used and CFS adds an error message to the ImportErrorCode and ImportErrorDescription document fields.
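The conversion and fallback behavior described above can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not how CFS is implemented: the function name to_utf8 is hypothetical, and the returned error string merely stands in for the message that CFS writes to the ImportErrorCode and ImportErrorDescription fields.

```python
def to_utf8(raw: bytes, source_encoding: str):
    """Hypothetical helper: return (data, error).

    On success, data is the UTF-8 version of raw and error is None.
    On failure, data is the original bytes and error describes the problem.
    """
    try:
        return raw.decode(source_encoding).encode("utf-8"), None
    except (UnicodeDecodeError, LookupError) as exc:
        # Conversion failed: keep the original bytes and report the error.
        return raw, str(exc)


# A Latin-1 encoded source converts cleanly.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
converted, error = to_utf8(latin1_bytes, "iso-8859-1")

# A byte that is invalid in the declared encoding leaves the data unchanged.
unchanged, failure = to_utf8(b"\xff", "ascii")
```

In the first call the data comes back as UTF-8 with no error. In the second call the bytes are returned untouched together with an error description.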

Example

The following HTML is an example file that you might want to ingest as separate documents. It contains clearly delimited sections that could represent different topics:

<html>
  <body>
    <p class="main">Main content</p>

    <div class="section">
      <h1>First document</h1>
      <p class="metadata">Extract as metadata</p>
      <p>Some text</p>
    </div>

    <div class="section">
       <h1>Second document</h1>
       <p class="metadata">Extract as metadata</p>
       <p>Some text</p>
    </div>

    <div class="section">
       <h1>Third document</h1>
       <p class="metadata">Extract as metadata</p>
       <p>Some text</p>
    </div>

  </body>
</html>

You might want to split this file into a main document and three child documents, one of which might look like this:

#DREREFERENCE C:\MyFiles\TextToDocs\textToDocs.html:0
#DREDBNAME FileSystem
#DREFIELD MyMetadataField="Extract as metadata"
#DRECONTENT
First document
Some text

#DREENDDOC

To do this, you could use the following configuration:

[ImportTasks]
Pre0=TextToDocs:MyTextToDocs

[MyTextToDocs]
FilenameMatchesRegex0=.*\.html

MainRangeRegex0=<html>(.*)</html>
MainContentRegex0=<p class="main">(.*?)</p>

ChildrenRangeRegex0=<html>(.*)</html>
ChildRangeRegex0=<div class="section">(.*?)</div>
ChildContentRegex0=<h1>(.*?)</h1>
ChildContentRegex1=<p>(.*?)</p>
ChildFieldName0=MyMetadataField
ChildFieldRegex0=<p class="metadata">(.*?)</p>
ChildInheritFields=DREDBNAME

In this example, the FilenameMatchesRegex parameter has been set to process only those files that have the extension .html.

The MainContentRegex parameter identifies parts of the original document to add to the DRECONTENT field of the main document.

The ChildRangeRegex parameter identifies the parts of the original document that should become child documents. The sub-match (.*?) matches all of the content between a <div class="section"> tag and the next </div> tag. When this regular expression is applied to the example document above, there are three matches and therefore three child documents are created. It is important to make the sub-match lazy (.*? rather than .*), because a greedy sub-match would match everything between the first <div class="section"> and the final </div>, resulting in a single child document.
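You can check the difference between a lazy and a greedy sub-match outside CFS. The following minimal Python sketch is not part of the product. It uses Python's re module, which might differ in detail from the regular expression engine that CFS uses, and it assumes that . also matches newline characters (re.DOTALL in Python). The sample string is a condensed version of the example file.

```python
import re

# Condensed version of the example file: three sections.
html = (
    '<div class="section"><h1>First document</h1></div>\n'
    '<div class="section"><h1>Second document</h1></div>\n'
    '<div class="section"><h1>Third document</h1></div>\n'
)

# Lazy sub-match, as in the ChildRangeRegex example:
# each match stops at the first </div> that follows.
lazy = re.findall(r'<div class="section">(.*?)</div>', html, re.DOTALL)

# Greedy sub-match: a single match runs on to the final </div>.
greedy = re.findall(r'<div class="section">(.*)</div>', html, re.DOTALL)

print(len(lazy))    # 3
print(len(greedy))  # 1
```

With the lazy pattern, each section is a separate match. With the greedy pattern, all three sections collapse into one match.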

The ChildContentRegex parameter identifies the content to add to the DRECONTENT field of a child document. In this example, two regular expressions are used to extract content: the first matches the heading and the second matches paragraphs that have no class attribute. The ChildFieldName and ChildFieldRegex parameters populate metadata fields. In this example, a single field named MyMetadataField is created.
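To see which text each pattern captures, you can apply the same regular expressions to a single child range. The following Python sketch is an approximation only: it uses Python's re module rather than the CFS regular expression engine, and the way it collects matches into a list and a dictionary is an assumption made for illustration, not a description of how CFS assembles the document.

```python
import re

# One child range, as captured by the ChildRangeRegex sub-match in the example.
child = (
    '<h1>First document</h1>\n'
    '<p class="metadata">Extract as metadata</p>\n'
    '<p>Some text</p>\n'
)

# The two ChildContentRegex patterns from the example, applied in order.
content_patterns = [r'<h1>(.*?)</h1>', r'<p>(.*?)</p>']
content = []
for pattern in content_patterns:
    content.extend(re.findall(pattern, child, re.DOTALL))

# The ChildFieldName0 / ChildFieldRegex0 pair from the example.
fields = {
    'MyMetadataField': re.findall(
        r'<p class="metadata">(.*?)</p>', child, re.DOTALL
    )
}

print(content)  # ['First document', 'Some text']
print(fields)   # {'MyMetadataField': ['Extract as metadata']}
```

The pattern <p>(.*?)</p> does not match the metadata paragraph, because that paragraph's opening tag includes a class attribute. This is why the metadata text appears only in MyMetadataField and not in the content.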

Setting the parameter ChildInheritFields=DREDBNAME specifies that the child documents inherit the field DREDBNAME from the original document. If you are indexing documents into IDOL Server it is important to set this parameter, because (depending on how your system is configured) documents without a DREDBNAME field might not be indexed.