ContentIntoSections

Splits IDOL document content into sections.

You might have documents with large amounts of content. For example, a 1000-page PDF file could be indexed as a single document. Dividing the content of long documents into sections can result in more relevant search results, because IDOL can return a specific part of a document in response to a query.

Properties

Name Default Value Description
IDOL License Service  

An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.

Section mandatory separators  

A comma-separated list of separators. The document content is split at every occurrence of any of these separators. The current section ends immediately after the separator, and a new section is created for the content that follows.

A separator string can be specified as a fixed string or as a regular expression. A separator is treated as a regular expression if it begins with an opening parenthesis “(” and ends with a closing parenthesis “)”.

Section separators Paragraph breaks, punctuation followed by white space, other suitable separators.

A comma-separated list of separators. The document content might be split at an occurrence of one of these separators, if the sections on either side of the break location are of appropriate sizes. Separators toward the left of the comma-separated list have priority over those toward the right.

If you do not set this property, a default list of separators is used.

For example, you might prefer to split content on paragraph breaks (%0A%0A). If the content includes a long stretch without paragraph breaks, the processor falls back to splitting on punctuation.

A separator string can be specified as a fixed string or as a regular expression. A separator is treated as a regular expression if it begins with an opening parenthesis “(” and ends with a closing parenthesis “)”.

Section minimum bytes 1500 The minimum number of bytes preferred for a section. This is not a hard limit, but section sizes are generally kept above this.
Section maximum bytes 3000 The maximum number of bytes preferred for a section. This is not a hard limit, but section sizes are generally kept below this.
Maximum sections 10000 The maximum number of sections to create. If the maximum number of sections is reached, the last section contains all of the remaining content.
Temp Directory temp A path to a location where content data can be written if required.
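
The following Python sketch illustrates how the separator and size properties described above could interact. It is not the processor's implementation. The function names (compile_separator, split_mandatory, split_preferred) are hypothetical, the separators are assumed to be already URL-unescaped, and the greedy choice of break point is an assumption made only for illustration.

```python
import re

def compile_separator(sep):
    """A separator of the form '(...)' is a regular expression; anything else is a fixed string."""
    if len(sep) >= 2 and sep.startswith("(") and sep.endswith(")"):
        return re.compile(sep)
    return re.compile(re.escape(sep))

def nbytes(s):
    """Size of a string in UTF-8 bytes (the size properties are expressed in bytes)."""
    return len(s.encode("utf-8"))

def split_mandatory(text, separators, max_sections=10000):
    """Split after every occurrence of any mandatory separator."""
    patterns = [compile_separator(s) for s in separators]
    # A section ends immediately after the separator, so cut at each match end.
    cuts = sorted({m.end() for p in patterns for m in p.finditer(text)
                   if m.end() > m.start()})
    sections, start = [], 0
    for cut in cuts:
        if len(sections) == max_sections - 1:
            break  # the last section keeps all of the remaining content
        if cut > start:
            sections.append(text[start:cut])
            start = cut
    if start < len(text):
        sections.append(text[start:])
    return sections

def split_preferred(text, separators, min_bytes=1500, max_bytes=3000):
    """Greedy stand-in for the size-aware split (assumed behavior, not the real algorithm)."""
    patterns = [compile_separator(s) for s in separators]
    sections, start = [], 0
    while nbytes(text[start:]) > max_bytes:
        cut = None
        for pattern in patterns:  # separators earlier in the list have priority
            ends = [m.end() for m in pattern.finditer(text, start)
                    if m.end() > m.start()
                    and min_bytes <= nbytes(text[start:m.end()]) <= max_bytes]
            if ends:
                cut = ends[-1]  # latest fitting break keeps the section as large as allowed
                break
        if cut is None:
            break  # sizes are soft limits: leave the rest as one oversized section
        sections.append(text[start:cut])
        start = cut
    if start < len(text):
        sections.append(text[start:])
    return sections
```

In this sketch, calling split_preferred with the separators ["\n\n", ". "] breaks on a paragraph break whenever one yields a section within the size range, and considers punctuation only when no paragraph break fits.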

NOTE: When you set the Section mandatory separators and Section separators properties, the comma (,) and percent (%) characters must be URL-escaped (as %2C and %25 respectively). OpenText recommends that you also escape white space and other non-displaying characters (for example, a space as %20 and a line feed as %0A). You can also escape multi-byte UTF-8 characters as a sequence of URL-encoded bytes.
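
As an illustration of the escaping rules in the note, the following Python sketch builds a property value from a list of raw separator strings. The helper names (encode_separators, decode_separators) are hypothetical, and the decoding order shown (split on literal commas, then unescape each item) is an assumption about how the value is read, not documented behavior.

```python
import string
from urllib.parse import quote, unquote

# Leave printable ASCII punctuation readable, except the two characters
# that must always be escaped: comma and percent.
SAFE = "".join(c for c in string.punctuation if c not in ",%")

def encode_separators(separators):
    """Join raw separator strings into one URL-escaped, comma-separated value."""
    # quote() escapes white space, control characters, and non-ASCII
    # characters (as UTF-8 bytes), plus anything not listed in SAFE.
    return ",".join(quote(sep, safe=SAFE) for sep in separators)

def decode_separators(value):
    """Assumed inverse: split on literal commas, then unescape each item."""
    return [unquote(part) for part in value.split(",")]

# A paragraph break, a period followed by a space, and a comma followed by a space.
value = encode_separators(["\n\n", ". ", ", "])
print(value)  # %0A%0A,.%20,%2C%20
```

A regular-expression separator that contains a comma, such as (\n{2,}), is encoded the same way and becomes (\n{2%2C}), so the comma is not mistaken for a list delimiter.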

Relationships

Name Description
success Original FlowFiles that were successfully processed.
failure FlowFiles that had an invalid or unknown format.