ContentIntoSections
Splits IDOL document content into sections.
You might have documents with large amounts of content. For example, a 1000-page PDF file could be indexed as a single document. Dividing the content of long documents into sections can result in more relevant search results, because IDOL can return a specific part of a document in response to a query.
Properties
Name | Default Value | Description |
---|---|---|
IDOL License Service | | An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server. |
Section mandatory separators | | A comma-separated list of separators. The document content is split at every occurrence of any of these separators. The current section ends immediately after the separator, and a new section is created for the content that follows. A separator can be specified as a fixed string or as a regular expression. A separator is treated as a regular expression if it begins with an open parenthesis “(”. |
Section separators | Paragraph breaks, punctuation followed by white space, other suitable separators. | A comma-separated list of separators. The document content might be split at an occurrence of one of these separators, if the sections on either side of the break location are of appropriate sizes. Separators toward the left of the list have priority over those toward the right. If you do not set this property, a default list of separators is used. For example, you might prefer to split content on paragraph breaks. A separator can be specified as a fixed string or as a regular expression. A separator is treated as a regular expression if it begins with an open parenthesis “(”. |
Section minimum bytes | 1500 | The minimum number of bytes preferred for a section. This is not a hard limit, but section sizes are generally kept above this. |
Section maximum bytes | 3000 | The maximum number of bytes preferred for a section. This is not a hard limit, but section sizes are generally kept below this. |
Maximum sections | 10000 | The maximum number of sections to create. If the maximum number of sections is reached, the last section contains all of the remaining content. |
Temp Directory | temp | A path to a location where content data can be written if required. |
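
The interaction between mandatory separators, prioritized optional separators, the preferred size range, and the section limit can be illustrated with a small sketch. This is not the processor's implementation, which is not described here. The function names are hypothetical, sizes are measured in characters rather than bytes (the two match only for ASCII text), and the hard cut at the maximum size when no separator fits is an assumption.

```python
import re

def _pattern(sep):
    # A separator beginning with "(" is treated as a regular expression,
    # otherwise as a fixed string (the rule stated in the table above).
    return re.compile(sep if sep.startswith("(") else re.escape(sep))

def split_mandatory(text, mandatory):
    """Split after every occurrence of any mandatory separator."""
    if not mandatory:
        return [text]
    combined = re.compile("|".join(_pattern(s).pattern for s in mandatory))
    pieces, start = [], 0
    for m in combined.finditer(text):
        if m.end() == m.start():
            continue  # ignore empty matches
        pieces.append(text[start:m.end()])  # separator stays with the ending section
        start = m.end()
    if start < len(text) or not pieces:
        pieces.append(text[start:])
    return pieces

def find_break(chunk, optional, min_size, max_size):
    """Pick a cut point, trying separators in priority order (leftmost first)."""
    for sep in optional:
        ends = [m.end() for m in _pattern(sep).finditer(chunk, 0, max_size)
                if m.end() >= max(min_size, 1)]
        if ends:
            return ends[-1]  # latest break that keeps the section within max_size
    return max_size  # assumption: hard cut when no separator fits

def split_into_sections(text, mandatory=(), optional=("\n\n", ". "),
                        min_size=1500, max_size=3000, max_sections=10000):
    sections = []
    for piece in split_mandatory(text, mandatory):
        while len(piece) > max_size:
            cut = find_break(piece, optional, min_size, max_size)
            sections.append(piece[:cut])
            piece = piece[cut:]
        sections.append(piece)
    # When the limit is reached, the last section holds all remaining content.
    if len(sections) > max_sections:
        head = sections[:max_sections - 1]
        sections = head + ["".join(sections[max_sections - 1:])]
    return sections
```

For example, `split_into_sections(text, mandatory=["\f"], min_size=10, max_size=25)` always ends a section after each form feed, and splits longer runs at a paragraph break in preference to a sentence end.
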
NOTE: When you set the Section mandatory separators and Section separators properties, the comma (,) and percent (%) characters must be URL-escaped (as %2C and %25 respectively). Micro Focus recommends that you escape white space and other non-displaying characters in the same way (for example, %20 for a space and %0A for a line feed). You can also escape multi-byte UTF-8 characters as multiple URL-encoded bytes.
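
The escaping rules in the note can be applied programmatically when you prepare a property value. The following Python sketch uses the standard library function `urllib.parse.quote`. The helper names `escape_separator` and `build_property_value` are hypothetical, and leaving other ASCII punctuation (such as parentheses and backslashes in regular expressions) unescaped is an assumption rather than documented behavior.

```python
import string
from urllib.parse import quote

# Leave ASCII punctuation untouched, except the comma and percent characters,
# which must be escaped. Letters and digits are never escaped by quote().
_SAFE = "".join(c for c in string.punctuation if c not in ",%")

def escape_separator(sep: str) -> str:
    """Percent-encode commas, percent signs, white space, control
    characters, and non-ASCII characters (as UTF-8 bytes)."""
    return quote(sep, safe=_SAFE)

def build_property_value(separators) -> str:
    """Join escaped separators into a comma-separated list,
    highest-priority separator first."""
    return ",".join(escape_separator(s) for s in separators)

# A paragraph break, then a full stop followed by a space, then a comma and space.
print(build_property_value(["\n\n", ". ", ", "]))
# prints: %0A%0A,.%20,%2C%20
```

Because only the unescaped commas act as list delimiters, a literal comma inside a separator does not split the list.
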
Relationships
Name | Description |
---|---|
success | Original FlowFiles that were successfully processed. |
failure | FlowFiles that had an invalid or unknown format. |