TextToDocuments

A processor that takes a FlowFile that represents a text file containing multiple documents and splits the text, generating FlowFiles that represent individual documents.

Sometimes you might retrieve text files from a repository that you would prefer to ingest as multiple documents. To divide a file, you specify regular expressions that match the relevant parts of the document. The processor creates one or more child documents, which can all have metadata and content. The documents created by this processor are metadata-only documents.
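
As a rough illustration of the idea of dividing a file with regular expressions, the following Python sketch splits a text into child documents and extracts one metadata field from each. It is not the processor's implementation. The function name, the sample text, the patterns, and the TITLE field name are all hypothetical and chosen only for this example.

```python
import re


def split_into_documents(text, child_regex, field_regexes):
    """Split text into child documents (illustrative only).

    child_regex: pattern that matches one whole child document.
    field_regexes: maps a field name to a pattern whose first group
    captures the field value.
    """
    documents = []
    for match in re.finditer(child_regex, text, flags=re.DOTALL):
        chunk = match.group(0)
        fields = {}
        for name, pattern in field_regexes.items():
            found = re.search(pattern, chunk)
            if found:
                fields[name] = found.group(1).strip()
        documents.append({"fields": fields, "content": chunk})
    return documents


# Hypothetical input: two reports separated by a line containing "---".
sample = (
    "Title: First report\nBody of the first report.\n"
    "---\n"
    "Title: Second report\nBody of the second report.\n"
)

docs = split_into_documents(
    sample,
    child_regex=r"Title:.*?(?=\n---\n|\Z)",      # one child per "Title:" section
    field_regexes={"TITLE": r"Title:\s*(.+)"},   # metadata taken from each child
)

for doc in docs:
    print(doc["fields"])
```

In this sketch each match of the child pattern becomes one document, and the field patterns are applied only within that match, so metadata from one child cannot leak into another.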

The processor expects documents to use UTF-8 character encoding. If your documents are not encoded in UTF-8, you can use the SourceEncoding configuration parameter to specify the character encoding of the source documents, so that they can be converted to UTF-8.
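
The conversion that a source encoding setting implies can be pictured with a short Python sketch: decode the bytes using the declared source encoding, then encode the result as UTF-8. This is a conceptual illustration only and does not show how the processor performs the conversion. The function name and the ISO-8859-1 sample are assumptions made for the example.

```python
def to_utf8(raw_bytes, source_encoding="utf-8"):
    """Re-encode bytes from source_encoding to UTF-8 (illustrative only)."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical input: a file that was saved as ISO-8859-1 (Latin-1).
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

print(utf8_bytes.decode("utf-8"))
```

If the declared encoding does not match the actual bytes, the decode step either raises an error or produces garbled characters, which is why the source encoding needs to be stated correctly.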

If you are processing HTML, you might prefer to use the ContentFromHTML processor.

Properties

Name Default Value Description
IDOL License Service   An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.

Document Registry Service   A DocumentRegistryServiceImpl controller service that manages and updates a document registry database. This ensures that documents are indexed in the correct order.
Commit Batch Size 100 The processor outputs documents in batches to limit memory use and allow subsequent tasks to begin processing the documents sooner. This property specifies the maximum batch size.
TextToDocs_parameter   For information about the options that you can set, refer to the documentation for the TextToDocs task in the IDOL Connector Framework Server documentation.

NOTE: The NiFi Ingest processor does not read the FilenameMatchesRegex parameter. OpenText recommends that you route files to the processor only if you want to process them.

NOTE: In the NiFi Ingest processor, the IncludeMainDoc parameter is set to FALSE and is not configurable.
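
The effect of the Commit Batch Size property described in the table above can be pictured with the following Python sketch, which groups a stream of documents into batches of at most a given size. It is a conceptual illustration under stated assumptions, not the processor's code. The function name and the figure of 250 documents are hypothetical.

```python
from itertools import islice


def batches(documents, batch_size=100):
    """Yield lists of at most batch_size documents (illustrative only)."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical run: 250 extracted documents with the default batch size of 100.
sizes = [len(batch) for batch in batches(range(250), batch_size=100)]
print(sizes)
```

Only one batch is held in memory at a time in this sketch, and each batch becomes available to later steps as soon as it is full, which mirrors the two benefits that the property description names.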

Relationships

Name Description
failure Original FlowFiles that were not processed successfully because there were parsing errors.
success Original FlowFiles that were successfully processed.
extracted FlowFiles representing extracted documents.