FilterBadContent

A processor that discards FlowFiles with unacceptable document content, for example FlowFiles where the document content is not suitable for indexing.

The processor filters documents based on:

  • The percentage of binary content.
  • The percentage of symbolic content. A symbolic character is any character between U+2000 and U+2FFF.
  • The average word length.
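
The three measures above can be sketched in Python as follows. This is a minimal illustration, not the processor's implementation. The source defines only the symbolic range (U+2000 to U+2FFF), so the definition of "binary" used here (control characters other than tab, line feed, and carriage return, plus the U+FFFD replacement character), the whitespace-based word splitting, the inclusive range endpoints, and all function names are assumptions.

```python
def symbolic_percentage(text: str) -> float:
    """Percentage of characters in U+2000..U+2FFF (endpoints assumed inclusive)."""
    if not text:
        return 0.0
    symbolic = sum(1 for ch in text if 0x2000 <= ord(ch) <= 0x2FFF)
    return 100.0 * symbolic / len(text)


def binary_percentage(text: str) -> float:
    """Percentage of characters treated as binary (assumed definition)."""
    if not text:
        return 0.0
    allowed_controls = {"\t", "\n", "\r"}
    binary = sum(
        1
        for ch in text
        if ch == "\ufffd" or (ord(ch) < 0x20 and ch not in allowed_controls)
    )
    return 100.0 * binary / len(text)


def average_word_length(text: str) -> float:
    """Mean length of whitespace-separated words (assumed tokenization)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)
```

For example, `symbolic_percentage("ab\u2022\u2022")` counts the two bullet characters (U+2022) as symbolic, which is half of the four characters in the string.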

Properties

Name | Default Value | Description
--- | --- | ---
IDOL License Service | | An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.
Maximum Binary Content Percentage | 10 | The maximum percentage of the content that can be binary.
Maximum Symbolic Content Percentage | 10 | The maximum percentage of the content that can be symbolic.
Minimum Average Word Length | 3 | The minimum number of characters that is acceptable for the average word length.
Maximum Average Word Length | 9 | The maximum number of characters that is acceptable for the average word length.
Average Word Length content segment size (chars) | 1000 | The number of characters in the document content to use for checking the average word length.

Relationships

Name | Description
--- | ---
good | FlowFiles for which the document content is considered acceptable, based on the properties that you set.
bad | FlowFiles for which the document content is not acceptable, based on the properties that you set.
failure | FlowFiles that had an invalid or unknown format.
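
As a rough sketch of how the properties and relationships fit together, the following Python maps already-measured values to a relationship name, using the default property values listed under Properties. The class and function names are hypothetical. Treating the limits as inclusive, and taking the word-length segment from the start of the content, are assumptions. The `failure` relationship is not modelled, because the source does not describe how an invalid or unknown format is detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterSettings:
    """Hypothetical container for the processor properties and their defaults."""

    max_binary_percentage: float = 10.0
    max_symbolic_percentage: float = 10.0
    min_average_word_length: float = 3.0
    max_average_word_length: float = 9.0
    segment_size_chars: int = 1000


def word_length_segment(content: str, settings: FilterSettings = FilterSettings()) -> str:
    """Return the part of the content used for the word-length check.

    Assumption: the segment is the leading `segment_size_chars` characters.
    """
    return content[: settings.segment_size_chars]


def choose_relationship(
    binary_pct: float,
    symbolic_pct: float,
    avg_word_len: float,
    settings: FilterSettings = FilterSettings(),
) -> str:
    """Return "good" if every measure is within its limit, otherwise "bad"."""
    acceptable = (
        binary_pct <= settings.max_binary_percentage
        and symbolic_pct <= settings.max_symbolic_percentage
        and settings.min_average_word_length
        <= avg_word_len
        <= settings.max_average_word_length
    )
    return "good" if acceptable else "bad"
```

With the defaults, a document measured at 2% binary content, 0% symbolic content, and an average word length of 5 would go to `good`, while the same document with 25% binary content would go to `bad`.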