Introduction

The following image shows the completed ingestion pipeline, which is described in the sections below.

The pipeline includes the following steps:

  • File System Connector. The File System Connector retrieves data from a local or network file system. The connector produces a NiFi FlowFile to represent each file that is retrieved from the file system.
  • Extraction. File Content Extraction extracts files from containers. For example, if a FlowFile represents a zip archive, File Content Extraction extracts the contents of the archive.
  • Filtering. Filtering extracts the text from a file and adds it to the document content. The text can then be indexed, which means that Knowledge Discovery does not need to process the data in its original format.
  • Field Standardization. Field standardization modifies documents so that they have a consistent structure and consistent field names. You can use field standardization so that documents that originate from different connectors use the same fields to store the same type of information.
  • Remove Document Part. This step removes the binary content or file reference from a FlowFile. Removing file references allows NiFi to delete temporary files.
  • Indexing. Documents are indexed into a Content component.
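
To make the order of the steps concrete, the following is a minimal Python sketch of the same flow. It is a toy model only, not the NiFi or Knowledge Discovery API: the Document class, the function names, and the FIELD_ALIASES mapping are all hypothetical, and the real processors handle many more container and file formats than the zip and plain-text cases assumed here.

```python
import io
import zipfile
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Hypothetical stand-in for a FlowFile and its document metadata."""
    name: str
    binary: Optional[bytes] = None
    content: str = ""
    fields: dict = field(default_factory=dict)


# Hypothetical aliases: different connectors may name the same data differently.
FIELD_ALIASES = {"creator": "AUTHOR", "author": "AUTHOR", "modified": "DATE"}


def extract(doc):
    """Extraction: expand a zip container into one document per member file."""
    if doc.binary is None or not zipfile.is_zipfile(io.BytesIO(doc.binary)):
        return [doc]  # not a container, pass through unchanged
    with zipfile.ZipFile(io.BytesIO(doc.binary)) as archive:
        return [
            Document(name=n, binary=archive.read(n), fields=dict(doc.fields))
            for n in archive.namelist()
            if not n.endswith("/")  # skip directory entries
        ]


def filter_text(doc):
    """Filtering: add the file's text to the document content.
    Assumes UTF-8 text; real filtering supports many file formats."""
    if doc.binary is not None:
        doc.content = doc.binary.decode("utf-8", errors="replace")


def standardize(doc):
    """Field standardization: rename known aliases to one consistent name."""
    doc.fields = {FIELD_ALIASES.get(k.lower(), k): v for k, v in doc.fields.items()}


def remove_document_part(doc):
    """Remove Document Part: drop the binary once the text has been extracted."""
    doc.binary = None


def run_pipeline(docs, index):
    """Run each retrieved document through the steps in order, then index it."""
    for doc in docs:
        for child in extract(doc):
            filter_text(child)
            standardize(child)
            remove_document_part(child)
            index.append(child)  # Indexing: a list stands in for the index
```

In this sketch, a single zip archive that contains two text files results in two indexed documents, each with text content, standardized field names, and no binary part.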