External File Storage
In NiFi Ingest, FlowFiles that represent Knowledge Discovery documents store data in several ways: as FlowFile attributes, as XML metadata, as plain-text content, and in associated files. The files associated with a document might be included in the FlowFile as:
- a path to a file stored on a local or network file system (a contentfilename part)
- embedded binary content (a contentfile part)
- a reference to a file stored by an external storage provider such as Amazon S3, Azure Blob Storage, or Google Cloud Storage (an externalfile part)
Each of these options has its own advantages:
- Adding a file to a FlowFile by including a local file path usually provides the best performance. By default, Knowledge Discovery connectors generate FlowFiles with files included as paths.
- Using one of the other options might be necessary if you are running a cluster, to ensure that the file content is accessible from all of the NiFi instances in the cluster.
- Using external file storage might be advantageous when you are running NiFi using cloud compute and want to use cloud storage as well.
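The guidance in this list can be summarized as a rough decision helper. The following Python sketch is only an illustration of the reasoning above and is not part of NiFi Ingest: the PartType enum, the candidate_part_types function, and its two parameters are hypothetical names, and a real deployment might weigh other factors (for example, a network path that every node can reach).

```python
from enum import Enum
from typing import List


class PartType(Enum):
    """The three ways a file can be attached to a FlowFile (names from the text)."""
    CONTENTFILENAME = "contentfilename"  # path on a local or network file system
    CONTENTFILE = "contentfile"          # embedded binary content
    EXTERNALFILE = "externalfile"        # reference to an external storage provider


def candidate_part_types(clustered: bool, prefer_cloud_storage: bool) -> List[PartType]:
    """Return part types worth considering, most likely choice first.

    Hypothetical heuristic: a local path is the default and usually the
    fastest; a cluster or a cloud deployment suggests one of the other options.
    """
    if not clustered and not prefer_cloud_storage:
        return [PartType.CONTENTFILENAME]
    candidates = [PartType.CONTENTFILE, PartType.EXTERNALFILE]
    if prefer_cloud_storage:
        # Cloud compute with cloud storage: consider external storage first.
        candidates.reverse()
    return candidates
```

For example, candidate_part_types(clustered=True, prefer_cloud_storage=False) lists contentfile before externalfile, while a single NiFi instance with no cloud storage keeps the default contentfilename.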
NiFi Ingest provides a processor, named ConvertDocumentFile, so that you can convert from one type of storage to another. If you want to use an external storage provider, you must first set up an IdolFlowFileServiceImpl controller service, which manages the connection to the external storage.
Performance Considerations
Using external file storage can have a significant performance impact, especially if your dataflow contains a series of processors that each require the file content. Each processor would need to make a request to the external storage provider to download the file.
To mitigate the performance impact in this case, your dataflow could use a ConvertDocumentFile processor to download files from external storage and replace externalfile parts of FlowFiles with contentfilename parts. You could then route the FlowFiles through your other processors, before using another ConvertDocumentFile to put the file back into the external storage system.
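The effect of this pattern on the number of remote requests can be shown with a small, self-contained Python model. This is a conceptual sketch and does not use any NiFi or NiFi Ingest API: FakeStore, run_direct, and run_with_local_copy are hypothetical names, and the "processors" are plain functions that take the file's bytes.

```python
import os
import tempfile


class FakeStore:
    """Hypothetical stand-in for an external storage provider; counts requests."""

    def __init__(self):
        self.objects = {}
        self.downloads = 0
        self.uploads = 0

    def download(self, key):
        self.downloads += 1
        return self.objects[key]

    def upload(self, key, data):
        self.uploads += 1
        self.objects[key] = data


def run_direct(store, key, processors):
    """Every processor fetches the file itself (externalfile part throughout)."""
    return [process(store.download(key)) for process in processors]


def run_with_local_copy(store, key, processors):
    """Download once, let every processor read a local path, then upload once."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "document.bin")
        with open(path, "wb") as f:
            f.write(store.download(key))      # the only download
        results = []
        for process in processors:
            with open(path, "rb") as f:       # local read, no remote request
                results.append(process(f.read()))
        with open(path, "rb") as f:
            store.upload(key, f.read())       # put the file back in storage
    return results
```

In this model, running three processors with run_direct makes three download requests, whereas run_with_local_copy makes one download and one upload regardless of how many processors need the content.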