Ingest Data in CSV Format
Many systems export data in CSV format. This section describes how to ingest comma-separated data into IDOL using NiFi Ingest.
The steps in this section assume that:
- You have one or more files each of which contains comma-separated data that should be parsed into one or more IDOL documents.
- The data is not necessarily in IDOL document format (there might not be a reference or content field).
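As an illustration of the kind of input these assumptions describe, the following Python sketch previews a small CSV export before you configure the flow. The sample data and the helper name preview_csv are hypothetical and are not part of NiFi Ingest. The point is only to check whether the first line holds field names and whether any column would make a suitable reference or content field.

```python
import csv
import io
import itertools

# Hypothetical export: it has no obvious reference or content column.
SAMPLE = "sku,title,price\nA1,Kettle,24.99\nB2,Toaster,31.50\n"


def preview_csv(text, max_rows=5):
    """Return the first line and up to max_rows following rows of CSV text."""
    reader = csv.reader(io.StringIO(text))
    first = next(reader, [])  # candidate header row (empty list if no data)
    rows = list(itertools.islice(reader, max_rows))
    return first, rows


first, rows = preview_csv(SAMPLE)
print("first line:", first)
print("next rows:", rows)
```

If the first line looks like a list of field names, the header row setting described in the procedure below applies. Otherwise you would supply the field names yourself.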
To ingest data in CSV format
- Add a GetFileSystem processor to your data flow to retrieve the CSV file(s).
  - Configure the location of your CSV files by setting the property "Directory Paths". If the folder contains other files, you could also set "File name pattern" to *.csv.
  - If you are running a NiFi cluster, set the dynamic property adv:FlowFileEmbedFiles to TRUE. For more information about this property, see Advanced Connector Properties.
- Add a ConvertCSVToDocuments processor to the data flow.
- Connect the "success" relationship of the GetFileSystem processor to the ConvertCSVToDocuments processor.
- Configure the ConvertCSVToDocuments processor.
  - Right-click the processor and click Configure. The Configure Processor dialog box opens.
  - Click the Properties tab.
  - Set the following properties:
    - Use CSV Header Row: If the first line in the CSV file contains a list of field names, set this to TRUE. If the first line in the CSV file is the first line of data, set this to FALSE.
    - CSV Field Names: To override the list of field names in the CSV file, or if the CSV file does not contain any field names, enter a comma-separated list of field names to use.
    - Reference Field: The name of the field (in the CSV file header row or CSV Field Names property) to use as the document reference. If there isn't a suitable field, you can set this property to an empty string and set Base Reference instead.
    - Content Field: The name of the field (in the CSV file header row or CSV Field Names property) to use as the document content. If you don't want to populate the document content, set this property to an empty string.
    - Base Reference: If the CSV file doesn't contain a suitable reference field, you can specify a base reference. The documents extracted from the CSV file are given references in the form BaseReference:N, where N is the line number in the source CSV file.
  - Click APPLY.
- Connect the "extracted" relationship of the ConvertCSVToDocuments processor to your ingestion pipeline.
TIP: After they are processed, the original FlowFiles that were routed to the ConvertCSVToDocuments processor are routed to the "success" relationship.
To avoid indexing documents representing the original CSV files, you could auto-terminate this relationship. However, if you are using a document registry service to ensure that documents are indexed in the correct sequence, route the "success" relationship to an UnregisterDocument processor. For more information about the document registry service, see Index Documents in the Correct Sequence.
- Start the GetFileSystem and ConvertCSVToDocuments processors.
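To make the effect of the Reference Field, Content Field, and Base Reference properties more concrete, the following Python sketch mimics how CSV rows might map to documents. It is not the ConvertCSVToDocuments implementation: the function rows_to_documents, its parameter names, and the sample data are hypothetical. It also assumes that N is a 1-based line number that counts the header row, which this section does not specify.

```python
import csv
import io


def rows_to_documents(csv_text, use_header_row=True, field_names=None,
                      reference_field="", content_field="", base_reference=""):
    """Illustrative mapping of CSV rows to document-like dictionaries.

    Parameter names echo the processor properties for readability only.
    """
    reader = csv.reader(io.StringIO(csv_text))
    names = None
    if use_header_row:
        names = next(reader, None)  # consume the header line
        if names is None:
            return []  # empty input
    if field_names:
        # Assumption: configured names override names from the header row.
        names = [name.strip() for name in field_names.split(",")]
    if names is None:
        raise ValueError("no field names: use a header row or set field_names")

    documents = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        fields = dict(zip(names, row))
        if reference_field:
            reference = fields[reference_field]
        else:
            # Assumption: N is the 1-based line number, header included.
            reference = f"{base_reference}:{reader.line_num}"
        content = fields.get(content_field, "") if content_field else ""
        documents.append(
            {"reference": reference, "content": content, "fields": fields}
        )
    return documents


sample = "sku,title,description\nA1,Kettle,Boils water\nB2,Toaster,Toasts bread\n"

# A suitable reference field exists, so it is used directly.
by_field = rows_to_documents(sample, reference_field="sku",
                             content_field="description")
print([doc["reference"] for doc in by_field])

# No suitable reference field: fall back to a base reference.
by_base = rows_to_documents(sample, base_reference="products")
print([doc["reference"] for doc in by_base])
```

In the second call, leaving the reference and content fields empty mirrors setting those properties to an empty string and relying on Base Reference instead.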