NiFi Ingest
NiFi Ingest is a set of components for data retrieval and enrichment that run within an open-source framework called Apache NiFi. NiFi Ingest provides a new way to ingest data into Knowledge Discovery.
Ingest Data into Knowledge Discovery
Knowledge Discovery is a platform that helps you get the most benefit from large quantities of information. Before you can start using Knowledge Discovery, you need to index your data. Your organization is likely to have data in many different formats, distributed across many different kinds of repository. The process of extracting information from repositories and preparing it for indexing into the text index is called ingestion.
Knowledge Discovery Connectors connect to repositories and retrieve content. There are connectors for over 150 types of repository, including:
- Local and network file systems.
- Web sites and social media feeds.
- Document and content management systems such as Microsoft SharePoint.
- E-mail servers such as Microsoft Exchange.
- Database servers such as Microsoft SQL Server, Oracle, and MySQL.
- Cloud services such as OneDrive and Google Drive.
After connectors have retrieved information from a repository, but before it is indexed, the information is usually processed and enriched.
Typically, files that contain text are filtered by File Content Extraction, which extracts the text so that Knowledge Discovery does not need to process information in its native format. Media files, such as images or audio recordings, can be sent to a Media Server, which can perform media analysis such as optical character recognition or speech-to-text. The information can be standardized so that content from different repositories is stored in the same document fields and can be used more effectively. You can also discard irrelevant content so that it does not pollute the index.
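To make the standardization and discard steps concrete, the following is a minimal conceptual sketch in Python. It is not part of NiFi Ingest or Knowledge Discovery and does not use their APIs. The field names, the FIELD_ALIASES mapping, the functions standardize and prepare, and the min_length threshold are all hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical mapping from repository-specific field names to shared ones.
# A real deployment would define its own field names.
FIELD_ALIASES = {
    "subject": "title",
    "filename": "title",
    "sender": "author",
    "created_by": "author",
}


def standardize(document: dict) -> dict:
    """Rename repository-specific fields to a shared set of field names."""
    result = {}
    for name, value in document.items():
        key = name.lower()
        # Fields without an alias keep their (lowercased) name.
        result[FIELD_ALIASES.get(key, key)] = value
    return result


def prepare(document: dict, min_length: int = 20) -> Optional[dict]:
    """Standardize a document, or return None to discard it.

    The discard rule here (too little extracted text) is an assumed
    example of "irrelevant content"; real rules depend on the use case.
    """
    doc = standardize(document)
    text = doc.get("content", "").strip()
    if len(text) < min_length:
        return None
    return doc


email = {
    "Subject": "Q3 report",
    "Sender": "ana@example.com",
    "content": "Quarterly figures are attached for review.",
}
empty_file = {"filename": "empty.txt", "content": "  "}

prepared_email = prepare(email)
prepared_empty = prepare(empty_file)
```

In this sketch, the e-mail message and a file from a file system would both end up with a title field, regardless of what the source repository called it, and the file with no usable text is dropped before indexing.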
Use NiFi Ingest
NiFi Ingest helps you use Apache NiFi to build a custom ingestion pipeline for Knowledge Discovery.
NiFi Ingest, combined with the Apache NiFi framework, provides features that:
- Improve visibility. Apache NiFi provides a graphical interface that you use to build your ingestion pipeline. When you start ingesting documents, you can use the same interface to monitor processing speed and queue sizes, and identify bottlenecks and ingestion errors. The Apache NiFi framework also has built-in support for tracking documents through the ingestion process.
- Improve customization. You can create an ingestion pipeline that is customized to your use case, with less need for custom Lua scripts.
- Improve control. You can stop parts of the ingestion pipeline and make changes, without stopping the entire system.
- Improve performance and reliability. Apache NiFi can scale to process extremely large volumes of data. You can distribute processing across multiple NiFi instances.
NiFi Ingest provides processors (connectors) that connect to your data repositories. The NiFi Ingest distribution also provides processors that enrich your data. For example, the distribution includes processors for File Content Extraction and a processor for sending files to a Media Server for further analysis.
After retrieving and enriching your data, NiFi Ingest can index the resulting documents into your text index.