Knowledge Discovery Ingest

Knowledge Discovery Ingest refers to the tools that allow you to retrieve data from your external repositories, process it, and index it into Knowledge Discovery.

The ingest process includes:

  • Connectors. Each supported repository has a connector, which uses the appropriate repository API to view and extract your data. Connectors are specific to a repository, not to a file type. For example, the file system connector retrieves data from a file system and the Web connector retrieves data from web sites, but each connector retrieves all file types from its repository.

  • File Content Extraction. File Content Extraction processes documents in their native format. It can detect the formats of different files, so that you can route them for appropriate processing. It also extracts text from many file formats, so that you can index the text into Knowledge Discovery.

  • Additional Processing. You can include many additional processes, such as Eduction or media analysis, to enrich your data and extract more useful content from different files.

  • Workflow Management. Knowledge Discovery manages the documents going through the system, routes documents to different places according to rules that you can configure, and manages the process of indexing into your Content component indexes.
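
As a rough illustration of how these four stages fit together, the following Python sketch models them as plain functions. It is a conceptual model only: the Document class, the function names, the sample data, and the routing rule are all hypothetical and do not correspond to any Knowledge Discovery API.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document moving through the ingest chain (hypothetical model)."""
    reference: str
    raw: bytes
    detected_format: str = "unknown"
    text: str = ""
    metadata: dict = field(default_factory=dict)


def connector_fetch():
    """Stand-in for a connector: yields every file in a repository, whatever its type."""
    yield Document("repo/report.txt", b"Quarterly results for ACME.")
    yield Document("repo/photo.jpg", b"\xff\xd8\xff\xe0")


def extract_content(doc):
    """Stand-in for File Content Extraction: detect the format, then extract text."""
    if doc.raw.startswith(b"\xff\xd8\xff"):  # JPEG files start with these bytes
        doc.detected_format = "jpeg"
    else:
        doc.detected_format = "text"
        doc.text = doc.raw.decode("utf-8")
    return doc


def enrich(doc):
    """Stand-in for additional processing, applied only to documents with text."""
    if doc.text:
        doc.metadata["word_count"] = len(doc.text.split())
    return doc


def route(doc):
    """Stand-in for workflow rules: choose a destination from the detected format."""
    return "media_analysis" if doc.detected_format == "jpeg" else "text_index"


def run_ingest():
    """Run every fetched document through the chain and group it by destination."""
    destinations = {}
    for doc in connector_fetch():
        doc = enrich(extract_content(doc))
        destinations.setdefault(route(doc), []).append(doc)
    return destinations


if __name__ == "__main__":
    for name, docs in run_ingest().items():
        print(name, [d.reference for d in docs])
```

The point of the sketch is the ordering: the connector is unaware of file types, format detection happens before any routing decision, and routing rules act on what extraction discovered.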

Knowledge Discovery provides two different ways to manage ingestion:

  • Knowledge Discovery NiFi Ingest, based on Apache NiFi, uses a graphical user interface to help you easily set up and configure your ingest chain. Connectors, File Content Extraction, and many additional enrichment and processing components are available as NiFi processors, so you can run your whole ingest process in one place.

  • Connector Framework Server (CFS) is an older system based on ACI servers. With CFS, the connectors are available as additional ACI servers. File Content Extraction and Eduction are embedded in the CFS component, but you must install additional ACI servers to include other processing, such as media analysis.

NiFi Ingest

OpenText recommends that you use NiFi Ingest to manage your ingest workflow. NiFi is generally faster than CFS, easier to configure and manage, and more extensible.

To use NiFi Ingest, you install Apache NiFi and the NiFi Ingest package. You can then use the NiFi user interface in a web browser to set up a workflow. You can create complex ingest chains to manage all your different data formats and repositories, and index the resulting documents into Knowledge Discovery.

You can include as many Knowledge Discovery processors as you need in your ingest process. Many are available as part of the standard NiFi Ingest installation, and others are available as downloads with the associated Knowledge Discovery components, which you add to your NiFi installation directory.

For a small system, you might need to install only a single NiFi Ingest instance. For a larger system, or where you need to ensure reliability, you can create a NiFi cluster, in which multiple NiFi Ingest instances run the same workflow. You can use this option to:

  • provide failover if one of the instances becomes inaccessible.

  • improve throughput in large systems, by running the same ingest process on multiple servers to increase the available resources.
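
The two benefits listed above can be pictured with a toy model. The following Python sketch is not how NiFi distributes work in a cluster; it is a simplified, hypothetical round-robin model (the distribute function and the node names are invented for illustration) that shows work spreading across instances and moving to the remaining instances when one is inaccessible.

```python
from itertools import cycle


def distribute(documents, nodes, unavailable=frozenset()):
    """Assign each document to the next available node, round-robin.

    Toy model only: `nodes` are hypothetical instance names, and
    `unavailable` is the set of instances that cannot be reached.
    """
    available = [node for node in nodes if node not in unavailable]
    if not available:
        raise RuntimeError("no ingest instance is available")
    assignment = {node: [] for node in available}
    for doc, node in zip(documents, cycle(available)):
        assignment[node].append(doc)
    return assignment


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(6)]
    nodes = ["nifi-1", "nifi-2", "nifi-3"]
    # Throughput: the same workload is spread over three instances.
    print(distribute(docs, nodes))
    # Failover: with nifi-2 inaccessible, the other two absorb its share.
    print(distribute(docs, nodes, unavailable={"nifi-2"}))
```

In this model, every document is still processed when an instance drops out, at the cost of more work for each remaining instance.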

For more information, refer to the NiFi Ingest Help.

Connector Framework Server

The Connector Framework Server (CFS) is an ACI server, which you configure with a configuration file.

When you use CFS, the connectors are also separate ACI server components, which you must configure separately. Similarly, if you want to use media analysis, you must install Media Server separately.

The number of connectors and CFS components that you need depends on the number of repositories you want to collect data from, and the amount of processing that you want to do. For example:

  • To index content from a few small repositories, where the repositories do not change very often, you might set up all your connectors to send content to the same CFS, which indexes it into the Content component.

  • To index content from repositories that change regularly (generating a large number of new files and updates), or very large repositories, you might use a CFS for each individual connector.
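
The sizing guidance above can be expressed as a simple planning rule. The following Python sketch is a hypothetical helper, not part of any Knowledge Discovery tool: the Repository fields, the plan_cfs function, the instance names, and both thresholds are assumptions chosen for illustration, because the documentation does not define what counts as a large or frequently changing repository.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    """Describes one repository and therefore one connector (hypothetical model)."""
    name: str
    size_gb: float
    changes_per_day: int


def plan_cfs(repos, max_size_gb=100, max_changes_per_day=1000):
    """Map each planned CFS instance to the connectors that send it content.

    Repositories over either (assumed) threshold get a dedicated CFS;
    all remaining repositories share a single CFS.
    """
    plan = {}
    shared = []
    for repo in repos:
        busy = (repo.size_gb > max_size_gb
                or repo.changes_per_day > max_changes_per_day)
        if busy:
            plan[f"cfs-{repo.name}"] = [repo.name]
        else:
            shared.append(repo.name)
    if shared:
        plan["cfs-shared"] = shared
    return plan


if __name__ == "__main__":
    repos = [
        Repository("wiki", size_gb=5, changes_per_day=20),
        Repository("archive", size_gb=2, changes_per_day=1),
        Repository("fileshare", size_gb=800, changes_per_day=5000),
    ]
    print(plan_cfs(repos))
```

In practice you would choose the thresholds from measured ingest rates in your own environment rather than the placeholder values used here.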

For more information, see Install Knowledge Discovery, and refer to the Connector Framework Server Help, as well as the Help for the individual connectors.