Knowledge Discovery Ingest

Knowledge Discovery Ingest is a collection of tools for data retrieval and enrichment. It can prepare data for indexing into your Content component, so that you can search and analyze your information. You can also use it as the first step in any data processing task and send the information to other systems.

Knowledge Discovery Ingest includes connectors, each of which retrieves information from a specific type of repository. There are connectors for over 150 repositories, including:

  • Local and network file systems.
  • Web sites and social media feeds.
  • Microsoft 365 (Office 365) services such as OneDrive and Teams.
  • Document and content management systems such as Microsoft SharePoint.
  • E-mail servers such as Microsoft Exchange.
  • Communication tools such as Slack.
  • Database servers such as Microsoft SQL Server or MySQL.
  • Cloud services such as Amazon (AWS) S3, Dropbox, or Google Drive.

Connectors produce documents. A document represents a single item that exists within a repository, such as a file in a file system, a page from a web site, or a message from a chat application. The documents that are produced by connectors contain metadata that was extracted from the repository, such as the location of the item and an Access Control List (ACL) that describes who is permitted to view it. The ACL allows the Content component to restrict access to information, maintaining the security permissions defined in the source repository without impacting query performance, even when you have millions of documents.
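To make the idea of a document with metadata and an ACL more concrete, the following Python sketch models one in memory. It is illustrative only: the Document class, its field names, the can_view function, and the sample values are all hypothetical, and they do not represent the format that connectors or the Content component actually use.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document produced by a connector."""
    reference: str                                # location of the item in the repository
    metadata: dict = field(default_factory=dict)  # other values extracted from the repository
    acl: set = field(default_factory=set)         # users and groups permitted to view the item


def can_view(doc: Document, user: str, groups: set) -> bool:
    """Return True if the user, or one of the user's groups, appears in the ACL."""
    return user in doc.acl or bool(groups & doc.acl)


# Sample values invented for this illustration.
doc = Document(
    reference="//fileserver/finance/q3-report.docx",
    metadata={"title": "Q3 Report"},
    acl={"alice", "finance-team"},
)

visible_to_bob = can_view(doc, "bob", {"engineering"})       # not in the ACL
visible_to_carol = can_view(doc, "carol", {"finance-team"})  # allowed through a group
```

Because the permissions travel with the document, a check like this needs no call back to the source repository. How the Content component evaluates ACLs at query time is not covered by this sketch.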

Each document produced by a connector can, optionally, have an associated file. For example, a document produced by a File System Connector to represent a file has an associated copy of the file. A document produced by a Web Connector to represent a web page can have an associated file containing the HTML source of the page. These binary files are not indexed into your Content component, but the information they contain can be used by other components to enrich the document.
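As a rough illustration of how an associated file can enrich a document, the following Python sketch takes the HTML source of a page and adds its visible text to the document as a new field. It uses only the Python standard library. The enrich_with_text function, the "content" field name, and the sample page are assumptions made for this example, and the sketch is not how Knowledge Discovery components process files.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def enrich_with_text(document: dict, html_source: bytes) -> dict:
    """Return a copy of the document with the page text added (hypothetical field name)."""
    parser = TextCollector()
    parser.feed(html_source.decode("utf-8", errors="replace"))
    parser.close()
    enriched = dict(document)
    enriched["content"] = " ".join(parser.parts)
    return enriched


# The associated file: the HTML source of a page, invented for this illustration.
page = (b"<html><head><title>Pricing</title><style>p{color:red}</style></head>"
        b"<body><p>Plans start at $10.</p></body></html>")
doc = enrich_with_text({"reference": "https://example.com/pricing"}, page)
```

The binary file itself is left out of the enriched document. Only the text derived from it is added, which mirrors the distinction between the associated file and the indexed document.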

Connectors send documents to other components for enrichment and further processing. For example, you can:

  • use File Content Extraction to extract subfiles from containers, such as ZIP archives. Subfiles are files that are contained within other files. For example, an e-mail message might contain attachments, or a Microsoft Word document might contain an embedded image or spreadsheet.
  • use File Content Extraction to filter the text from files, so that you can access the text without needing to process a file in its native format.
  • use Rich Media components to analyze media files, adding information from Optical Character Recognition, Speech-To-Text, or Face Recognition to a document.
  • use Eduction to locate or redact Personally Identifiable Information (PII).
  • run many other processing tasks, including running a custom Lua or Python script to process the data however you wish.
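Two of the tasks above, extracting subfiles from a container and redacting PII, can be sketched in a few lines of Python. The sketch uses the standard zipfile and re modules as stand-ins. It does not use File Content Extraction or Eduction, and the function names, the e-mail pattern, and the sample archive are assumptions made for this example. A real deployment would rely on the product components, which handle many more container formats and entity types.

```python
import io
import re
import zipfile

# Deliberately simple pattern for illustration; real PII detection is far broader.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_subfiles(container: bytes) -> dict:
    """Return {name: bytes} for each file stored inside a ZIP container."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


def redact_emails(text: str) -> str:
    """Replace anything that looks like an e-mail address with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED]", text)


# Build a small ZIP archive in memory to stand in for a file retrieved by a connector.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", "Contact jane.doe@example.com for access.")
    archive.writestr("readme.txt", "No personal data here.")

# Step 1: extract the subfiles. Step 2: redact each one.
subfiles = extract_subfiles(buffer.getvalue())
redacted = {name: redact_emails(data.decode("utf-8")) for name, data in subfiles.items()}
```

Each step takes the output of the previous one, which is the same chaining pattern that applies when documents move from a connector through successive enrichment components.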

OpenText recommends that you deploy your connectors and other ingest components in Apache NiFi. Apache NiFi is an open-source framework that can help you visualize and configure the flow of data in a system. Many Knowledge Discovery features can be deployed in Apache NiFi, including connectors, File Content Extraction, and rich media analysis. Using Apache NiFi makes all of these components easier to deploy, configure, and manage.

Alternatively, you can deploy connectors as standalone ACI servers. If you deploy connectors in this way, you might also need to deploy an Apache NiFi system or a Connector Framework Server (CFS) to manage the enrichment and indexing tasks. If you use CFS and want to run media analysis, you also need to install a standalone Media Server.

For more information about the Knowledge Discovery platform, refer to the Getting Started Guide.