Key Knowledge Discovery Concepts
The Knowledge Discovery product family contains many different components, and a large range of features and functions. This section provides some information about the basic concepts in Knowledge Discovery.
Unstructured Data
Digital data generally falls into two forms. Structured data is well-organized and easily searchable by computers, such as the contents of relational databases. Unstructured data is the more common human-readable information, such as documents, video, audio, and image files. Knowledge Discovery can manage both forms of data, but its greatest strength is its ability to extract meaning and useful insights from unstructured data.
Knowledge Discovery Text
Knowledge Discovery Text is the part of Knowledge Discovery that provides search, analytics, and data enrichment for unstructured text sources. Knowledge Discovery Text includes the main text index, which allows you to search your text-based data. It also includes data enrichment such as categorization, data clustering, and entity extraction (Eduction), which finds useful snippets of information such as names and addresses.
Knowledge Discovery Rich Media
Knowledge Discovery Rich Media is the part of Knowledge Discovery that provides analytics and data enrichment for multimedia sources. Rich Media support is provided by Knowledge Discovery Media Server.
Media Server analyzes video files and streams, images, and audio to extract information about their content. It can run analysis operations such as face recognition, number plate recognition, speech-to-text, and speaker identification.
File Content Extraction
File Content Extraction is the part of Knowledge Discovery that processes documents in their native format to extract the text content and to export the documents for easy viewing. It is embedded into other Knowledge Discovery components, such as NiFi Ingest and the View component.
In NiFi Ingest, File Content Extraction performs format detection, which reads the file to detect the correct format, so that you can process the file correctly. For text-based file formats, File Content Extraction then finds and extracts the text. It can also extract files from containers (such as zips, files with embedded images or imported subfiles, or emails with attachments), and process the subfiles.
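The following is a minimal sketch of the two ideas in this paragraph: detecting a format from the file content rather than its name, and processing the subfiles of a container. It is not how File Content Extraction is implemented. The signature table, the function names `detect_format` and `process`, and the restriction to zip containers are all assumptions made for illustration.

```python
import io
import zipfile

# Leading-byte signatures for a few common formats (illustrative subset only).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def detect_format(data: bytes) -> str:
    """Guess a format from the leading bytes, ignoring any file extension."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    # No known signature: treat the data as text if it decodes as UTF-8.
    try:
        data.decode("utf-8")
        return "text"
    except UnicodeDecodeError:
        return "unknown"


def process(name: str, data: bytes) -> list:
    """Return (name, format) for a file and, for zip containers, each subfile."""
    fmt = detect_format(data)
    results = [(name, fmt)]
    if fmt == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    # Recurse so that nested containers are expanded as well.
                    subfile_name = f"{name}/{info.filename}"
                    results.extend(process(subfile_name, archive.read(info)))
    return results
```

Because detection looks only at the content, a zip archive saved with a misleading extension is still recognized as a container and its subfiles are processed individually.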
In View, File Content Extraction renders an original document into HTML format, which you can use for easy viewing in a Web browser, for example to provide a document preview in your search application.
File Content Extraction is also available as the following SDKs, which allow you to embed File Content Extraction functionality in your own custom applications:
- The Filter SDK detects and extracts text content from a variety of files.
- The Export SDK renders original document formats into HTML or XML for easy Web viewing.
- The Viewing SDK renders files for viewing in Windows applications.
- The Panopticon SDK allows you to decrypt files that have been protected with Microsoft Azure Rights Management System (RMS).
Knowledge Discovery Ingest
Knowledge Discovery Ingest is the part of Knowledge Discovery that retrieves your data from your various repositories.
You retrieve data from your repositories (such as databases, file systems, Web sites, and email) by using Knowledge Discovery Connectors. Each connector contacts a repository and retrieves data from it.
Ingest uses an embedded version of File Content Extraction to detect the file format, and process it accordingly. For example, it might extract text from text-based files, perform Optical Character Recognition (OCR) on images to find text, or use speech-to-text on video to convert audio to text.
In a wider Knowledge Discovery setup, the ingest components can send the text data to your data index. You can also use ingest without a text index to extract and process content from your repositories.
Knowledge Discovery Ingest and its connectors are available in two formats:
- NiFi Ingest is a newer format based on Apache NiFi, where the connectors and ingest component are all available in one place, accessible with a Web user interface. You can use the interface to create complex workflows and manage all your connectors, data enrichment, and document flows.
- Connector Framework Server (CFS) is the older, ACI server-based ingest component (see ACI Servers). In this case, all the connectors are also ACI servers, and it performs much of its data enrichment by connecting to other ACI servers, such as Media Server and Category.
In addition to retrieval, many connectors have additional features that allow you to:
- View and browse all the documents in a particular repository.
- Access the repository to view the original documents.
- Retrieve documents and push them to a different repository.
ACI Servers
ACI servers are Knowledge Discovery components that use a common interface, the ACI API. ACI servers use a Web API framework: you send HTTP requests (known as actions), and the server returns XML responses.
ACI servers share a lot of common configuration concepts, and many standard actions.
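As a rough sketch of this request and response pattern, the following Python builds the URL for an action and reads the status out of an XML response. The host, port, URL layout, the `GetStatus` action name, and the shape of the sample response are assumptions made for illustration; refer to the reference for your component for the actions and response elements that it actually supports. The helper names `build_action_url` and `parse_response` are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_action_url(host: str, port: int, action: str, **params: str) -> str:
    """Build the HTTP request URL for an action and its parameters."""
    # Assumed layout: http://host:port/action=Name&Param=Value
    query = urlencode({"action": action, **params})
    return f"http://{host}:{port}/{query}"


def parse_response(xml_text: str) -> dict:
    """Extract the action name and status from an XML response."""
    root = ET.fromstring(xml_text)
    return {
        "action": root.findtext("action"),
        "response": root.findtext("response"),
    }


# Hypothetical, simplified response used in place of a live server.
SAMPLE_RESPONSE = """<autnresponse>
  <action>GETSTATUS</action>
  <response>SUCCESS</response>
  <responsedata/>
</autnresponse>"""

url = build_action_url("localhost", 9000, "GetStatus")
status = parse_response(SAMPLE_RESPONSE)
```

In a real deployment you would send the URL with an HTTP client and pass the response body to `parse_response`. The sample string stands in for that step so that the sketch runs without a server.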
Distributed Systems
It is often advisable or useful to install a Knowledge Discovery system across multiple servers. This kind of setup is known as a distributed system.
For Knowledge Discovery Text, there are two ACI server components that manage indexing and querying over a distributed index. These are the Distributed Index Handler (DIH) and Distributed Action Handler (DAH).
You can also use other common networking tools to distribute the stateless parts of your Knowledge Discovery system, for example for load-balancing.
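The general idea of indexing and querying over a distributed index can be illustrated with a toy example. This sketch is not how DIH or DAH work internally. The classes `ChildIndex` and `DistributedIndex`, the round-robin distribution, and the word-count scoring are all assumptions chosen to keep the example short.

```python
from itertools import cycle


class ChildIndex:
    """A stand-in for one index server that holds part of the data."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text.lower()

    def query(self, term):
        """Return (doc_id, score) pairs; score is the whole-word match count."""
        term = term.lower()
        hits = []
        for doc_id, text in self.docs.items():
            score = text.split().count(term)
            if score:
                hits.append((doc_id, score))
        return hits


class DistributedIndex:
    """Spreads documents over child indexes and merges their query results."""

    def __init__(self, children):
        self.children = children
        self._next_child = cycle(children)

    def add(self, doc_id, text):
        # Indexing side: send each document to the next child in turn.
        next(self._next_child).add(doc_id, text)

    def query(self, term):
        # Query side: ask every child, then merge into one ranked list.
        merged = [hit for child in self.children for hit in child.query(term)]
        return sorted(merged, key=lambda hit: (-hit[1], hit[0]))
```

The caller works with a single `DistributedIndex` object and does not need to know which child holds a given document.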
Knowledge Discovery Security
Knowledge Discovery provides methods to secure access to your data, and communications between different users and components.
- User authentication. At the front end, users must log on before they can query Knowledge Discovery. The Community component can authenticate users against an existing directory service, such as Microsoft Active Directory.
- Document security. Many data repositories have security features that apply permissions to files, so that only authorized users can view them. Document security maintains these access restrictions in your Knowledge Discovery index, so that queries return only documents that the logged-in user has permission to view.
- Index encryption. You can encrypt your Knowledge Discovery text index on disk, to prevent unauthorized access. For more information about index encryption, refer to the Content Component Help.
- Secure communications. Knowledge Discovery supports TLS/SSL for secure communications. Knowledge Discovery also supports GSSAPI for authentication and secure communications.
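The document security behavior described in the list above can be pictured as a permission check applied to query results. The following is a minimal sketch under stated assumptions, not the product's implementation. The `Document` class, the group-based permission model, and the `visible_results` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    reference: str
    # Groups permitted to view the document, carried over from the repository.
    allowed_groups: frozenset


def visible_results(results, user_groups):
    """Keep only the documents that at least one of the user's groups can view."""
    groups = set(user_groups)
    return [doc for doc in results if groups & doc.allowed_groups]


results = [
    Document("hr/salaries.xlsx", frozenset({"hr"})),
    Document("wiki/onboarding.html", frozenset({"hr", "engineering"})),
    Document("eng/design.docx", frozenset({"engineering"})),
]

# A user who belongs only to the engineering group.
engineer_view = visible_results(results, ["engineering"])
```

Here `engineer_view` contains the onboarding page and the design document, but not the HR spreadsheet, which mirrors the restriction that applies in the source repository.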