Connector Framework Server

Connector Framework Server (CFS) processes the information that is retrieved by connectors, and then indexes the information into the Content component.

CFS reads the content and metadata from files and records that are retrieved by connectors, and writes this information to documents for indexing. When connectors send documents to CFS, the documents contain only metadata extracted from the repository, such as the location of a file or record that the connector has retrieved. CFS extracts the file-specific metadata and file content from the file and adds them to the document. This allows Content to search and extract meaning from the information contained in the repository, without needing to process the information in its native format.
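The following Python sketch illustrates the idea of this flow only; it is not CFS code. It assumes a document is a plain dictionary, and the field names location, file_size, and content, as well as the function import_document, are hypothetical. It also handles only UTF-8 text files, whereas CFS handles many file types.

```python
from pathlib import Path


def import_document(document: dict) -> dict:
    """Enrich a metadata-only document with data read from the file itself.

    'document' stands in for what a connector sends: it holds only
    repository metadata, here a hypothetical 'location' field.
    """
    path = Path(document["location"])
    enriched = dict(document)  # keep the connector's metadata unchanged
    enriched["file_size"] = path.stat().st_size  # file-specific metadata
    enriched["content"] = path.read_text(encoding="utf-8")  # file content
    return enriched
```

The original dictionary is copied rather than modified, so the metadata supplied by the connector is still available after the import step.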

CFS also provides features to manipulate and enrich documents before they are indexed. This means that you can manipulate the data that is indexed into Content, and improve the quality of the information. CFS includes customizable import tasks that you can run, and supports the Lua scripting language so that you can write your own tasks and develop custom processing rules. For example, you can manipulate the fields and field values in each document.

A single CFS can process information from any number of connectors. For example, a CFS might process files retrieved by a File System Connector, web pages retrieved by an HTTP Connector, and e-mail messages retrieved by an Exchange OData Connector.

Extract File Content, Metadata, and Subfiles

CFS uses File Content Extraction to extract meaningful information from the files or records retrieved by connectors. File Content Extraction can extract the file content, metadata, and subfiles from over 1,000 different file types.

  • File content is the main content of a file, for example the body of an e-mail message.

  • Metadata is information about a file itself, for example the sender of an e-mail message or the date and time when it was received.

  • Subfiles are files that are contained within the main file. For example, an e-mail message might contain embedded images or attachments that you want to index.
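To make the three categories concrete, the following sketch separates an e-mail message into content, metadata, and subfiles by using Python's standard email package. It only illustrates the distinction; it is not how File Content Extraction is implemented, and the function names and the sample message are assumptions made for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_message() -> bytes:
    """Build a small e-mail with one attachment to act as the retrieved file."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Quarterly report"
    msg.set_content("Please find the report attached.")
    msg.add_attachment(
        b"col1,col2\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="report.csv",
    )
    return msg.as_bytes()


def extract_parts(raw: bytes) -> dict:
    """Split a raw e-mail into file content, metadata, and subfile names."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        # File content: the main text of the message.
        "content": body.get_content().strip() if body is not None else "",
        # Metadata: information about the message itself.
        "metadata": {"from": str(msg["From"]), "subject": str(msg["Subject"])},
        # Subfiles: attachments contained within the message.
        "subfiles": [part.get_filename() for part in msg.iter_attachments()],
    }
```

Calling extract_parts(build_sample_message()) returns the message body as the content, the sender and subject as metadata, and report.csv as the only subfile.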

Standardize Fields

CFS can run Field Standardization, which renames document fields so that they follow a standard naming scheme. You can use field standardization so that documents indexed into the Content component from different connectors use the same fields to store the same type of information. Your CFS installation includes a file named dictionary.xml, which lists the fields renamed during field standardization, and the standardized names.
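As a rough illustration of what field standardization achieves, the following sketch renames fields according to a lookup table. The table, the field names, and the function standardize_fields are hypothetical; the actual rules are defined in dictionary.xml, whose format is not reproduced here.

```python
# Hypothetical mapping from connector-specific field names to standard names.
# In CFS the real rules come from dictionary.xml; this table is illustrative.
STANDARD_NAMES = {
    "FILE_AUTHOR": "AUTHOR",
    "MAIL_FROM": "AUTHOR",
    "LAST_MODIFIED": "MODIFIED_DATE",
}


def standardize_fields(document: dict) -> dict:
    """Return a copy of the document with known fields renamed.

    Fields that are not listed in the mapping keep their original names.
    """
    return {STANDARD_NAMES.get(name, name): value for name, value in document.items()}
```

With this mapping, a file document that has a FILE_AUTHOR field and an e-mail document that has a MAIL_FROM field both end up with an AUTHOR field, so a single query on AUTHOR matches documents from either source.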

Manipulate and Enrich Documents

CFS provides features to manipulate and enrich the documents that are indexed into Content. This means that you can add information to the documents, and improve the quality of the information, before the documents are indexed. You can manipulate documents with Import Tasks (predefined processing tasks that are provided by CFS), and with Lua scripts.

You can use CFS Import Tasks to:

  • Add fields or manipulate existing fields.

  • Write a copy of documents to an IDX or XML file on disk. This allows you to confirm that files are being processed as expected, and identify whether additional fields or data need to be added to the documents before they are indexed.

  • Perform Eduction on document fields.

  • Perform Optical Character Recognition (OCR) on images and add the text to the document content.

  • Extract speech from audio and video, and add the transcription to the document content.

  • Extract content from HTML pages, discarding irrelevant content such as headers, sidebars, advertisements, and scripts.

  • Split long documents into multiple sections.

  • Check that documents contain content in a specific language, and discard those that contain binary or symbolic content.
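As an example of the kind of processing that these tasks perform, the following sketch splits long text into sections of limited length. It is a simplified illustration, not the CFS import task itself; the function split_into_sections and its max_chars parameter are assumptions made for the example.

```python
from typing import List


def split_into_sections(text: str, max_chars: int) -> List[str]:
    """Split text into sections of at most max_chars, breaking on whitespace.

    A single word longer than max_chars is kept whole as its own section.
    """
    sections: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # Adding the word would exceed the limit, so close this section.
            sections.append(current)
            current = word
        else:
            current = candidate
    if current:
        sections.append(current)
    return sections
```

Breaking on whitespace rather than at a fixed character offset keeps words intact, so that each section can still be matched on whole terms.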

Index Documents into the Content Component

After CFS finishes processing documents, it can automatically index them into a Content component, or send them to a Distributed Index Handler (DIH) so that they can be distributed across multiple Content components.