Manipulate and Enrich Documents

CFS provides features to manipulate and enrich documents. Enriching a document means adding information to it, or improving the quality and usefulness of the information it already contains, before the document is indexed into IDOL. For example, you can:

  • Add fields to a document.
  • Extract content from HTML pages, discarding irrelevant content such as headers, sidebars, advertisements, and scripts.
  • Split long documents into multiple sections. This can improve performance when you query IDOL, because IDOL can return a specific part of a document in response to a query.
  • Standardize field names, so that documents that originated from different repositories use the same fields to store the same type of information.
  • Perform Eduction on document fields. Eduction extracts entities from a document and writes them to specific document fields. An entity can be a word, phrase, or block of information, for example an address or telephone number.
  • Perform analysis on image and video files and add the results to the document. Examples of media analysis include optical character recognition (OCR), face detection and recognition, and object recognition. To analyze media, you must have an IDOL Media Server.
  • Extract speech from audio and video files, and add the transcription to the document content. To analyze speech, you must have an IDOL Speech Server.
  • Reject documents that do not contain content in a specific language.

The simplest way to manipulate documents is to use the import tasks that are included with CFS. For information about the tasks that are available, see Manipulate and Enrich Documents. You can configure these tasks by modifying configuration parameters in the CFS configuration file.
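
As a rough illustration of what configuring a task can look like, the fragment below is a minimal sketch. It assumes the usual CFS layout in which import tasks are listed as numbered Pre and Post parameters in an [ImportTasks] section. The script paths are hypothetical, so check the section, parameter, and task names against the reference for your CFS version.

```ini
[ImportTasks]
// Hypothetical example: run one Lua script as a "Pre" task and another
// as a "Post" task. Both script paths are placeholders.
Pre0=Lua:scripts/add_fields.lua
Post0=Lua:scripts/cleanup.lua
```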

CFS also supports Lua, an embeddable scripting language. You can write Lua scripts to manipulate documents and define custom processing rules. For information about the Lua functions that are provided with CFS, refer to the Connector Framework Server Reference.
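
The following is a minimal sketch of what such a script might look like, not a definitive implementation. It assumes the common convention that CFS calls a function named handler with the document, and that the document exposes getContent, addField, and getFieldValue methods. Confirm the exact names and signatures in the Connector Framework Server Reference. The newDocument function is a hypothetical stand-in that exists only so the sketch can run outside CFS. The WORD_COUNT field name is also made up for illustration.

```lua
-- Hypothetical stand-in for the document object that CFS would pass in.
-- It mimics only the three methods used by the handler below.
function newDocument(content)
  local doc = { content = content, fields = {} }
  function doc:getContent()
    return self.content
  end
  function doc:addField(name, value)
    self.fields[name] = self.fields[name] or {}
    table.insert(self.fields[name], value)
  end
  function doc:getFieldValue(name)
    local values = self.fields[name]
    return values and values[1] or nil
  end
  return doc
end

-- Assumed entry point: enrich the document with a word count, and
-- return false for documents that have no content at all.
function handler(document)
  local content = document:getContent() or ""
  if content == "" then
    return false
  end
  local words = 0
  for _ in content:gmatch("%S+") do
    words = words + 1
  end
  document:addField("WORD_COUNT", tostring(words))
  return true
end
```

In this sketch the return value signals whether processing should continue. That a false return causes CFS to discard the document is an assumption to verify against the reference before relying on it.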