Introduction

The documents produced by connectors and CFS contain information extracted from the source repository. In many cases you might want to add information to documents, or modify their structure, before they are indexed.

To modify documents before they are indexed, use Import Tasks and Index Tasks. These are customizable processing tasks that you can run on documents. You can use these tasks to write documents to disk, manipulate documents, reject documents, and run custom Lua scripts.

Write documents to disk

You can write documents to disk in IDX or XML format. This allows you to view the information that is being indexed, so that you can check that it is being indexed as you expect. If necessary, you can then use other import tasks to manipulate and enrich the information.

Manipulate and enrich documents

You can use import tasks to manipulate and enrich documents. For example, you can:

  • extract the meaningful content from HTML, and discard advertisements, headers, and sidebars.
  • divide document content into sections. Dividing a document can result in more relevant query results, because IDOL can return a specific part of a document in response to a query.
  • extract speech from audio and video files, and write a transcription of the speech to the document content. IDOL Server can then use the speech for retrieval, clustering, and other operations.
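
As an illustration of the sectioning idea in the list above, the following Python sketch splits document content into sections of a fixed maximum length. It is a conceptual example only: the function name split_into_sections and the max_words parameter are hypothetical, and it does not show how CFS itself divides documents.

```python
def split_into_sections(content, max_words=200):
    """Split text into consecutive sections of at most max_words words.

    Hypothetical helper for illustration; not part of CFS.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = content.split()
    # Step through the word list in chunks of max_words.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

If each section is indexed separately, a query can match and return only the section that is relevant, rather than the whole document.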

Validate and reject documents

You can reject documents that you do not want to index, for example those that do not appear to contain valid content. When a document is rejected, it is not processed further and is not indexed with your other documents. However, you can index rejected documents into an IDOL Server that has been configured to handle failed documents.
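
The behavior described above can be modeled with a short Python sketch: tasks run in order on each document, and the first task that rejects a document stops any further processing of it. This is a conceptual model under stated assumptions, not the CFS implementation. The Document class, the task functions, the WORD_COUNT field, and run_tasks are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a document passing through tasks."""
    reference: str
    content: str
    fields: dict = field(default_factory=dict)


def reject_if_empty(doc):
    """Validation task: return False to reject documents with no content."""
    return bool(doc.content.strip())


def add_word_count(doc):
    """Enrichment task: add a (hypothetical) WORD_COUNT field."""
    doc.fields["WORD_COUNT"] = str(len(doc.content.split()))
    return True


def run_tasks(docs, tasks):
    """Run tasks in order; a document is rejected as soon as a task returns False."""
    accepted, rejected = [], []
    for doc in docs:
        # all() short-circuits, so later tasks do not run after a rejection.
        if all(task(doc) for task in tasks):
            accepted.append(doc)
        else:
            rejected.append(doc)
    return accepted, rejected


docs = [Document("doc1", "Some valid text"), Document("doc2", "   ")]
accepted, rejected = run_tasks(docs, [reject_if_empty, add_word_count])
```

In this model, the accepted list corresponds to documents that continue to indexing, and the rejected list corresponds to documents that could instead be sent to a server configured to handle failed documents.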

Run a Lua script

Lua is an embedded scripting language that you can use to manipulate documents and define custom processing rules. CFS includes Lua functions for manipulating documents and running other tasks.