Import Tasks
Import Tasks are processing tasks that are performed on documents by CFS, before the documents are indexed into IDOL Server. Import Tasks enable you to manipulate and enrich the documents that are created by CFS.
CFS includes Import Tasks that meet common processing requirements. For example, there are Import Tasks to filter advertisements out of HTML files, or divide document content into shorter sections.
Write documents to disk
You can use the IdxWriter and XmlWriter tasks to write documents to disk in IDOL IDX or XML format. This allows you to view the information that is being indexed into IDOL Server, so that you can check that it is being indexed as you expect. If necessary, you can then use other import tasks or custom Lua scripts to manipulate and enrich the information.
The CsvWriter and JsonWriter tasks write documents to disk in CSV or JSON format. You can also use the SqlWriter task to write document metadata and content to disk in the form of SQL "insert" statements, so that you can insert the information from the documents into a database.
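As a rough sketch of how this might look in the CFS configuration file, the fragment below writes each document to disk in both IDX and XML format. It assumes that tasks are listed in the [ImportTasks] section as numbered Post parameters with values of the form TaskName:path. The output file names are hypothetical.

```ini
[ImportTasks]
// Assumed syntax: numbered tasks run in order, starting from 0.
// Both output paths are hypothetical examples.
Post0=IdxWriter:output/documents.idx
Post1=XmlWriter:output/documents.xml
```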
Manipulate and enrich documents
You can use import tasks to enrich documents, without needing to write custom scripts. For example, you can:
- use the HtmlExtraction import task to extract the meaningful content from HTML, and discard advertisements, headers, and sidebars.
- use the Sectioner import task to divide document content into shorter sections. Dividing a document can result in more relevant query results, because IDOL can return a specific part of a document in response to a query.
- use the Eduction import task to run Eduction.
- use the MediaServerAnalysis import task to run analysis on image, video, and audio files. You can run analysis tasks such as optical character recognition (OCR), object detection, and face recognition. You can extract speech from audio and video files, and write a transcription of the speech to the document content, which IDOL Server can use for retrieval, clustering, and other operations.
Validate and reject documents
You can use import tasks to reject documents that you do not want to index into IDOL Server. For example, the BadFilesFilter task rejects documents that do not contain valid content. When a document is rejected, it is not processed further and is not indexed into IDOL. However, you can index the document into an IDOL Server that has been configured to handle failed documents.
Run a Lua Script
The Lua task runs a Lua script. Lua is an embedded scripting language that you can use to manipulate documents and define custom processing rules. CFS includes Lua functions for manipulating documents and running other tasks. For example, you can add, modify, or remove fields and their values.
Configure Import Tasks
Import tasks are configured in the [ImportTasks] section of the CFS configuration file.
You can run import tasks before or after documents are processed by KeyView. Pre-import tasks run before KeyView processing. Post-import tasks run after KeyView processing.
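The two stages can be sketched as follows. This is a minimal example that assumes the numbered Pre and Post parameter naming and the TaskName:path value form. The script and output paths are hypothetical.

```ini
[ImportTasks]
// Runs before KeyView processing (hypothetical script path).
Pre0=Lua:scripts/before_keyview.lua
// These run after KeyView processing, in numeric order.
Post0=Lua:scripts/enrich_fields.lua
Post1=IdxWriter:output/after_enrichment.idx
```

Under these assumptions, placing a writer task last in the Post sequence should show each document as it stands after the earlier tasks have run.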