The Ingestion Process

The following chart provides a summary of the ingestion process.

Documents are submitted to Connector Framework Server (CFS) through the ingest action. If a document has metadata only, CFS runs any processing tasks that have been configured, and the document is then ready for indexing. If the document has an associated file, the ingestion process depends on the file format.

  • All files apart from IDX and XML. Most documents that have an associated file are added to the import queue so that the information in the file can be extracted by File Content Extraction or other processing tasks. For information about the import process, see The Import Process.
  • IDX files. An IDX file contains one or more documents in IDX format, so CFS attempts to parse the file. If parsing is successful, the documents are returned to the ingest queue as metadata-only documents. If parsing is not successful, CFS adds the document to the import queue so that the IDX file is processed by File Content Extraction. Parsing an IDX file is preferable to processing it with File Content Extraction: although File Content Extraction can extract the text, it cannot extract the structure information that divides the text into separate documents, content sections, and metadata fields.
  • XML files. Many systems export information in XML format and CFS has features to help you convert XML into Knowledge Discovery documents.

    CFS can run a transformation on an ingested XML file. This step is optional, but it can be useful when your XML files do not resemble Knowledge Discovery documents, or when you are processing XML from many sources and the files have different schemas. You can configure any number of transformations, and CFS runs the first transformation whose specified schema matches the ingested XML. You can also configure a default transformation that CFS runs when an XML file does not match any of your schemas. When a transformation is configured but is not successful, CFS adds the document to the import queue so that the XML is processed by File Content Extraction.

    After a successful XML transformation, or when no transformation is configured, CFS attempts to convert the XML into Knowledge Discovery documents. The conversion maps elements in the XML to Knowledge Discovery documents and document fields. If the conversion is successful, the resulting documents are returned to the ingest queue as metadata-only documents. If the conversion does not produce any Knowledge Discovery documents but the XML was transformed after matching a schema, CFS does not consider this a failure and does not index any documents. Otherwise, CFS adds the document to the import queue so that the XML is processed by File Content Extraction.

    Parsing an XML file is usually preferable to processing it with File Content Extraction, because although File Content Extraction can extract the text, it does not preserve the structure information (the XML tags are discarded).
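
The routing rules described above can be summarized as a small decision function. The following Python sketch is illustrative only: it is not part of CFS, and the names Submission, Outcome, and route, as well as the boolean flags, are hypothetical stand-ins for the results of steps that CFS performs internally (parsing, transformation, and conversion).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Hypothetical labels for where a submitted document goes next."""
    READY_FOR_INDEXING = "processing tasks run, then ready for indexing"
    INGEST_QUEUE = "resulting documents returned to the ingest queue as metadata-only"
    IMPORT_QUEUE = "added to the import queue for File Content Extraction"
    NOTHING_INDEXED = "no documents produced, not treated as a failure"


@dataclass
class Submission:
    """Hypothetical model of one submitted document and what happened to it."""
    file_format: Optional[str] = None   # None means a metadata-only document
    idx_parses: bool = False            # IDX: parsing succeeded
    transform_configured: bool = False  # XML: a transformation applies to this file
    matched_schema: bool = False        # XML: matched a configured schema (not the default)
    transform_ok: bool = False          # XML: the transformation succeeded
    converted_docs: int = 0             # XML: documents produced by the conversion


def route(s: Submission) -> Outcome:
    # Metadata-only documents skip the import process entirely.
    if s.file_format is None:
        return Outcome.READY_FOR_INDEXING

    fmt = s.file_format.lower()

    if fmt == "idx":
        # A parsed IDX file yields metadata-only documents; otherwise fall back.
        return Outcome.INGEST_QUEUE if s.idx_parses else Outcome.IMPORT_QUEUE

    if fmt == "xml":
        # A configured transformation that fails sends the file to import.
        if s.transform_configured and not s.transform_ok:
            return Outcome.IMPORT_QUEUE
        # Conversion produced documents: they go back to the ingest queue.
        if s.converted_docs > 0:
            return Outcome.INGEST_QUEUE
        # No documents, but the XML was transformed after matching a schema:
        # this is not a failure, and nothing is indexed.
        if s.transform_configured and s.matched_schema:
            return Outcome.NOTHING_INDEXED
        return Outcome.IMPORT_QUEUE

    # All other file formats go through the import process.
    return Outcome.IMPORT_QUEUE


if __name__ == "__main__":
    examples = [
        Submission(),
        Submission(file_format="pdf"),
        Submission(file_format="idx", idx_parses=True),
        Submission(file_format="xml", converted_docs=3),
    ]
    for s in examples:
        print(s.file_format, "->", route(s).name)
```

The sketch treats the XML case as three ordered checks (transformation failure, successful conversion, empty result after a schema match) because the text describes the import queue as the fallback for every other XML outcome.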