Index Documents into Vertica
CFS can index documents into Vertica, so that you can run queries on structured fields (document metadata).
Depending on the metadata contained in your documents, you could:
- Investigate the average age of documents in a repository. You might want to answer questions such as: How much time has passed since the documents were last updated? How many files are regularly updated? Does this represent a small proportion of the total number of documents? Who are the most active users?
- Find the number of e-mail messages sent to your sales or support teams each week, and calculate the average response time to customer queries.
Prerequisites
- CFS supports indexing into Vertica 7.1 and later.
- You must install the appropriate Vertica ODBC drivers (version 7.1 or later) on the machine that hosts Connector Framework Server. If you want to use an ODBC Data Source Name (DSN) in your connection string, you must also create the DSN. For more information about installing Vertica ODBC drivers and creating the DSN, refer to the Vertica documentation.
New, Updated and Deleted Documents
When documents are indexed into Vertica, CFS adds a timestamp that contains the time when the document was indexed. The field is named VERTICA_INDEXER_TIMESTAMP and the timestamp is in the format YYYY-MM-DD HH:NN:SS.
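As a rough illustration of the timestamp format, the following Python sketch parses a VERTICA_INDEXER_TIMESTAMP value that has been read back from the database as a string. It assumes that NN in the documented format stands for minutes; the function name parse_indexer_timestamp and the sample value are hypothetical.

```python
from datetime import datetime

# Assumption: "NN" in YYYY-MM-DD HH:NN:SS denotes minutes.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_indexer_timestamp(value: str) -> datetime:
    """Parse a VERTICA_INDEXER_TIMESTAMP string (hypothetical helper)."""
    return datetime.strptime(value.strip(), TIMESTAMP_FORMAT)


# Sample value invented for illustration only.
example = parse_indexer_timestamp("2015-03-09 14:05:30")
```

If the ODBC client already returns the column as a native date-time type, no parsing of this kind is needed.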
When a document in a data repository is modified, CFS adds a new record to the database with a new timestamp. All of the fields are populated with the latest data. The record describing the older version of the document is not deleted. You can create a projection to make sure your queries only return the latest record for a document.
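The "latest record per document" selection can be sketched in plain Python. This is not a Vertica projection; it only illustrates the logic on rows that have already been fetched. It assumes each row is a dictionary carrying DREREFERENCE and VERTICA_INDEXER_TIMESTAMP, and the function name latest_records, the title field, and the sample rows are hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def latest_records(rows):
    """Return one row per DREREFERENCE: the one with the newest timestamp."""
    latest = {}
    for row in rows:
        ref = row["DREREFERENCE"]
        ts = datetime.strptime(row["VERTICA_INDEXER_TIMESTAMP"], TIMESTAMP_FORMAT)
        current = latest.get(ref)
        # Keep this row if it is the first seen or newer than the stored one.
        if current is None or ts > current[0]:
            latest[ref] = (ts, row)
    return {ref: row for ref, (_, row) in latest.items()}


# Sample rows invented for illustration only.
rows = [
    {"DREREFERENCE": "doc-1", "VERTICA_INDEXER_TIMESTAMP": "2015-01-01 09:00:00", "title": "Draft"},
    {"DREREFERENCE": "doc-1", "VERTICA_INDEXER_TIMESTAMP": "2015-02-01 09:00:00", "title": "Final"},
    {"DREREFERENCE": "doc-2", "VERTICA_INDEXER_TIMESTAMP": "2015-01-15 12:30:00", "title": "Notes"},
]
latest = latest_records(rows)
```

Doing this selection inside the database is usually preferable, because it avoids transferring superseded records to the client.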
When a connector detects that a document has been deleted from a repository, CFS inserts a new record into the database. The record contains only the DREREFERENCE and the field VERTICA_INDEXER_DELETED set to TRUE.
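One way to interpret deletion records on the client side is sketched below. It assumes the rows are supplied in the order in which they were inserted (an assumption of this sketch, not something the database guarantees), that a deletion record removes the document, and that a later ordinary record for the same DREREFERENCE adds it back. The function name live_references and the sample rows are hypothetical.

```python
def is_deleted(row):
    """True if the row is a deletion marker (accepts True or the text 'TRUE')."""
    return str(row.get("VERTICA_INDEXER_DELETED")).upper() == "TRUE"


def live_references(rows_in_insert_order):
    """Return the set of DREREFERENCE values that are not currently deleted."""
    live = set()
    for row in rows_in_insert_order:
        ref = row["DREREFERENCE"]
        if is_deleted(row):
            live.discard(ref)  # deletion marker: drop the document
        else:
            live.add(ref)      # new or updated version: document exists
    return live


# Sample rows invented for illustration only.
rows = [
    {"DREREFERENCE": "doc-1", "title": "Report"},
    {"DREREFERENCE": "doc-2", "title": "Notes"},
    {"DREREFERENCE": "doc-1", "VERTICA_INDEXER_DELETED": True},
]
remaining = live_references(rows)
```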
Fields, Sub-Fields, and Field Attributes
Documents that are created by connectors and processed by CFS can have multiple levels of fields, and field attributes. A database table has a flat structure, so this information is indexed into Vertica as follows:
- Document fields become columns in the flex table. An IDOL document field and the corresponding database column have the same name.
- Sub-fields become columns in the flex table. A document field named my_field with a sub-field named subfield results in two columns, my_field and my_field.subfield.
- Field attributes become columns in the flex table. A document field named my_field, with an attribute named my_attribute results in two columns, my_field holding the field value and my_field.my_attribute holding the attribute value.
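The naming rules above can be illustrated with a short Python sketch that flattens one document field into column names. The nested dictionary layout (keys value, attributes, and subfields) and the function name flatten_field are hypothetical and are not the format CFS uses internally. Applying the same dotted naming to deeper nesting levels is also an assumption of this sketch.

```python
def flatten_field(name, field, columns=None):
    """Map a field, its attributes, and its sub-fields to dotted column names."""
    if columns is None:
        columns = {}
    # The field itself becomes a column with the same name.
    columns[name] = field.get("value")
    # Each attribute becomes a column named <field>.<attribute>.
    for attr_name, attr_value in field.get("attributes", {}).items():
        columns[f"{name}.{attr_name}"] = attr_value
    # Each sub-field becomes a column named <field>.<subfield>.
    for sub_name, sub_field in field.get("subfields", {}).items():
        flatten_field(f"{name}.{sub_name}", sub_field, columns)
    return columns


# Sample field invented for illustration only.
columns = flatten_field(
    "my_field",
    {
        "value": "field value",
        "attributes": {"my_attribute": "attribute value"},
        "subfields": {"subfield": {"value": "sub-field value"}},
    },
)
```

For the sample field, the resulting columns are my_field, my_field.my_attribute, and my_field.subfield, matching the names described in the list above.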