What is Metadata?

Documents may contain information about the document itself: we call this metadata. For instance, a raster image file will contain metadata recording the image's width and height; a word processing document may contain metadata recording the document's author and title. Metadata can be represented by key-value pairs. For instance, a document's title can be represented as the key "Title" and the value "Annual Report". We refer to a single metadata key-value pair as a metadata field. You can access document metadata using the Filter interface function fpGetMetadataList().

Containers (documents with subfiles) can contain metadata about their subfiles. For instance, a .pst file is a container that can have multiple email messages as subfiles. A .pst may contain metadata stating the "To", "From", etc. fields of these subfiles. You can access subfile metadata using the Extract interface function fpGetSubFileMetadataList(). fpGetMetadataList() and fpGetSubFileMetadataList() have different function signatures, but both output the KVMetadataList struct, so the output from both functions can be treated equivalently.

Representing Metadata in KeyView

KeyView uses the structures KVMetadataList and KVMetadataElement to store metadata.

KVMetadataElement

The KVMetadataElement structure represents one metadata field from the document. It contains a string representation of the field’s key, and the field’s value stored in an appropriate type.

KVMetadataList

The KVMetadataList structure represents a list of KVMetadataElement objects, allowing you to iterate over them. The order of elements within the list is not significant and may change in the future.

Understanding Metadata Fields in KeyView

Standardized metadata fields

Common metadata fields such as "Title", "Author", and "Subject" exist in many different file formats, but can be stored in different ways. For instance, one raster image format may store the image width as a key-value pair with key Width. Another format may store the image width in bytes 16-19 of the file. You might want to access the width of all raster images in the same way, regardless of the source format. KeyView provides standardized field names for some common metadata so that you can do this, meaning you can use the same code to handle many different file formats.

When KeyView understands the meaning of a metadata field in a document, it outputs that data in a standardized field. Standardized fields are represented as KVMetadataElement objects with an eKey set to a value other than KVMetadataKey_Other. For standardized fields, the following is true:

  • eKey is the standardized field key, which indicates the meaning of the field.

  • pKey is uniquely determined by eKey. If you are handling the value of a standardized field based on its eKey, you can ignore pKey. pKey is provided so that standardized fields can optionally be handled in the same way as non-standardized fields.

  • pValue is converted to a standard type, and, where appropriate, standard units.

Each standardized field is guaranteed to occur at most once in the metadata output. For example, the metadata output will contain zero or one KVMetadataElement objects with eKey equal to KVMetadataKey_Title.

For a full list of the standardized metadata fields, see Standardized Metadata Fields.
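
Because each standardized field occurs at most once, a lookup by eKey can stop at the first match. The following is a minimal sketch of that idea, not KeyView code: DemoKey, DemoElement, and find_standardized are hypothetical stand-ins that mirror only the eKey, pKey, and pValue members described above, and a plain array stands in for the KVMetadataList. The real definitions come from the KeyView headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in types, for illustration only. */
typedef enum { DemoKey_Other = 0, DemoKey_Title, DemoKey_ImageWidth } DemoKey;

typedef struct {
    DemoKey     eKey;   /* standardized key, or DemoKey_Other */
    const char *pKey;   /* string form of the key */
    const void *pValue; /* value, in a type that depends on the field */
} DemoElement;

/* Returns the element with the given standardized key, or NULL if the
 * field is absent. The first match is also the only match, because a
 * standardized field occurs at most once. */
const DemoElement *find_standardized(const DemoElement *elems, size_t count,
                                     DemoKey key)
{
    if (key == DemoKey_Other) {
        return NULL; /* non-standardized fields can occur many times */
    }
    for (size_t i = 0; i < count; ++i) {
        if (elems[i].eKey == key) {
            return &elems[i];
        }
    }
    return NULL;
}
```

A caller would treat a NULL result as "field not present" rather than as an empty or zero value.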

Non-standardized metadata fields

Non-standardized fields include user-created fields and fields that are specific to a particular type of document. KeyView handles these as follows: when a document contains a string representation of a field's key, pKey is set to that string. Otherwise, KeyView generates a value of pKey to describe the field. The field's value is stored in pValue using an appropriate data type. The eKey is set to KVMetadataKey_Other to signify that it is a non-standardized field.

Standardized value units

Many values are scalar values, and can be understood without reference to units (for example, page count). However, some values can be expressed in multiple units, and different file formats might store that information in different units (for example, image width could be stored in pixels, twips, or another graphic unit). When a field is standardized, its value is converted to the documented units, so all fields with a given eKey output their pValue in the same units.

The exception to this is KVMetadataValue_String values, where KeyView will output the string as it appears in the document. For example, depending on how the author field is stored in the document, it might be output as “John Smith”, “Smith, John”, “J. Smith”, or another form. The string will be encoded in the target encoding provided to fpInit().
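
To illustrate the kind of unit conversion that standardization saves you from writing, here is a minimal sketch that converts a width in twips to pixels. It relies on the definition of a twip as 1/1440 of an inch. The function name twips_to_pixels and the choice of pixels as the target unit are assumptions made for this example; refer to the documented units of each standardized field for what KeyView actually outputs.

```c
#include <stdint.h>

/* Converts a length in twips (1/1440 inch) to pixels at the given
 * resolution in dots per inch, rounding to the nearest pixel.
 * Intended for non-negative input. */
int64_t twips_to_pixels(int64_t twips, int64_t dpi)
{
    return (twips * dpi + 720) / 1440;
}
```

For example, 1440 twips is one inch, which is 96 pixels at 96 dpi.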

Duplicated fields

It is possible for the same logical piece of metadata to appear multiple times in the KeyView metadata output for a document:

  • If a field is output as a standardized field, it may also be output as a non-standardized field with its original key/value.

  • If a field is stored in multiple instances in a document, KeyView may output a field for each instance. For example, if a raster image stores its width in both Exif and XMP metadata, KeyView may output a field for each.

  • KeyView may output multiple fields with different keys for the same piece of metadata to maintain backwards compatibility.

Absence of fields

If a field is not present in KeyView’s metadata output, that does not imply a value for that field. For instance, if the standardized field KVMetadataKey_CharacterCount is not present, that does not imply that the document contains zero characters: instead, it indicates either that the document does not contain a “character count” metadata field, or that KeyView does not support that field for that format. If a field is not present in the document, KeyView will not attempt to construct that field, even where that field could be calculated from other information. For example, if the word count is missing from a word processing document, KeyView will not attempt to calculate it by counting the number of words in a file.
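
If your application needs a value even when the field is absent, it has to supply its own fallback. The sketch below shows one way this could look for a word count, assuming the caller already holds the document text and a pointer to the stored value (NULL when the field is absent). count_words and effective_word_count are hypothetical helper names, not part of KeyView.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fallback: counts whitespace-separated words in text. */
int64_t count_words(const char *text)
{
    int64_t words = 0;
    int in_word = 0;
    for (; *text != '\0'; ++text) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            ++words;
        }
    }
    return words;
}

/* stored points to the word count from the metadata, or is NULL when the
 * field is absent. A stored value of zero is returned as-is; only an
 * absent field triggers the fallback count. */
int64_t effective_word_count(const int64_t *stored, const char *text)
{
    return stored != NULL ? *stored : count_words(text);
}
```

Keeping "absent" (NULL) separate from "zero" preserves the distinction described above.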

Similar standardized fields

Some standardized fields have similar meanings. For example, KVMetadataKey_Author and KVMetadataKey_Artist both represent people who in some way created the content of the document. KeyView does not attempt to interpret or define the meaning of these fields: if a document’s metadata contains an author field, KeyView will attempt to standardize that as a KVMetadataKey_Author field. Similarly, it will attempt to standardize a document’s artist field as a KVMetadataKey_Artist field. KeyView will not attempt to determine if an author is an artist or vice versa.

Validation of fields

KeyView outputs metadata fields based on metadata stored within the document, but does not attempt to validate whether these fields are correct. For example:

  • KeyView will not check that the word count stored in the metadata is the same as the actual number of words present in the document.

  • KeyView will not attempt to validate that the MIP Label stored in the document can be used to decrypt the document.

  • KeyView will not check that the signature in a signed executable is authentic.

  • KeyView will not check that a PDF marked as being PDF/A conformant actually conforms.

Mail metadata

The metadata for an e-mail message (the header fields such as "To", "CC", "Subject", and so on) is typically stored in the mail container (such as an MSG or EML file). To access this metadata, call the function fpGetSubFileMetadataList() in the Extract API.

KeyView treats the message body and any attachments as subfiles of the container. By default, when you extract the message body, KeyView includes the header fields. If you do not want to include this information, set the flag KVExtractionFlag_ExcludeMailHeader when you call fpExtractSubFile(). You might want to do this if you have already accessed the metadata and do not want to process it again.

Backwards compatibility

To help you migrate from fpGetOLESummaryInfo to fpGetMetadataList, KeyView reproduces the output of fpGetOLESummaryInfo in the output of fpGetMetadataList. For each KVSummaryInfoEx object sumInfo that would be output by fpGetOLESummaryInfo, the output of fpGetMetadataList will contain a KVMetadataElement object metaElem such that metaElem.pKey contains the name in sumInfo.pcType, and metaElem.pValue contains the value represented by sumInfo.data. Where KeyView can standardize the field represented by sumInfo, the output of fpGetMetadataList also contains that standardized field.

To help you migrate from fpGetSubFileMetaData() to fpGetSubFileMetadataList(), KeyView reproduces the "default" and "all" metadata sets output by fpGetSubFileMetaData() in the output of fpGetSubFileMetadataList(). The "default" metadata set is output by fpGetSubFileMetaData() when the metaNameCount member of the KVGetSubFileMetaArgRec argument is set to 0, and the "all" metadata set is output when metaNameCount is set to -1. For each valid KVMetadataElem object metaElemOld in the "default" and "all" output of fpGetSubFileMetaData(), the output of fpGetSubFileMetadataList() will contain a KVMetadataElement object metaElemNew such that metaElemNew.pKey contains the name in metaElemOld.strType, and metaElemNew.pValue contains the value represented by metaElemOld.data. Where KeyView can standardize the field represented by metaElemOld, the output of fpGetSubFileMetadataList() also contains that standardized field.

Metadata Examples

If you want to process both standardized and non-standardized metadata fields, you can loop through the KVMetadataList without checking the eKey member – both standardized and non-standardized metadata can be handled in the same way.
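
As a rough sketch of such a uniform loop, the code below collects every field's pKey into a comma-separated string and never consults eKey. DemoKey, DemoElement, and join_keys are hypothetical stand-ins that mirror only the members named in this section, and a plain array stands in for the KVMetadataList. How you obtain each element from the real list is not shown here.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in types, for illustration only. */
typedef enum { DemoKey_Other = 0, DemoKey_Title } DemoKey;

typedef struct {
    DemoKey     eKey;
    const char *pKey;
    const void *pValue;
} DemoElement;

/* Writes "key1,key2,..." into out and returns the number of keys written.
 * Standardized and non-standardized fields are handled identically.
 * Stops early, without writing a partial key, if out is too small. */
size_t join_keys(const DemoElement *elems, size_t count,
                 char *out, size_t out_size)
{
    size_t used = 0;
    size_t written = 0;
    if (out_size == 0) {
        return 0;
    }
    out[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        int n = snprintf(out + used, out_size - used, "%s%s",
                         i > 0 ? "," : "", elems[i].pKey);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0'; /* discard the truncated key */
            break;
        }
        used += (size_t)n;
        ++written;
    }
    return written;
}
```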

However, standardization allows you to handle particular metadata fields differently. Below are some illustrative examples of ways you might use standardized fields to act on documents.

//Ignore non-standardized fields
if(element->eKey == KVMetadataKey_Other)
{
    continue;
}
//Ignore small images
if(element->eKey == KVMetadataKey_ImageWidth)
{
    int64_t width = *(int64_t*)element->pValue;
    if(width < 200)
    {
        break;
    }
}
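//Search for documents created by a certain company
if(element->eKey == KVMetadataKey_Company)
{
    KV_String company = *(KV_String*)element->pValue;
    //Check the length as well, so that a shorter value such as "Open" does not match
    if((size_t)company.cbSize >= strlen("OpenText") &&
       strncmp("OpenText", company.pcString, company.cbSize) == 0)
    {
        return pathToInputFile;
    }
}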
//Find all documents created since the beginning of 2022
if(element->eKey == KVMetadataKey_Created)
{
    int64_t created = *(int64_t*)element->pValue;
    //2022-01-01 00:00:00 UTC in Windows File Time
    if(created > 132854688000000000)
    {
        return pathToInputFile;
    }
}
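
The constant in the last example can be derived rather than hard-coded. Windows File Time counts 100-nanosecond intervals since 1601-01-01 00:00:00 UTC, and the Unix epoch (1970-01-01) falls 11,644,473,600 seconds after that. The helper names below are hypothetical and are not part of KeyView; this is a small sketch of the arithmetic only.

```c
#include <stdint.h>

/* Seconds between 1601-01-01 and 1970-01-01 (UTC). */
#define FILETIME_UNIX_OFFSET_SECONDS INT64_C(11644473600)
/* Windows File Time ticks are 100 ns, so there are 10,000,000 per second. */
#define FILETIME_TICKS_PER_SECOND INT64_C(10000000)

/* Converts seconds since the Unix epoch to Windows File Time. */
int64_t unix_seconds_to_filetime(int64_t unix_seconds)
{
    return (unix_seconds + FILETIME_UNIX_OFFSET_SECONDS)
           * FILETIME_TICKS_PER_SECOND;
}

/* Converts Windows File Time to whole seconds since the Unix epoch. */
int64_t filetime_to_unix_seconds(int64_t filetime)
{
    return filetime / FILETIME_TICKS_PER_SECOND
           - FILETIME_UNIX_OFFSET_SECONDS;
}
```

2022-01-01 00:00:00 UTC is 1,640,995,200 seconds after the Unix epoch, so unix_seconds_to_filetime(1640995200) gives the value 132854688000000000 used in the comparison above.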