What is Metadata?
Documents may contain information about the document itself: we call this metadata. For instance, a raster image file will contain metadata recording the image's width and height; a word processing document may contain metadata recording the document's author and title. Metadata can be represented by key-value pairs. For instance, a document's title can be represented as the key "Title" and the value "Annual Report". We refer to a single metadata key-value pair as a metadata field.
Containers (documents with subfiles) can contain metadata about their subfiles. For instance, a .pst file is a container that can have multiple email messages as subfiles. A .pst may contain metadata stating the "To", "From", etc. fields of these subfiles. You can access subfile metadata using the Extract interface function fpGetSubFileMetadataList().
Representing Metadata in KeyView
KeyView uses the structures KVMetadataList and KVMetadataElement to store metadata.
KVMetadataElement
The KVMetadataElement structure represents one metadata field from the document. It contains a string representation of the field’s key, and the field’s value stored in an appropriate type.
KVMetadataList
The KVMetadataList structure represents a list of KVMetadataElement objects, allowing you to iterate over them. The order of elements within the list is not significant and may change in the future.
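As a rough illustration of the key-value model described above, the following sketch uses hypothetical stand-in types (DemoField, DemoFieldList, DemoValueType). These are not the KeyView definitions; they only mirror the idea of a string key, a typed value, and a list whose order is not relied upon. The two helper functions are likewise invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only; NOT the KeyView structures. */
typedef enum { DEMO_VALUE_INT64, DEMO_VALUE_STRING } DemoValueType;

typedef struct {
    const char   *key;    /* string representation of the field's key */
    DemoValueType type;   /* tells the caller how to interpret value */
    const void   *value;  /* points at an int64_t or a NUL-terminated string */
} DemoField;

typedef struct {
    const DemoField *fields;
    size_t           count;
} DemoFieldList;

/* Writes "key=value" for one field into buf and returns the snprintf result. */
int demo_format_field(const DemoField *field, char *buf, size_t bufSize)
{
    assert(field != NULL && buf != NULL);
    switch (field->type) {
    case DEMO_VALUE_INT64:
        return snprintf(buf, bufSize, "%s=%lld", field->key,
                        (long long)*(const int64_t *)field->value);
    case DEMO_VALUE_STRING:
        return snprintf(buf, bufSize, "%s=%s", field->key,
                        (const char *)field->value);
    }
    return -1; /* unknown value type */
}

/* Counts the fields that hold a given value type. Every field is visited,
 * so the result does not depend on the order of the list. */
size_t demo_count_type(const DemoFieldList *list, DemoValueType type)
{
    assert(list != NULL);
    size_t matches = 0;
    for (size_t i = 0; i < list->count; ++i) {
        if (list->fields[i].type == type) {
            ++matches;
        }
    }
    return matches;
}
```

Because the helpers never assume a position for any field, they keep working if the order of elements changes.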
Understanding Metadata Fields in KeyView
Standardized metadata fields
Common metadata fields such as "Title", "Author", and "Subject" exist in many different file formats, but can be stored in different ways. For instance, one raster image format may store the image width as a key-value pair with key Width. Another format may store the image width in bytes 16-19 of the file. You might want to access the width of all raster images in the same way, regardless of the source format. KeyView provides standardized field names for some common metadata so that you can do this, meaning you can use the same code to handle many different file formats.
When KeyView understands the meaning of a metadata field in a document, it outputs that data in a standardized field. Standardized fields are represented as KVMetadataElement objects with an eKey set to a value other than KVMetadataKey_Other. For standardized fields, the following is true:
- eKey is the standardized field key, which indicates the meaning of the field.
- pKey is uniquely determined by eKey. If you are handling the value of a standardized field based on its eKey, you can ignore pKey. pKey is provided so that standardized fields can optionally be handled in the same way as non-standardized fields.
- pValue is converted to a standard type, and, where appropriate, standard units.
Each standardized field is guaranteed to occur at most once in the metadata output. For example, the metadata output will contain zero or one KVMetadataElement objects with eKey equal to KVMetadataKey_Title.
For a full list of the standardized metadata fields, see Standardized Metadata Fields.
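The at-most-once guarantee means that a lookup by standardized key can stop at the first match. The sketch below shows that idea using hypothetical stand-in types (DemoKey, DemoElement) and a hypothetical helper, demo_find(); none of these come from the KeyView headers, and the member names are borrowed from the description above purely for readability.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the KeyView headers. */
typedef enum {
    DEMO_KEY_OTHER = 0,   /* plays the role of KVMetadataKey_Other */
    DEMO_KEY_TITLE,
    DEMO_KEY_PAGE_COUNT
} DemoKey;

typedef struct {
    DemoKey     eKey;     /* standardized key, or DEMO_KEY_OTHER */
    const char *pKey;     /* string form of the key */
    const void *pValue;   /* value, interpreted according to the key */
} DemoElement;

/* Returns the element carrying the given standardized key, or NULL when the
 * field is absent. Because a standardized key occurs at most once, the first
 * match is the only match. */
const DemoElement *demo_find(const DemoElement *elems, size_t count, DemoKey key)
{
    assert(elems != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (elems[i].eKey == key) {
            return &elems[i];
        }
    }
    return NULL;
}
```

A NULL result only says that the field was not output; it does not supply a default value for it. The same helper is not suitable for DEMO_KEY_OTHER, because non-standardized fields carry no uniqueness guarantee.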
Non-standardized metadata fields
Non-standardized fields include user-created fields and fields that are specific to one type of document. KeyView deals with these in the following way: when a document contains a string representation of a field's key, pKey is set to that string. Otherwise, KeyView generates a value of pKey to describe the field. The field's value is stored in pValue using an appropriate data type. The eKey is set to KVMetadataKey_Other to signify that it is a non-standardized field.
Standardized value units
Many values are scalar values, and can be understood without reference to units (for example, page count). However, some values can be expressed in multiple units, and different file formats might store that information in different units. For example, image width could be stored in pixels, twips, or another graphic unit. When a field is standardized, its value is converted to the documented units, so all fields with a certain eKey output their pValue with the same units.
The exception to this is KVMetadataValue_String values, where KeyView will output the string as it appears in the document. For example, depending on how the author field is stored in the document, it might be output as "John Smith", "Smith, John", "J. Smith", or another form. The string will be encoded in the target encoding provided to fpInit().
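To show why unit standardization saves work for the caller, here is a minimal sketch of the kind of conversion that would otherwise be needed for each source format. It is not KeyView's conversion code: the DemoUnit type, the demo_to_pixels() function, and the choice of a caller-supplied DPI are all assumptions made for this example. The only facts relied upon are that one inch is 72 points and 1440 twips.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unit tags for illustration only. */
typedef enum { DEMO_UNIT_PIXELS, DEMO_UNIT_TWIPS, DEMO_UNIT_POINTS } DemoUnit;

/* Converts a non-negative length to pixels at the given resolution,
 * rounding to the nearest pixel. 1 inch = 72 points = 1440 twips. */
int64_t demo_to_pixels(int64_t value, DemoUnit unit, int64_t dpi)
{
    assert(value >= 0 && dpi > 0);
    switch (unit) {
    case DEMO_UNIT_PIXELS:
        return value;
    case DEMO_UNIT_TWIPS:
        return (value * dpi + 720) / 1440;
    case DEMO_UNIT_POINTS:
        return (value * dpi + 36) / 72;
    }
    return -1; /* unknown unit */
}
```

With standardized fields, the caller reads the value in the documented units and needs no per-format branch like the switch above.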
Duplicated fields
It is possible for the same logical piece of metadata to appear multiple times in the KeyView metadata output for a document:
- If a field is output as a standardized field, it may also be output as a non-standardized field with its original key/value.
- If a field is stored in multiple instances in a document, KeyView may output a field for each instance. For example, if a raster image stores its width in both Exif and XMP metadata, KeyView may output a field for each.
- KeyView may output multiple fields with different keys for the same piece of metadata to maintain backwards compatibility.
Absence of fields
If a field is not present in KeyView's metadata output, that does not imply a value for that field. For instance, if the standardized field KVMetadataKey_CharacterCount is not present, that does not imply that the document contains zero characters: instead, it indicates either that the document does not contain a "character count" metadata field, or that KeyView does not support that field for that format. If a field is not present in the document, KeyView will not attempt to construct that field, even where that field could be calculated from other information. For example, if the word count is missing from a word processing document, KeyView will not attempt to calculate it by counting the number of words in a file.
Similar standardized fields
Some standardized fields have similar meanings, for example KVMetadataKey_Author and KVMetadataKey_Artist: both represent people who in some way created the content of the document. KeyView does not attempt to interpret or define the meaning of these fields: if a document's metadata contains an author field, KeyView will attempt to standardize that as a KVMetadataKey_Author field. Equivalently, it will attempt to standardize a document's artist field as a KVMetadataKey_Artist field. KeyView will not attempt to determine if an author is an artist or vice versa.
Validation of fields
KeyView outputs metadata fields based on metadata stored within the document, but does not attempt to validate whether these fields are correct. For example:
- KeyView will not check that the word count stored in the metadata is the same as the actual number of words present in the document.
- KeyView will not attempt to validate that the MIP Label stored in the document can be used to decrypt the document.
- KeyView will not check that the signature in a signed executable is authentic.
- KeyView will not check that a PDF marked as being PDF/A conformant actually conforms.
Mail metadata
The metadata for an e-mail message (the header fields such as "To", "CC", "Subject", and so on) is typically stored in the mail container (such as an MSG or EML file). To access this metadata, you can call the function fpGetSubFileMetadataList() in the Extract API.
The message body and any attachments are considered by KeyView as subfiles of the container. When you extract the message body, KeyView includes the header fields (by default). If you do not want to include this information, set the flag KVExtractionFlag_ExcludeMailHeader when you call fpExtractSubFile(). You might want to do this if you have already accessed the metadata and do not want to process it again.
Backwards compatibility
To help you migrate from fpGetOLESummaryInfo to fpGetMetadataList, KeyView reproduces the output of fpGetOLESummaryInfo in the output of fpGetMetadataList. For each KVSummaryInfoEx object sumInfo that would be output by fpGetOLESummaryInfo, the output of fpGetMetadataList will contain a KVMetadataElement object metaElem such that metaElem.pKey contains the name in sumInfo.pcType, and metaElem.pValue contains the value represented by sumInfo.data. Where KeyView can standardize the field represented by sumInfo, the output of fpGetMetadataList also contains that standardized field.
To help you migrate from fpGetSubFileMetaData() to fpGetSubFileMetadataList(), KeyView reproduces the default and all metadata sets output by fpGetSubFileMetaData() in the output of fpGetSubFileMetadataList(). The default metadata set is output by fpGetSubFileMetaData() when the metaNameCount member of the KVGetSubFileMetaArgRec argument is set to 0, and the all metadata set is output when metaNameCount is set to -1. For each valid KVMetadataElem object metaElemOld in the default and all output of fpGetSubFileMetaData(), the output of fpGetSubFileMetadataList() will contain a KVMetadataElement object metaElemNew such that metaElemNew.pKey contains the name in metaElemOld.strType, and metaElemNew.pValue contains the value represented by metaElemOld.data. Where KeyView can standardize the field represented by metaElemOld, the output of fpGetSubFileMetadataList() also contains that standardized field.
Metadata Examples
If you want to process both standardized and non-standardized metadata fields, you can loop through the KVMetadataList without checking the eKey member: both standardized and non-standardized metadata can be handled in the same way.
However, standardization allows you to handle particular metadata fields differently. Below are some illustrative examples of ways you might use standardized fields to act on documents.
//Ignore non-standardized fields
if(element->eKey == KVMetadataKey_Other)
{
    continue;
}

//Ignore small images
if(element->eKey == KVMetadataKey_ImageWidth)
{
    int64_t width = *(int64_t*)element->pValue;
    if(width < 200)
    {
        break;
    }
}

//Search for documents created by a certain company
if(element->eKey == KVMetadataKey_Company)
{
    KV_String company = *(KV_String*)element->pValue;
    if(strncmp("OpenText", company.pcString, company.cbSize) == 0)
    {
        return pathToInputFile;
    }
}

//Find all documents created since the beginning of 2022
if(element->eKey == KVMetadataKey_Created)
{
    int64_t created = *(int64_t*)element->pValue;
    //2022-01-01 00:00:00 UTC in Windows File Time
    if(created > 132854688000000000)
    {
        return pathToInputFile;
    }
}
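
The constant in the last example can be derived rather than hard-coded. Windows File Time counts 100-nanosecond intervals since 1601-01-01 00:00:00 UTC, which is 11644473600 seconds before the Unix epoch. The sketch below shows that arithmetic; the function and macro names are hypothetical helpers written for this example and are not part of KeyView.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds between 1601-01-01 and 1970-01-01 (both 00:00:00 UTC). */
#define DEMO_EPOCH_DIFF_SECONDS INT64_C(11644473600)
/* Windows File Time counts 100-nanosecond intervals, so 10^7 per second. */
#define DEMO_TICKS_PER_SECOND   INT64_C(10000000)

/* Converts seconds since the Unix epoch to Windows File Time. */
int64_t demo_unix_to_filetime(int64_t unixSeconds)
{
    assert(unixSeconds >= -DEMO_EPOCH_DIFF_SECONDS);
    return (unixSeconds + DEMO_EPOCH_DIFF_SECONDS) * DEMO_TICKS_PER_SECOND;
}

/* Converts a non-negative Windows File Time to whole seconds since the
 * Unix epoch, discarding any fraction of a second. */
int64_t demo_filetime_to_unix(int64_t fileTime)
{
    assert(fileTime >= 0);
    return fileTime / DEMO_TICKS_PER_SECOND - DEMO_EPOCH_DIFF_SECONDS;
}
```

For 2022-01-01 00:00:00 UTC the Unix time is 1640995200, so the result is (1640995200 + 11644473600) × 10^7 = 132854688000000000, which matches the value used in the comparison above.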