What is Metadata?
Documents may contain information about the document itself: we call this metadata. For instance, a raster image file contains metadata recording the image's width and height; a word processing document may contain metadata recording the document's author and title. Metadata can be represented by key-value pairs. For instance, a document's title can be represented as the key "Title" and the value "Annual Report". We refer to a single metadata key-value pair as a metadata field.
Containers (documents with subfiles) can contain metadata about their subfiles. For instance, a Personal Folders (.pst) file is a container that can have multiple email messages as subfiles. A PST file may contain metadata, including the "To" and "From" fields of these subfiles.
Some documents contain metadata that is intended to be interpreted only by the parsing application, not the end user. This might be information like the number of records within the document, or the algorithm used to encrypt contents. File Content Extraction uses this metadata internally to interpret the file structure, but will only output metadata that is likely to be useful for the end user.
Access Metadata using the C API
You can access subfile metadata using the Extract interface function fpGetSubFileMetadataList().
File Content Extraction uses the structures KVMetadataList and KVMetadataElement to store metadata.
- The KVMetadataList structure represents a list of KVMetadataElement objects, allowing you to iterate over them. The order of elements within the list is not significant and may change in the future.
- The KVMetadataElement structure represents one metadata field. It contains a string representation of the field’s key, and the field’s value stored in an appropriate type.
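The iteration pattern described above can be sketched as follows. The types `SketchElement` and `SketchList`, and the helpers `sketch_next()` and `sketch_find()`, are simplified, hypothetical stand-ins written for illustration; they are not the real KVMetadataList and KVMetadataElement definitions, whose members you should take from the SDK header files. Because the order of elements is not significant, the sketch looks up a field by its key rather than by its position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical value types; the real SDK defines its own set. */
typedef enum { SKETCH_STRING, SKETCH_INT } SketchValueType;

/* Hypothetical stand-in for one metadata field (NOT the real KVMetadataElement). */
typedef struct {
    const char *key;        /* string representation of the field's key */
    SketchValueType type;   /* tells the caller which union member to read */
    union {
        const char *str;
        long long num;
    } value;
} SketchElement;

/* Hypothetical stand-in for the list (NOT the real KVMetadataList). */
typedef struct {
    const SketchElement *items;
    size_t count;
    size_t cursor;          /* index of the next element to return */
} SketchList;

/* Returns the next element, or NULL when the list is exhausted. */
static const SketchElement *sketch_next(SketchList *list)
{
    assert(list != NULL);
    if (list->cursor >= list->count) {
        return NULL;
    }
    return &list->items[list->cursor++];
}

/* Finds a field by key. Never relies on element order. */
static const SketchElement *sketch_find(SketchList *list, const char *key)
{
    const SketchElement *elem;
    list->cursor = 0;       /* restart iteration from the first element */
    while ((elem = sketch_next(list)) != NULL) {
        if (strcmp(elem->key, key) == 0) {
            return elem;
        }
    }
    return NULL;
}
```

A caller would check the element's type before reading its value, because each value is stored in a type appropriate to the field rather than always as a string.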
Mail Metadata
The metadata for an email message (header fields such as "To", "CC", and "Subject") is typically stored in the mail container (such as an MSG or EML file). To access this metadata, call the function fpGetSubFileMetadataList() in the Extract API.
File Content Extraction treats the message body and any attachments as subfiles of the container. By default, when you extract the message body, File Content Extraction includes the header fields in the output. If you do not want to include this information, set the flag KVExtractionFlag_ExcludeMailHeader when you call fpExtractSubFile(). You might want to do this if you have already accessed the metadata and do not want to process it again.
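As a rough illustration of that decision, the sketch below builds a flag bitmask depending on whether the caller has already read the headers. The constants `SKETCH_FLAG_NONE` and `SKETCH_FLAG_EXCLUDE_MAIL_HEADER` and the function `sketch_body_flags()` are hypothetical stand-ins with made-up values. In real code you would use KVExtractionFlag_ExcludeMailHeader from the SDK headers and pass the result through the arguments of fpExtractSubFile().

```c
#include <stdbool.h>

/* Hypothetical stand-in values. The real KVExtractionFlag_* constants
 * and their numeric values come from the SDK headers. */
enum {
    SKETCH_FLAG_NONE = 0,
    SKETCH_FLAG_EXCLUDE_MAIL_HEADER = 1 << 0
};

/* Chooses extraction flags for a message body. If the header fields were
 * already read as metadata, ask for the body without them. */
static unsigned sketch_body_flags(bool headersAlreadyRead)
{
    unsigned flags = SKETCH_FLAG_NONE;
    if (headersAlreadyRead) {
        flags |= SKETCH_FLAG_EXCLUDE_MAIL_HEADER;
    }
    return flags;
}
```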
Backwards Compatibility
To help you migrate from fpGetSubFileMetaData() to fpGetSubFileMetadataList(), File Content Extraction reproduces the "default" and "all" metadata sets output by fpGetSubFileMetaData() in the output of fpGetSubFileMetadataList(). The "default" metadata set is output by fpGetSubFileMetaData() when the metaNameCount member of the KVGetSubFileMetaArgRec argument is set to 0, and the "all" metadata set is output when metaNameCount is set to -1. For each valid KVMetadataElem object metaElemOld in the "default" and "all" output of fpGetSubFileMetaData(), the output of fpGetSubFileMetadataList() will contain a KVMetadataElement object metaElemNew such that metaElemNew.pKey contains the name in metaElemOld.strType, and metaElemNew.pValue contains the value represented by metaElemOld.data. Where File Content Extraction can standardize the field represented by metaElemOld, the output of fpGetMetadataList also contains that standardized field.
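During migration, one way to sanity-check this guarantee is to confirm that every old element has a counterpart in the new output. The sketch below assumes both outputs have already been copied into plain string pairs. `SketchOldElem`, `SketchNewElem`, `sketch_has_match()`, and `sketch_old_is_subset()` are hypothetical names, and the structures are deliberately simplified: they reuse the member names strType, data, pKey, and pValue only to mirror the text above, and they treat every value as a string, which the real structures do not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical copy of an element from the old function's output. */
typedef struct {
    const char *strType;    /* field name */
    const char *data;       /* value, flattened to a string for this sketch */
} SketchOldElem;

/* Simplified, hypothetical copy of an element from the new function's output. */
typedef struct {
    const char *pKey;       /* field key */
    const char *pValue;     /* value, flattened to a string for this sketch */
} SketchNewElem;

/* True if some new element carries the same name and value as oldElem. */
static bool sketch_has_match(const SketchOldElem *oldElem,
                             const SketchNewElem *newElems, size_t newCount)
{
    assert(oldElem != NULL);
    for (size_t i = 0; i < newCount; ++i) {
        if (strcmp(newElems[i].pKey, oldElem->strType) == 0 &&
            strcmp(newElems[i].pValue, oldElem->data) == 0) {
            return true;
        }
    }
    return false;
}

/* True if every old element is reproduced somewhere in the new output.
 * The new output may hold extra fields, such as standardized ones. */
static bool sketch_old_is_subset(const SketchOldElem *oldElems, size_t oldCount,
                                 const SketchNewElem *newElems, size_t newCount)
{
    for (size_t i = 0; i < oldCount; ++i) {
        if (!sketch_has_match(&oldElems[i], newElems, newCount)) {
            return false;
        }
    }
    return true;
}
```

The check is one-directional by design: the new list is allowed to contain additional fields, so only the old-to-new direction is tested.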