What is Metadata?
Documents may contain information about the document itself: we call this metadata. For instance, a raster image file contains metadata recording the image's width and height; a word processing document may contain metadata recording the document's author and title. Metadata can be represented by key-value pairs. For instance, a document's title can be represented as the key "Title" and the value "Annual Report". We refer to a single metadata key-value pair as a metadata field.
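As a minimal illustration of the key-value idea, independent of any particular library, a metadata field can be modeled as a pair of strings. The `MetadataField` name and the sample values below are hypothetical and chosen only for this sketch.

```python
from typing import NamedTuple


class MetadataField(NamedTuple):
    """Hypothetical model of one metadata field: a key and its value."""
    key: str
    value: str


# Sample values for illustration only.
fields = [
    MetadataField("Title", "Annual Report"),
    MetadataField("Author", "J. Smith"),
]

# Index the fields by key so a value can be looked up directly.
by_key = {f.key: f.value for f in fields}
print(by_key["Title"])
```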
Containers (documents with subfiles) can contain metadata about their subfiles. For instance, a Personal Folders (.pst) file is a container that can have multiple email messages as subfiles. A PST file may contain metadata, including the "To" and "From" fields of these subfiles.
Some documents contain metadata that is intended to be interpreted only by the parsing application, not the end user. Examples include the number of records within the document or the algorithm used to encrypt its contents. File Content Extraction uses this metadata internally to interpret the file structure, but outputs only metadata that is likely to be useful to the end user.
Access Metadata using the Python API
You can access document metadata through the metadata attribute on a Document object.
You can access subfile metadata through the metadata attribute on a Subfile object.
File Content Extraction uses Metadata and MetadataElement objects to represent metadata.
- A Metadata object is a container of MetadataElement objects, allowing you to iterate over them or look them up by key.
- A MetadataElement object represents one metadata field.
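The relationship between these objects can be sketched with stand-in classes. This is not the library's implementation: `StandInDocument`, `StandInMetadata`, and `StandInMetadataElement` are hypothetical, and the `key` and `value` attribute names and the square-bracket lookup are assumptions made for illustration. Only the `metadata` attribute, iteration, and lookup by key come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class StandInMetadataElement:
    """Stand-in for one metadata field (attribute names are assumed)."""
    key: str
    value: str


class StandInMetadata:
    """Stand-in container: iterate over elements or look one up by key."""

    def __init__(self, elements: Iterable[StandInMetadataElement]) -> None:
        # A dict keeps insertion order and gives direct lookup by key.
        self._by_key: Dict[str, StandInMetadataElement] = {
            element.key: element for element in elements
        }

    def __iter__(self) -> Iterator[StandInMetadataElement]:
        return iter(self._by_key.values())

    def __getitem__(self, key: str) -> StandInMetadataElement:
        return self._by_key[key]


@dataclass
class StandInDocument:
    """Stand-in for a document that exposes a metadata attribute."""
    metadata: StandInMetadata


doc = StandInDocument(
    metadata=StandInMetadata(
        [
            StandInMetadataElement("Title", "Annual Report"),
            StandInMetadataElement("Author", "J. Smith"),
        ]
    )
)

# Iterate over every metadata field.
for element in doc.metadata:
    print(f"{element.key}: {element.value}")

# Look up a single field by key.
title = doc.metadata["Title"].value
```

A stand-in for a Subfile object would expose its metadata attribute in the same way. Consult the API reference for the actual attribute and method names before relying on any of these.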