Introduction
A file can contain other files, which we call subfiles. Examples of subfiles include e-mail attachments and embedded OLE objects. A file that contains subfiles is called a container file.
The following are examples of container files:
- Archive files such as the ZIP, TAR, and RAR formats.
- Mail messages such as Outlook (MSG) and Outlook Express (EML).
- Mail stores such as Microsoft Outlook Personal Folders (PST), Mailbox (MBX), and Lotus Notes database (NSF).
- PDF files that contain file attachments.
- Compound documents with embedded OLE objects such as a Microsoft Word document with an embedded Excel chart.
File Content Extraction can extract subfiles from many formats. For the supported formats, see the "Extract" column in the section Document Readers.
Filtering or exporting a container file might result in little or no output, because a container might not have any text content of its own. However, you might be able to filter or export the container's subfiles. Through the File Content Extraction API you can see whether a file is a container, see how many subfiles it contains, and access or extract those subfiles for further processing.
To obtain all possible content from a file, you can filter it to obtain plain text, extract it to obtain subfiles, and retrieve both the metadata the file stores about itself and the metadata it stores about its subfiles. You can then repeat this process for each subfile you’ve extracted.
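The repeat-for-each-subfile process can be sketched as a recursive walk. The example below does not use the File Content Extraction API; as a stand-in it uses Python's standard zipfile module and treats only ZIP archives as containers. The function names walk_container and make_zip and the file names are hypothetical, chosen for illustration.

```python
import io
import zipfile


def walk_container(data: bytes, name: str, depth: int = 0, found=None):
    """Record a file, then recurse into its subfiles if it is a ZIP archive."""
    if found is None:
        found = []
    found.append((depth, name))
    buffer = io.BytesIO(data)
    # Assumption: only ZIP archives count as containers in this sketch.
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Each subfile might itself be a container, so recurse.
                walk_container(archive.read(info), info.filename, depth + 1, found)
    return found


def make_zip(members: dict) -> bytes:
    """Build an in-memory ZIP archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member_name, payload in members.items():
            archive.writestr(member_name, payload)
    return buffer.getvalue()


if __name__ == "__main__":
    inner = make_zip({"notes.txt": b"inner text"})
    outer = make_zip({"readme.txt": b"outer text", "inner.zip": inner})
    for level, entry in walk_container(outer, "outer.zip"):
        print("  " * level + entry)
```

In a real application, the step that records each file is where you would filter the file for text and retrieve its metadata before descending into its subfiles.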
Subfiles Can Be Containers
Subfiles might also be container files, creating a file hierarchy of multiple levels. For example, an MSG file might contain three attachments:
- A Microsoft Word document that contains an embedded Microsoft Excel spreadsheet.
- An AutoCAD drawing file (DWG).
- An EML file with an attached ZIP file, which in turn contains four archived files.
NOTE: The MSG file contains four first-level children rather than three, because the body text of a mail message is also treated as a subfile (see Extract Mail Files for more information).
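The hierarchy in this example can be modeled as a small tree to make the counts concrete. This is an illustrative sketch only: the Node class, the max_depth function, and all file names are hypothetical, and the tree lists only the subfiles named in the example above (the EML file's own body text is left out because the example does not list it).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A file in the hierarchy; a node with children is a container."""
    name: str
    children: list = field(default_factory=list)

    def is_container(self) -> bool:
        return bool(self.children)


def max_depth(node: Node) -> int:
    """Number of levels below this node (0 for a file with no subfiles)."""
    if not node.children:
        return 0
    return 1 + max(max_depth(child) for child in node.children)


# Hypothetical file names standing in for the example's attachments.
message = Node("message.msg", [
    Node("body text"),                          # the mail body counts as a subfile
    Node("report.doc", [Node("chart.xls")]),    # Word document with embedded Excel
    Node("drawing.dwg"),
    Node("forwarded.eml", [
        Node("files.zip", [Node(f"archived_{i}") for i in range(1, 5)]),
    ]),
])

if __name__ == "__main__":
    print(len(message.children))   # first-level children of the MSG file
    print(max_depth(message))      # levels below the MSG file
```

Under these assumptions the MSG file has four first-level children, and the deepest files (the four archived files) sit three levels below it.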