Introduction

To filter a file, you must first determine whether the file contains any subfiles (attachments, embedded OLE objects, and so on). A file that contains subfiles is called a container file. A container file has a main file (parent) and subfiles (children) embedded in the main file. The following are examples of container files:

  • Archive files such as ZIP, TAR, and RAR.
  • Mail messages such as Outlook (MSG) and Outlook Express (EML).
  • Mail stores such as Microsoft Outlook Personal Folders (PST), Mailbox (MBX), and Lotus Notes database (NSF).
  • PDF files that contain file attachments.
  • Compound documents with embedded OLE objects such as a Microsoft Word document with an embedded Excel chart.

NOTE: Document Readers indicates which formats are treated as container files and which are supported by the File Extraction API.

The subfiles might also be container files, creating a file hierarchy of multiple levels. For example, suppose an MSG file (the root parent) contains three attachments:

  • A Microsoft Word document that contains an embedded Microsoft Excel spreadsheet.
  • An AutoCAD drawing file (DWG).
  • An EML file with an attached ZIP file, which in turn contains four archived files.

The following diagram shows the file's hierarchy.

NOTE: The parent MSG file contains four first-level children: the three attachments and the message body text. The body text of a message file, although not a standalone file in the container, is considered a child of the parent file.
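
The example hierarchy can also be modeled as a small tree, which makes the parent and child levels easy to inspect. The following Python sketch is illustrative only and does not use the File Extraction API. The Node class, the walk and build_example functions, and all file names are hypothetical. It assumes that only the children named in the example are present, so the body text of the nested EML file is not modeled.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """A file (or message body) in a container hierarchy. Hypothetical type."""
    name: str
    children: List["Node"] = field(default_factory=list)

    @property
    def is_container(self) -> bool:
        # In this sketch, a node is a container if it has at least one subfile.
        return bool(self.children)


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in depth-first order, starting at the root."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def build_example() -> Node:
    """Build the MSG example described in the text (file names are made up)."""
    zip_file = Node(
        "archive.zip",
        [Node(f"archived_file_{i}") for i in range(1, 5)],
    )
    return Node(
        "message.msg",
        [
            Node("body text"),  # counted as a first-level child
            Node("report.doc", [Node("embedded_sheet.xls")]),
            Node("drawing.dwg"),
            # Assumption: the EML's own body text is not modeled here.
            Node("forwarded.eml", [zip_file]),
        ],
    )


if __name__ == "__main__":
    root = build_example()
    for depth, node in walk(root):
        kind = "container" if node.is_container else "leaf"
        print("  " * depth + f"{node.name} ({kind})")
```

Running the script prints one line per node, indented by level, so the four first-level children of the MSG file appear directly under the root.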