Understand the Subfile Hierarchy

When you extract a container file, the paths or relationships between the subfiles might be irrelevant. For example, you might want to filter the subfiles contained in a ZIP archive, but you might not care about the file and folder structure.

File Content Extraction provides information that enables you to recreate the hierarchy for use cases where it matters. You can use the hierarchy to recreate the directory structure in a file system, or to process documents according to their relationships to each other. For example, if you use File Content Extraction as part of a search engine, you could use the hierarchical information to let users search a document's parent or siblings within the container. In addition, when a document is returned to the user, its parent, sibling, or child documents could be returned as recommendations.

To obtain information about the position of a subfile in the hierarchy, use the parent and children attributes on the Subfile object. You might need to access these attributes for multiple subfiles to reconstruct the entire hierarchy.

A parent index of -1 indicates that the subfile has no parent and is at the root level within its container.
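
The following is a minimal sketch of how parent indexes can be turned back into a hierarchy. It does not call File Content Extraction: it assumes you have already read the parent index and a name for each subfile into plain dictionaries. The helper names (build_children_map, subfile_path) and the sample data are hypothetical and used only for illustration.

```python
from collections import defaultdict

NO_PARENT = -1  # parent index reported for subfiles at the root level


def build_children_map(parents):
    """Map each parent index to the sorted list of its child indexes."""
    children = defaultdict(list)
    for index, parent in parents.items():
        children[parent].append(index)
    return {parent: sorted(kids) for parent, kids in children.items()}


def subfile_path(index, parents, names):
    """Join the names from the root-level ancestor down to the subfile."""
    parts = []
    while index != NO_PARENT:
        parts.append(names[index])
        index = parents[index]
    return "/".join(reversed(parts))


# Made-up data: subfile index -> parent index, and subfile index -> name.
parents = {0: -1, 1: 0, 2: 0, 3: 1, 4: 1}
names = {0: "Inbox", 1: "Message", 2: "Notes", 3: "a.doc", 4: "b.txt"}

children = build_children_map(parents)
print(children[NO_PARENT])                # root-level subfiles
print(subfile_path(3, parents, names))    # path for subfile 3
```

Walking upwards from a subfile through its parents gives a path that you could use to recreate the directory structure, while the children map gives the siblings of a subfile (the other entries under the same parent).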

Example

You might extract a PST file that contains five subfiles. The following diagram shows the available hierarchy information for each subfile:

Using this information, you can recreate the hierarchy shown in the following diagram.

Create a Root Node

You can instruct File Content Extraction to create an artificial root node at the top of the hierarchy. The subfile index of the root node is always 0. This artificial root node provides a reference point from which the hierarchy is created, and ensures that:

  • the highest level of the hierarchy includes no more than one node.
  • every node (except the root node) has a parent.

You can extract the root node as a directory called root.

When you choose to create a root node, the number of subfiles in a container, as counted by File Content Extraction, increases by one.
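
As an illustration of these guarantees, the sketch below checks a set of parent indexes that you have already collected into a dictionary. It does not call File Content Extraction. The function name check_root_node_hierarchy and the sample data are hypothetical, and the sketch assumes that the root node itself reports a parent index of -1.

```python
NO_PARENT = -1  # assumption: the artificial root node reports no parent


def check_root_node_hierarchy(parents):
    """Return a list of problems; an empty list means the guarantees hold."""
    problems = []
    # Only the root node (index 0) should sit at the top level.
    top_level = sorted(i for i, p in parents.items() if p == NO_PARENT)
    if top_level != [0]:
        problems.append(f"expected only index 0 at the top level, found {top_level}")
    # Every other node must point at a parent that exists.
    for index, parent in parents.items():
        if parent != NO_PARENT and parent not in parents:
            problems.append(f"subfile {index} refers to unknown parent {parent}")
    return problems


# Made-up data: the same container without and with a root node.
without_root = {0: -1, 1: -1, 2: 0}        # two top-level subfiles
with_root = {0: -1, 1: 0, 2: 0, 3: 1}      # index 0 is the root node

print(check_root_node_hierarchy(without_root))
print(check_root_node_hierarchy(with_root))
print(len(with_root) - len(without_root))  # the count grows by one
```

A check like this can be useful before you recreate the hierarchy on disk, because it confirms that every subfile can be reached from a single starting point.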

To create a root node

  • In the Python API, call the method create_root_node on your session configuration.

    session.config.create_root_node(True)