Understand the Subfile Hierarchy
When you extract a container file, the paths or relationships between the subfiles might be irrelevant. For example, you might want to filter the subfiles contained in a ZIP archive, but you might not care about the file and folder structure.
File Content Extraction provides information that enables you to recreate the hierarchy, for those use cases where it is important. You can use the hierarchy to recreate the directory structure in a file system, or to process documents according to their relationship to each other. For example, if you use File Content Extraction as part of a search engine, the hierarchical information could be used to enable your users to search a document's parent or siblings within the container. In addition, when a document is returned to the user, the parent, sibling, or child documents could be returned as recommendations.
To obtain information about the position of a subfile in the hierarchy, call extGetSubFileInfo, which returns an object of the ExtSubFileInfo class. Then, use the getParentIndex() and getChildArray() methods to identify the subfile's parent and children. You might need to call extGetSubFileInfo for multiple subfiles to reconstruct the entire hierarchy.
When File Content Extraction returns the parent index -1, this indicates that the subfile has no parent and is at the root level within its container.
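The reconstruction described above can be sketched in plain Java. This is a minimal illustration, not the File Content Extraction API: the Entry record is a hypothetical holder for the subfile index, the value you would read from getParentIndex(), and a display name, and the sample names are invented. It assumes that each parent index refers to another subfile in the same container, or is -1 for a root-level subfile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SubfileHierarchy {

    /** Hypothetical holder for values read from each ExtSubFileInfo object. */
    record Entry(int index, int parentIndex, String name) {}

    /** Groups subfile indexes by parent index; key -1 holds the root-level subfiles. */
    static Map<Integer, List<Integer>> childrenByParent(List<Entry> entries) {
        Map<Integer, List<Integer>> children = new TreeMap<>();
        for (Entry e : entries) {
            children.computeIfAbsent(e.parentIndex(), k -> new ArrayList<>()).add(e.index());
        }
        return children;
    }

    /** Renders the hierarchy as indented text, one subfile per line. */
    static String render(List<Entry> entries) {
        Map<Integer, List<Integer>> children = childrenByParent(entries);
        Map<Integer, String> names = new TreeMap<>();
        for (Entry e : entries) {
            names.put(e.index(), e.name());
        }
        StringBuilder out = new StringBuilder();
        append(out, children, names, -1, 0);
        return out.toString();
    }

    /** Depth-first walk that starts from the subfiles whose parent is the given index. */
    private static void append(StringBuilder out, Map<Integer, List<Integer>> children,
                               Map<Integer, String> names, int parent, int depth) {
        for (int child : children.getOrDefault(parent, List.of())) {
            out.append("  ".repeat(depth)).append(names.get(child)).append('\n');
            append(out, children, names, child, depth + 1);
        }
    }

    public static void main(String[] args) {
        // Illustrative data only; real values would come from extGetSubFileInfo.
        List<Entry> entries = List.of(
                new Entry(0, -1, "Inbox"),
                new Entry(1, 0, "message.msg"),
                new Entry(2, 1, "attachment.doc"),
                new Entry(3, -1, "Sent Items"),
                new Entry(4, 3, "reply.msg"));
        System.out.print(render(entries));
    }
}
```

Building the map from parent indexes alone means that one pass over the subfiles is enough. The child arrays could be used instead, or used to cross-check the result.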
Example
You might extract a PST file that contains five subfiles. The following diagram shows the available hierarchy information for each subfile:
Using this information, you can recreate the hierarchy shown in the following diagram.
Create a Root Node
You can instruct File Content Extraction to create an artificial root node at the top of the hierarchy. The subfile index of the root node is always 0. This artificial root node provides a reference point from which the hierarchy is created, and ensures that:
- the highest level of the hierarchy includes no more than one node.
- every node (except the root node) has a parent.
You can extract the root node as a directory called root.
When you choose to create a root node, the number of subfiles in a container, as counted by File Content Extraction, increases by one.
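The guarantees listed above can be expressed as a simple check on the parent indexes. The following sketch is an illustration under stated assumptions rather than part of the API: it takes a hypothetical array in which position i holds the parent index of subfile i, and it assumes that the root node itself reports a parent index of -1.

```java
public class RootNodeCheck {

    /**
     * Returns true if subfile 0 is the only top-level node and every other
     * subfile has a valid parent. Position i holds the parent index of subfile i.
     * Assumption: the root node itself reports a parent index of -1.
     */
    static boolean hasSingleRoot(int[] parentIndexes) {
        if (parentIndexes.length == 0 || parentIndexes[0] != -1) {
            return false;
        }
        for (int i = 1; i < parentIndexes.length; i++) {
            int parent = parentIndexes[i];
            // Every non-root node needs a parent that exists and is not itself.
            if (parent < 0 || parent >= parentIndexes.length || parent == i) {
                return false;
            }
        }
        return true;
    }

    /** The subfile count with a root node is one more than the count without it. */
    static int countWithRootNode(int countWithoutRootNode) {
        return countWithoutRootNode + 1;
    }

    public static void main(String[] args) {
        // Illustrative data: root node 0 with three subfiles beneath it.
        int[] parents = {-1, 0, 0, 1};
        System.out.println(hasSingleRoot(parents));
        System.out.println(countWithRootNode(3));
    }
}
```

A check like this can be useful before writing the hierarchy to disk, because code that assumes a single top-level directory fails in confusing ways if the root node was not requested.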
To create a root node
- In the Java API, call setCreateRootNode on the ExtOpenDocConfig class that you pass to the extOpenDocument method. When you request a root node, the return value of the getNumSubFiles method (on the ExtMainFileInfo object) increases by one. For example, a Microsoft Word document with three embedded OLE objects has 4 subfiles when the root node is enabled, rather than 3.