Explore Filter SDK Features

You can use File Content Extraction by calling it from your own applications through one of its APIs. However, to help you get started, the SDK includes some non-production test utilities which allow you to use File Content Extraction from the command line and explore its functionality.

This section is an introductory tutorial that helps you explore the key features of the Filter SDK, using the out-of-the-box command-line test utilities filtertest and extracttest.

Download and Extract the Filter SDK

Download the Filter SDK from the Software Licensing and Downloads portal. Extract the zip file to a folder of your choice, which from now on we will refer to as %KEYVIEW_HOME%.

On Windows, you might need to install the included Visual C++ Redistributable package. In the vcredist folder of the SDK, right-click vcredist_2019.exe and then click Run as administrator.

License Key

File Content Extraction requires a license key, which is unique to your project. The test programs have an embedded trial license, which expires approximately five months after release, so you can follow this tutorial using the embedded license.

To use your own license, obtain a license key from the Entitlements tab of the Software Licensing and Downloads portal. Then set the environment variable KV_SAMPLE_PROGRAM_LICENSE_FROM_FILEPATH to the path of a text file containing only your license key. The test programs will read the environment variable and obtain your license from the specified file.
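If you launch the test programs from a script, the license setup described above could be sketched in Python as follows. This is a minimal sketch, not part of the SDK: the helper name license_environment and the license file location are hypothetical, and only the environment variable name comes from this tutorial.

```python
import os
from pathlib import Path

# Environment variable name taken from this tutorial.
LICENSE_ENV_VAR = "KV_SAMPLE_PROGRAM_LICENSE_FROM_FILEPATH"


def license_environment(license_key: str, license_file: Path) -> dict:
    """Hypothetical helper: write the key to license_file and return a copy
    of the current environment with the license variable set to that path."""
    # The file must contain only the license key, so strip stray whitespace.
    license_file.write_text(license_key.strip(), encoding="utf-8")
    env = dict(os.environ)
    env[LICENSE_ENV_VAR] = str(license_file.resolve())
    return env
```

You could then pass the returned dictionary as the env argument of subprocess.run when you start filtertest or extracttest, so that the setting applies only to that process.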

Sample Documents

Sample documents for this tutorial are available in an OpenText GitHub repository. Download these sample documents to a suitable directory (in this tutorial we will assume D:\sample-documents\).

Move the Test Programs

The included test programs run only from the File Content Extraction bin directory. To simplify packaging, they are not located there by default: you can distribute the entire bin folder with your application, but you should not include the test programs. Copy them into the bin directory before you continue.

cd %KEYVIEW_HOME%\WINDOWS_X86_64
copy test\*.exe bin

Format Detection

The first feature that we will explore is file format detection. File Content Extraction can identify more than 2000 different file formats. Format detection is performed by analyzing the content of a file and does not use file extensions, because file extensions can be unreliable and in some cases are used by multiple formats.

To try format detection

  1. Use the filtertest -ah command:

    cd %KEYVIEW_HOME%\WINDOWS_X86_64\bin
    filtertest -ah "D:\sample-documents\KeyViewFilterSDK_12.13.0_ReleaseNotes_en.pdf" "D:\output\format_detection.txt"
  2. Open the output file in a text editor, to see the results:

    File Class:         1
    Format Number:      230
    Version:            1400
    Attributes:         0
    Description:        Adobe PDF (Portable Document Format)
    MIME Type:          application/pdf

    File Content Extraction correctly identified this file as Format Number: 230, which is an Adobe PDF file. Version: 1400 refers to PDF version 1.4. File Class: 1 refers to the adWORDPROCESSOR file class. For more information about format numbers and file classes, see File Formats.

    NOTE: The class and format ID assignment scheme was created for File Content Extraction. When applicable, the File Formats documentation notes the MIME type, but not all file formats have MIME types.

You can now try format detection with your own test files.
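If you run format detection on many files, you might want to read the results programmatically rather than in a text editor. The following Python sketch parses a report into a dictionary. It assumes that every report uses the "Label: value" layout shown in the sample output above; the function name parse_format_report is hypothetical.

```python
def parse_format_report(text: str) -> dict:
    """Parse 'Label: value' lines, as in the sample output, into a dict."""
    report = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        label, sep, value = line.partition(":")
        if sep and label.strip():
            report[label.strip()] = value.strip()
    return report


# Sample text copied from the tutorial output above.
sample = """\
File Class:         1
Format Number:      230
Version:            1400
Attributes:         0
Description:        Adobe PDF (Portable Document Format)
MIME Type:          application/pdf
"""

report = parse_format_report(sample)
print(report["Format Number"], report["MIME Type"])  # prints: 230 application/pdf
```

All values are kept as strings, so convert fields such as Format Number with int() if you need to compare them numerically.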

Metadata Extraction

Documents can contain different types of metadata. For example, a document might have a Title and an Author, an image might have a width and a height, and so on. File formats store metadata in many different ways, including standard mechanisms like XMP and format-specific mechanisms. File Content Extraction reports all types of metadata through a common interface, so that you can use the same method to obtain it, regardless of the underlying storage mechanism.

To try metadata extraction

  1. Use the filtertest -m command. For example:

    filtertest -m "D:\sample-documents\KeyViewFilterSDK_12.13.0_ReleaseNotes_en.pdf" "D:\output\metadata.txt"
  2. To view the extracted metadata, open the output file in a UTF-8 capable text editor.

    Name    Key    Type    Data    HasStandardAlternative    IsSuperseded
    "Title"    0    String    "IDOL KeyView Filter SDK 12.13.0 Release Notes"    true    false
    "Title"    4000    String    "IDOL KeyView Filter SDK 12.13.0 Release Notes"    false    false
    "Author"    0    String    "Micro Focus"    true    false
    "Author"    2000    String    "Micro Focus"    false    false
    "Create_DTM"    0    DateTime    "Fri Oct 21 14:21:17 2022"    true    false
    "Created"    1000    DateTime    "Fri Oct 21 14:21:17 2022"    false    false
    "LastSave_DTM"    0    DateTime    "Fri Oct 21 14:21:17 2022"    true    false
    "Modified"    1001    DateTime    "Fri Oct 21 14:21:17 2022"    false    false
    "PageCount"    0    Integer    10    true    false
    "PageCount"    5000    Integer    10    false    false
    "AppName"    0    String    "madbuild"    true    false
    "Application"    2001    String    "madbuild"    false    false

    File Content Extraction has successfully extracted metadata from the document, including its title, author, creation date, page count, and so on.

    This output demonstrates a useful feature, called field standardization. It might appear that there are duplicate pieces of metadata, but this is by design and can help you to handle multiple file formats without needing to write specialized code for each one. Field standardization standardizes metadata key names so that the same type of metadata can be accessed in a consistent way regardless of the file format. For example, the PDF file we processed had a native field named Create_DTM, containing the date that the document was created. File Content Extraction has generated a standard field named Created, which contains the same information. File Content Extraction will also generate a Created field for other file formats that contain a creation date, so that you can handle all of the relevant formats with the same code.

    In this example output, the native metadata fields that have been standardized have the "HasStandardAlternative" property set to "true", so that you can identify them. For more information about using the metadata API and about field standardization, see the section Use the Metadata API.

  3. Open the PDF file in Adobe Reader. Go to File > Properties and compare what you see to the output from File Content Extraction.

You can now try extracting metadata from your own test files.
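To see which native fields have a standard alternative without reading the table by eye, you could parse the filtertest -m output. The following Python sketch assumes the six-column layout shown in the sample output above, with string values in double quotes and columns separated by whitespace. The function name parse_metadata is hypothetical, and values that contain quotation marks or backslashes might need different handling.

```python
import shlex

# Column names copied from the header row of the sample output.
COLUMNS = ["Name", "Key", "Type", "Data", "HasStandardAlternative", "IsSuperseded"]


def parse_metadata(text: str) -> list:
    """Parse metadata rows into a list of dicts, skipping the header row."""
    rows = []
    for line in text.splitlines():
        # shlex keeps each double-quoted value together as a single field.
        fields = shlex.split(line)
        if len(fields) != len(COLUMNS) or fields == COLUMNS:
            continue
        row = dict(zip(COLUMNS, fields))
        row["Key"] = int(row["Key"])
        row["HasStandardAlternative"] = row["HasStandardAlternative"] == "true"
        row["IsSuperseded"] = row["IsSuperseded"] == "true"
        rows.append(row)
    return rows


# Four rows copied from the tutorial output above.
sample = """\
Name    Key    Type    Data    HasStandardAlternative    IsSuperseded
"Create_DTM"    0    DateTime    "Fri Oct 21 14:21:17 2022"    true    false
"Created"    1000    DateTime    "Fri Oct 21 14:21:17 2022"    false    false
"PageCount"    0    Integer    10    true    false
"PageCount"    5000    Integer    10    false    false
"""

rows = parse_metadata(sample)
# Native fields for which a standardized alternative was generated.
with_alternative = [row["Name"] for row in rows if row["HasStandardAlternative"]]
print(with_alternative)  # prints: ['Create_DTM', 'PageCount']
```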

Text Filtering

Filtering is the extraction of text from a file, without application-specific markup. File Content Extraction can extract text from many different file formats. By default, File Content Extraction extracts visible text - the same text that you might see if you opened the file in its native application, or printed it. File Content Extraction can also extract "hidden" text, additional text that is present in a file but is not usually visible.

To try text filtering

  1. Filter the visible text from a sample PDF file, by using the filtertest sample program.

    filtertest "D:\sample-documents\KeyViewFilterSDK_12.13.0_ReleaseNotes_en.pdf" "D:\output\filter_output.txt"
  2. To view the extracted text, open the output file in a UTF-8 capable text editor. Then open the source PDF file in Adobe Reader and compare the two. The filter output should contain all of the visible text from the original document.

You can now try filtering some of your own test files.
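To filter a whole folder of test files, you could generate one filtertest command per file. The following Python sketch only builds the argument lists, using the same "filtertest input output" form as the command above. The function name build_filter_commands and the choice to append .txt to each output file name are assumptions made for illustration.

```python
from pathlib import Path


def build_filter_commands(input_dir: Path, output_dir: Path,
                          exe: str = "filtertest") -> list:
    """Return one [exe, input, output] argument list per file in input_dir."""
    commands = []
    for source in sorted(input_dir.iterdir()):
        if source.is_file():  # ignore subdirectories
            # Output naming is an assumption: the input file name plus ".txt".
            target = output_dir / (source.name + ".txt")
            commands.append([exe, str(source), str(target)])
    return commands
```

You could then run each argument list with subprocess.run from the bin directory, after making sure that the output directory exists.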

Subfile Extraction

Many file formats are "containers" - files that contain other files. Archive files such as ZIPs, and e-mail messages with attachments are the most obvious of these. However, many other file formats can contain subfiles, and File Content Extraction can help you access these subfiles.

To try subfile extraction

  1. Create a directory in which to place the extracted files.

    mkdir "D:\output\extract-dir"
  2. Use the sample program extracttest to extract subfiles from a sample file.

    extracttest "D:\sample-documents\demo_HAS_EMBEDDED_DOC.zip" "D:\output\extract-dir"
  3. Open the extract directory. You should see that there is a log file. This file shows how File Content Extraction has processed the source file and what it has extracted. The log also shows information about the subfiles, some of which was available before extraction, and some only available after extraction.

  4. Browse the contents of the extract directory. As detailed in the log file, you should see that this sample ZIP archive contained a PowerPoint file. The PowerPoint file contained embedded Word and Excel documents, and these have also been extracted.

    NOTE: Images that the native viewer shows inline with the text content of a file are not, by default, considered to be subfiles. However, you can configure File Content Extraction to extract images in the same way as other subfiles. For more information, see Extract Images.

You can now try extracttest with your own test files. Remember to delete the contents of the extract-dir folder between runs.
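If you run extracttest repeatedly, emptying the extract directory by hand becomes tedious. The following Python sketch deletes everything inside a directory while keeping the directory itself. The function name clear_directory is hypothetical, and the sketch is not part of the SDK.

```python
import shutil
from pathlib import Path


def clear_directory(directory: Path) -> int:
    """Delete all files and subfolders in directory; return how many
    top-level entries were removed. The directory itself is kept."""
    removed = 0
    for entry in directory.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subfolder and its contents
        else:
            entry.unlink()  # remove a file or a symbolic link
        removed += 1
    return removed
```

For example, clear_directory(Path(r"D:\output\extract-dir")) would empty the directory used in this tutorial. Check the path carefully before you run it, because the deletion is permanent.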

Conclusion

In this tutorial, you used the Filter SDK to perform file format detection, metadata extraction, text filtering, and subfile extraction. Next, you might like to try using these features through the API - see Getting Started.