Explore Filter SDK Features
You can use File Content Extraction by calling it from your own applications through one of its APIs. However, to help you get started, the SDK includes some non-production test utilities which allow you to use File Content Extraction from the command line and explore its functionality.
This section is an introductory tutorial that helps you to explore the key features of the Filter SDK, using the out-of-the-box command-line test utilities filtertest and extracttest.
Download and Extract the Filter SDK
Download the Filter SDK from the Software Licensing and Downloads portal. Extract the zip file to a folder of your choice, which this tutorial refers to as %KEYVIEW_HOME%.
- Go to the Software Licensing and Downloads portal.
- On the Downloads tab, select your product, product name, and version from the drop-down menus.
- From the list of available files, select and download the following files:
  - KeyviewFilterSDK_25.1.0_PLATFORM.zip, where PLATFORM is your software platform. For example, KeyviewFilterSDK_25.1.0_WINDOWS_X86_64.zip.
  - KeyviewFilterSDK_25.1.0_Documentation.zip
- Extract the SDK to a folder of your choice.
On Windows, you might need to install the included Visual C++ Redistributable package. In the vcredist folder of the SDK, right-click vcredist_2019.exe and then click Run as administrator.
License Key
File Content Extraction requires a license key, which is unique to your project. The test programs have an embedded trial license, which expires approximately five months after release, so you can follow this tutorial using the embedded license.
To use your own license, obtain a license key from the Entitlements tab of the Software Licensing and Downloads portal. Then set the environment variable KV_SAMPLE_PROGRAM_LICENSE_FROM_FILEPATH to the path of a text file containing only your license key. The test programs read the environment variable and obtain your license from the specified file.
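If you drive the test programs from a script rather than an interactive prompt, you can set the variable for the child process only. The following is a minimal sketch, not part of the SDK: the helper name env_with_license and the license file path are hypothetical, and only the environment variable name comes from this tutorial.

```python
import os

# Environment variable named in this tutorial; the test programs read it.
LICENSE_ENV_VAR = "KV_SAMPLE_PROGRAM_LICENSE_FROM_FILEPATH"


def env_with_license(license_file, base_env=None):
    """Return a copy of the environment that points at license_file.

    license_file is assumed to be a text file containing only the license key.
    The current process environment is left unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[LICENSE_ENV_VAR] = os.path.abspath(license_file)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when you launch filtertest or extracttest, so that the setting does not leak into other programs.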
Sample Documents
Sample documents for this tutorial are available in an OpenText GitHub repository. Download these sample documents to a suitable directory (this tutorial assumes D:\sample-documents\).
Move the Test Programs
To run the included test programs, the programs must be located in the File Content Extraction bin directory. To simplify packaging, the programs are not located there by default (you can distribute the entire bin folder with your application, but you should not include the test programs). On Windows, for example, copy them into place as follows:
cd %KEYVIEW_HOME%\WINDOWS_X86_64
copy test\*.exe bin
Format Detection
The first feature that we will explore is file format detection. File Content Extraction can identify more than 2000 different file formats. Format detection is performed by analyzing the content of a file and does not use file extensions, because file extensions can be unreliable and in some cases are used by multiple formats.
To try format detection
- Use the filtertest -ah command:
  cd %KEYVIEW_HOME%\WINDOWS_X86_64\bin
  filtertest -ah "D:\sample-documents\KeyViewFilterSDK_12.13.0_ReleaseNotes_en.pdf" "D:\output\format_detection.txt"
- Open the output file in a text editor to see the results:
  File Class: 1
  Format Number: 230
  Version: 1400
  Attributes: 0
  Description: Adobe PDF (Portable Document Format)
  MIME Type: application/pdf
  File Content Extraction correctly identified this file as Format Number: 230, which is an Adobe PDF file. Version: 1400 refers to PDF version 1.4. File Class: 1 refers to the adWORDPROCESSOR file class. For more information about format numbers and file classes, see File Formats.
  NOTE: The class and format ID assignment scheme was created for File Content Extraction. When applicable, the File Formats documentation notes the MIME type, but not all file formats have MIME types.
You can now try format detection with your own test files.
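If you want to check the detection results for many files, it can help to read the output file programmatically. The following is a minimal sketch that assumes the output always uses the "Name: Value" layout shown above; the function name parse_format_report is hypothetical and is not part of the SDK.

```python
def parse_format_report(text):
    """Parse 'Name: Value' lines from a filtertest -ah output file into a dict.

    Values are kept as strings. Lines without a colon are ignored.
    """
    report = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            report[key.strip()] = value.strip()
    return report


# Sample text copied from the tutorial output above.
sample = """File Class: 1
Format Number: 230
Version: 1400
Attributes: 0
Description: Adobe PDF (Portable Document Format)
MIME Type: application/pdf"""

report = parse_format_report(sample)
print(report["Format Number"], report["MIME Type"])  # prints: 230 application/pdf
```

To use this with a real result, read the file that filtertest wrote (for example D:\output\format_detection.txt) and pass its contents to the function.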
Metadata Extraction
Documents can contain different types of metadata. For example, a document might have a Title and an Author, an image might have a width and a height, and so on. File formats store metadata in many different ways, including standard mechanisms like XMP, or by using something format-specific. File Content Extraction reports all types of metadata through a common interface, so that you can use the same method to obtain it, regardless of the underlying storage mechanism.
To try metadata extraction
- Use the filtertest -m command. For example:
  filtertest -m "D:\sample-documents\KeyViewFilterSDK_12.13.0_ReleaseNotes_en.pdf" "D:\output\metadata.txt"
- To view the extracted metadata, open the output file in a UTF-8 capable text editor.
  Name Key Type Data HasStandardAlternative IsSuperseded
  "Title" 0 String "IDOL KeyView Filter SDK 12.13.0 Release Notes" true false
  "Title" 4000 String "IDOL KeyView Filter SDK 12.13.0 Release Notes" false false
  "Author" 0 String "Micro Focus" true false
  "Author" 2000 String "Micro Focus" false false
  "Create_DTM" 0 DateTime "Fri Oct 21 14:21:17 2022" true false
  "Created" 1000 DateTime "Fri Oct 21 14:21:17 2022" false false
  "LastSave_DTM" 0 DateTime "Fri Oct 21 14:21:17 2022" true false
  "Modified" 1001 DateTime "Fri Oct 21 14:21:17 2022" false false
  "PageCount" 0 Integer 10 true false
  "PageCount" 5000 Integer 10 false false
  "AppName" 0 String "madbuild" true false
  "Application" 2001 String "madbuild" false false
  File Content Extraction has successfully extracted metadata from the document, including its title, author, creation date, page count, and so on.
  This output demonstrates a useful feature, called field standardization. It might appear that there are duplicate pieces of metadata, but this is by design and can help you to handle multiple file formats without needing to write specialized code for each one. Field standardization standardizes metadata key names so that the same type of metadata can be accessed in a consistent way regardless of the file format. For example, the PDF file we processed had a native field named Create_DTM, containing the date that the document was created. File Content Extraction has generated a standard field named Created, which contains the same information. File Content Extraction will also generate a Created field for other file formats that contain a creation date, so that you can handle all of the relevant formats with the same code.
  In this example output, the native metadata fields that have been standardized have the "HasStandardAlternative" property set to "true", so that you can identify them. For more information about using the metadata API and about field standardization, see the section Use the Metadata API.
- Open the PDF file in Adobe Reader. Go to File > Properties and compare what you see to the output from File Content Extraction.
You can now try extracting metadata from your own test files.
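To see how field standardization can simplify downstream code, the following sketch parses the filtertest -m output and keeps only the rows whose HasStandardAlternative value is false. It assumes that every row has the six whitespace-separated columns shown above, with string values in double quotes. The names MetadataField and parse_metadata_report are hypothetical; they are not part of the SDK.

```python
import shlex
from typing import NamedTuple


class MetadataField(NamedTuple):
    name: str
    key: int
    type: str
    data: str
    has_standard_alternative: bool
    is_superseded: bool


def parse_metadata_report(text):
    """Parse rows of a filtertest -m output file into MetadataField tuples."""
    fields = []
    for line in text.splitlines():
        # shlex.split keeps quoted values together. It treats backslashes as
        # escapes and raises ValueError on unbalanced quotes, so real-world
        # data might need a more careful parser.
        parts = shlex.split(line)
        # Skip the header row and any line that does not have six columns.
        if len(parts) != 6 or not parts[1].lstrip("-").isdigit():
            continue
        name, key, type_, data, has_alt, superseded = parts
        fields.append(MetadataField(
            name, int(key), type_, data, has_alt == "true", superseded == "true"))
    return fields


# A few rows copied from the tutorial output above.
sample = '''Name Key Type Data HasStandardAlternative IsSuperseded
"Author" 0 String "Micro Focus" true false
"Author" 2000 String "Micro Focus" false false
"Create_DTM" 0 DateTime "Fri Oct 21 14:21:17 2022" true false
"Created" 1000 DateTime "Fri Oct 21 14:21:17 2022" false false
"PageCount" 0 Integer 10 true false
"PageCount" 5000 Integer 10 false false'''

fields = parse_metadata_report(sample)
# Drop native fields that have a standard alternative; what remains are the
# standardized fields plus any native fields that have no standard equivalent.
preferred = {f.name: f.data for f in fields if not f.has_standard_alternative}
print(preferred)
```

With the sample rows, the result contains Author, Created, and PageCount but not Create_DTM, which illustrates how code that reads the standard names does not need to know the native PDF field names.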
Text Filtering
Filtering is the extraction of text from a file, without application-specific markup. File Content Extraction can extract text from many different file formats. By default, File Content Extraction extracts visible text - the same text that you might see if you opened the file in its native application, or printed it. File Content Extraction can also extract "hidden" text, additional text that is present in a file but is not usually visible.
To try text filtering
- Filter the visible text from a sample PDF file by using the filtertest sample program.
  filtertest "D:\sample-documents\KeyViewFilterSDK_12.13.0_ReleaseNotes_en.pdf" "D:\output\filter_output.txt"
- To view the extracted text, open the output file in a UTF-8 capable text editor. Open the source PDF file in Adobe Reader. You can see that the filter output contains all of the visible text from the original document.
You can now try filtering some of your own test files.
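When you have a whole folder of test files, a small script can run the same command for each one. The following is a minimal sketch under stated assumptions: the helper names plan_filter_jobs and run_filtertest are hypothetical, the output naming scheme (input name plus .txt) is an arbitrary choice, and the script assumes that filtertest can be found by name (for example, because you run it from the bin directory). Only the "filtertest input output" command form comes from this tutorial.

```python
import subprocess
from pathlib import Path


def plan_filter_jobs(input_dir, output_dir):
    """Pair each file in input_dir with a .txt output path in output_dir."""
    output_dir = Path(output_dir)
    return [
        (src, output_dir / (src.name + ".txt"))
        for src in sorted(Path(input_dir).iterdir())
        if src.is_file()  # ignore subdirectories
    ]


def run_filtertest(src, dst, exe="filtertest"):
    """Run 'filtertest <input> <output>' and return the CompletedProcess.

    The meaning of the exit code is not described in this tutorial, so the
    caller should inspect the result rather than rely on an exception.
    """
    return subprocess.run([exe, str(src), str(dst)],
                          capture_output=True, text=True)
```

A typical loop calls plan_filter_jobs(r"D:\sample-documents", r"D:\output") and then run_filtertest(src, dst) for each pair, printing the return code and any console output for files that you want to investigate.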
Subfile Extraction
Many file formats are "containers", that is, files that contain other files. Archive files such as ZIPs, and e-mail messages with attachments, are the most obvious of these. However, many other file formats can contain subfiles, and File Content Extraction can help you access these subfiles.
To try subfile extraction
- Create a directory in which to place the extracted files.
  mkdir "D:\output\extract-dir"
- Use the sample program extracttest to extract subfiles from a sample file.
  extracttest "D:\sample-documents\demo_HAS_EMBEDDED_DOC.zip" "D:\output\extract-dir"
- Open the extract directory. You should see that there is a log file. This file shows how File Content Extraction has processed the source file and what it has extracted. The log also shows information about the subfiles, some of which was available before extraction, and some only available after extraction.
- Browse the contents of the extract directory. As detailed in the log file, you should see that this sample ZIP archive contained a PowerPoint file. The PowerPoint file contained embedded Word and Excel documents, and these have also been extracted.
NOTE: Images that the native viewer shows inline with the text content of a file are not, by default, considered to be subfiles. However, you can configure File Content Extraction to extract images in the same way as other subfiles. For more information, see Extract Images.
You can now try extracttest with your own test files. Remember to delete the contents of the extract-dir folder between each iteration.
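If you repeat the extraction many times, you can script the cleanup step. The following is a minimal sketch; the helper name reset_extract_dir is hypothetical and is not part of the SDK. It deletes the directory recursively, so point it only at a scratch directory such as the extract-dir folder used in this tutorial.

```python
import shutil
from pathlib import Path


def reset_extract_dir(path):
    """Delete everything under path (if it exists) and recreate it empty."""
    path = Path(path)
    if path.exists():
        shutil.rmtree(path)  # removes previously extracted files and subfolders
    path.mkdir(parents=True)
    return path
```

Call reset_extract_dir(r"D:\output\extract-dir") before each extracttest run, so that the log file and subfiles you inspect always belong to the most recent source file.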
Conclusion
In this tutorial, you used the Filter SDK to perform file format detection, metadata extraction, text filtering, and subfile extraction. Next, you might like to try using these features through the API - see Getting Started.