filter_document
Filtering is the extraction of text from a document. This sample program uses the filter API to do that.
The program takes two positional arguments:
- an input file
- an output text file
By default, the output is encoded in UTF-8.
$ ./filter_document input_file output.txt
IMPORTANT: Not all document formats can be filtered. For example, trying to filter a PNG file produces an error message. For some file formats (notably emails), File Content Extraction treats the text as an embedded subfile that you should access by using the extraction API, not the filter API.
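The command-line contract described above (two positional arguments, UTF-8 output, an error message on failure) can be sketched roughly as follows. This is only an illustration of the program's shape, not the sample's actual source: `placeholder_filter` is a hypothetical stand-in that simply decodes bytes, and it does not call or mirror the real filter API. The names `filter_document`, `main`, and `output_file` are likewise assumptions made for the example.

```python
import argparse
import sys
from pathlib import Path
from typing import Callable, List, Optional


def placeholder_filter(data: bytes) -> str:
    """Hypothetical stand-in for the SDK's filter call (NOT the real API).

    It only decodes the input as UTF-8, replacing undecodable bytes.
    """
    return data.decode("utf-8", errors="replace")


def filter_document(
    input_path: str,
    output_path: str,
    filter_func: Callable[[bytes], str] = placeholder_filter,
) -> None:
    """Read the input file, extract its text, and write the text as UTF-8."""
    text = filter_func(Path(input_path).read_bytes())
    Path(output_path).write_text(text, encoding="utf-8")


def main(argv: Optional[List[str]] = None) -> int:
    """Parse the two positional arguments and run the filter step."""
    parser = argparse.ArgumentParser(prog="filter_document")
    parser.add_argument("input_file", help="document to filter")
    parser.add_argument("output_file", help="text file to write (UTF-8)")
    args = parser.parse_args(argv)
    try:
        filter_document(args.input_file, args.output_file)
    except OSError as exc:
        # Report the failure on stderr and signal it with a non-zero status.
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0
```

To use the sketch as a script, call `sys.exit(main())` from the entry point. Passing the filter step in as `filter_func` keeps the argument handling and file I/O separate from the extraction call, so a real binding could be substituted without changing the rest.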