Extract
This command extracts entities from a document. It can print the output to a file, or to the console. You can use this option to test your grammars.
The following table describes the available parameters for this command.
-l <licensefile>
|
The file containing a valid license key for Eduction. This file can include the version key, concatenated to the license key with a semicolon (;). If you do not specify a license key, edktool attempts to load the license |
-i <inputfile>
|
The file to perform entity extraction on. The input file must be a UTF-8 encoded plain text file. |
-c <configfile>
|
A configuration file controlling the extraction. See Eduction Configuration File. You can specify one or more grammar files and one or more entities in place of a configuration file. Specifying a configuration file overrides the grammar or entity parameters. |
-g <grammarfile>
|
A grammar file to use. Edktool ignores this option if you set a configuration file with -c. If you provide a grammar file but do not specify any entities with -e, edktool extracts all entities in the grammar file. You can use wildcard expressions in this parameter. See Wildcard Expressions in edktool. |
-e <entity>
|
The entities to extract. Separate multiple entities with a comma. Edktool ignores this option if you set a configuration file with -c. You can use wildcard expressions in this parameter. See Wildcard Expressions in edktool. |
-o <outputfile>
|
The file to write the results of the extraction to. The output file is an XML file that contains the matched entities. |
-q
|
(Optional) Run in quiet mode. In this case, edktool removes all descriptive messages from the output and shows the XML matchlist only (that is, an XML document with all the matches and any configured metadata). |
-r <redaction_file>
|
(Optional) The name of a copy of the input file to produce, with all matches redacted. For example: The driver ########## was questioned. |
-b
|
Read the input file in binary mode, rather than text mode. If you create a grammar file that matches entities with only Windows (CR LF) line endings and you run edktool on Windows, edktool must read the input file in binary mode for it to find any matches. OpenText recommends that you create grammar files capable of handling both Windows and Unix line endings. |
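The difference between text mode and binary mode can be illustrated outside edktool. The following Python sketch is only an analogy, not edktool's implementation: it assumes a runtime whose text mode translates CR LF to LF on read (as Python's default newline handling does), and the helper name read_both_ways is hypothetical. It shows why a pattern that expects only CR LF finds nothing once the file has been read in text mode.

```python
import os
import tempfile


def read_both_ways(data: bytes):
    """Write data to a temporary file, then read it back in text and binary mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # Text mode with default newline handling translates CR LF to LF.
        with open(path, "r", encoding="utf-8") as handle:
            as_text = handle.read()
        # Binary mode returns the bytes exactly as stored on disk.
        with open(path, "rb") as handle:
            as_binary = handle.read()
    finally:
        os.remove(path)
    return as_text, as_binary


text, raw = read_both_ways(b"Name:\r\nJoe Bloggs\r\n")
print("\r\n" in text)   # False: CR LF was translated away in text mode
print(b"\r\n" in raw)   # True: CR LF is preserved in binary mode
```

A grammar that accepts both LF and CR LF line endings matches in either mode, which is why the recommendation above avoids depending on -b.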
The extract option requires an input file (in plain text format) and either a configuration file or a grammar file. If you do not provide a configuration file, edktool searches the input file for any specified entities in the specified grammar (or all entities, if none are specified). For example, in the simplest command line:
C:\>edktool e -i myData.txt -g grammar1.ecr,grammar2.ecr
This command runs edktool without a configuration file. It processes the data file myData.txt with the grammar files grammar1.ecr and grammar2.ecr. Eduction identifies all the entities in the two grammar files, and matches on these. The output is sent to the console in XML format, identifying matches in the data file and using the entity names to generate field names for the matches that contain the matched data. It matches the entire body of the plain text input file.
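If you call edktool from a script, the rules above can be captured in a small helper. The following Python sketch is illustrative only: build_extract_command and its argument names are hypothetical and not part of Eduction. It uses only the parameters described in the table (-i, -c, -g, -e, -o, -q), joins multiple grammars and entities with commas, and leaves out -g and -e when a configuration file is given, because the configuration file overrides them.

```python
from typing import List, Optional, Sequence


def build_extract_command(
    input_file: str,
    grammars: Sequence[str] = (),
    config_file: Optional[str] = None,
    entities: Sequence[str] = (),
    output_file: Optional[str] = None,
    quiet: bool = False,
) -> List[str]:
    """Assemble an 'edktool e' argument list (hypothetical helper)."""
    if not config_file and not grammars:
        raise ValueError("extract needs a configuration file or a grammar file")
    command = ["edktool", "e", "-i", input_file]
    if config_file:
        # A configuration file overrides -g and -e, so they are left out.
        command += ["-c", config_file]
    else:
        command += ["-g", ",".join(grammars)]
        if entities:
            command += ["-e", ",".join(entities)]
    if output_file:
        command += ["-o", output_file]
    if quiet:
        command.append("-q")
    return command


args = build_extract_command("myData.txt", grammars=["grammar1.ecr", "grammar2.ecr"])
print(" ".join(args))
# prints: edktool e -i myData.txt -g grammar1.ecr,grammar2.ecr
```

You can pass the resulting list to subprocess.run, assuming edktool is on the PATH of the machine that runs the script.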
Redact Extraction Results
You can enable redaction on extracted matches in edktool either by setting RedactedOutput to True in the edktool configuration file, or by specifying a redaction file using the -r parameter at the command line.
The entities identified as matches by edktool are redacted from the input text to form the redacted output. For example:
Input:
The driver Joe Bloggs was questioned.
Output:
The driver ########## was questioned.
Eduction sends redacted output to the file specified in the -r parameter. If you do not specify this argument but you have enabled redaction in the configuration file, Eduction displays redacted output in the console after the list of matches, unless you have set the -q parameter at the command line to enable quiet mode. In quiet mode, edktool does not display redacted output in the console.
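The effect of redaction on the example above can be reproduced with a short Python sketch. This is not Eduction's implementation. It assumes that each character of a match is replaced by one # character, which is consistent with the example (the ten characters of Joe Bloggs become ten # characters), and that matches are given as character offsets. The function name redact and its parameters are hypothetical.

```python
from typing import Iterable, Tuple


def redact(text: str, spans: Iterable[Tuple[int, int]], mask: str = "#") -> str:
    """Replace every character inside each (start, end) span with the mask."""
    chars = list(text)
    for start, end in spans:
        for index in range(start, end):
            chars[index] = mask
    return "".join(chars)


sentence = "The driver Joe Bloggs was questioned."
start = sentence.index("Joe Bloggs")
print(redact(sentence, [(start, start + len("Joe Bloggs"))]))
# prints: The driver ########## was questioned.
```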
Examples
edktool e -i myPlainTextFile.txt -g myGrammar.ecr
Extracts all entities in myGrammar.ecr from myPlainTextFile.txt, sending the output to the console in XML format, with the field names for the matching text automatically generated from the entity names found in myGrammar.ecr.