Use a Compilation Configuration File
When you compile a grammar by using edktool, you can add an optional JSON configuration file to specify additional options for compilation.
Configure Character Expansions
You can configure character expansions, which cause Eduction to treat certain characters as if they were a different character. For example, you can match different varieties of punctuation characters against a standard form that you use in your grammar files.
To use character expansions, you specify an expansions array which contains a list of your expansions. Each array item has a src and dest element. The source and destination characters should be considered as a single list where any character in the list is expanded to any other. The character chosen as the "src" character is significant only because it is used in normalized matches in place of any "dest" character.
Consider the following example configuration:
{
  "expansions": [
    { "src": "a", "dest": ["b", "c"] }
  ]
}
If your grammar contains only the following pattern:
<pattern>ade</pattern>
Eduction expands the pattern to:
<pattern>[abc]de</pattern>
So if your input contains the following text:
ade bde cde dde
Eduction matches ade, bde, and cde, and produces the normalized matches ade, ade, ade.
If your grammar contains only the following pattern:
<pattern>bde</pattern>
Eduction expands the pattern to:
<pattern>[abc]de</pattern>
...which produces the same matches (ade, bde, and cde) and the same normalized matches (ade, ade, ade) as before.
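The behavior described above can be approximated with a short Python sketch. This is an illustration only, not how Eduction implements expansions: it handles literal patterns only, and the helper names (build_groups, expand_pattern, normalize) are hypothetical. Each src character and its dest characters are treated as one group, every grouped character in the pattern becomes a character class, and normalization rewrites each grouped character to the group's src character.

```python
import json
import re

def build_groups(config):
    """Map each character in an expansion to (all group members, src character)."""
    groups = {}
    for item in config.get("expansions", []):
        members = [item["src"], *item["dest"]]
        for ch in members:
            groups[ch] = (members, item["src"])
    return groups

def expand_pattern(literal, groups):
    """Build a regex in which every grouped character becomes a character class."""
    parts = []
    for ch in literal:
        if ch in groups:
            members, _src = groups[ch]
            parts.append("[" + "".join(re.escape(m) for m in members) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def normalize(match, groups):
    """Replace every grouped character with the src character of its group."""
    return "".join(groups[ch][1] if ch in groups else ch for ch in match)

# The example configuration from this section.
config = json.loads('{ "expansions": [ { "src": "a", "dest": ["b", "c"] } ] }')
groups = build_groups(config)

for literal in ("ade", "bde"):
    regex = expand_pattern(literal, groups)
    matches = re.findall(regex, "ade bde cde dde")
    print(literal, "->", regex, matches, [normalize(m, groups) for m in matches])
```

Both patterns produce the same expanded form, which is why it does not matter whether the grammar was written with the src character or with one of the dest characters.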
You could use character expansions if you have written a grammar file where the patterns contain space characters, but you also want to match non-breaking spaces or other Unicode space characters.
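As a sketch of that use case, the following Python snippet generates a configuration that uses the ordinary space as the src character and a few other space characters as dest characters. The selection of characters is an assumption made for illustration, as is the expectation that the configuration parser accepts standard JSON \u escapes; check both against your own data and tooling.

```python
import json

# Hypothetical selection of space-like characters; adjust it to your input data.
SPACE_VARIANTS = [
    "\u00a0",  # NO-BREAK SPACE
    "\u2009",  # THIN SPACE
    "\u3000",  # IDEOGRAPHIC SPACE
]

# The ordinary space is the src character, so normalized matches contain plain spaces.
config = {"expansions": [{"src": " ", "dest": SPACE_VARIANTS}]}

# ensure_ascii=True writes the non-ASCII characters as \uXXXX escapes,
# which keeps the otherwise invisible characters readable in the file.
config_text = json.dumps(config, ensure_ascii=True, indent=2)
print(config_text)
```

Writing the characters as escapes rather than literally makes it easier to review which space characters the configuration covers.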
Optimize Case-Insensitive Matching
When you have a grammar file that contains case-sensitive entities, but you want to find matches regardless of case, you can run Eduction with the parameter MatchCase=False. When Eduction loads a grammar file and MatchCase=False, it optimizes the entities for case-insensitive matching to improve run-time performance. However, this can increase the time required to initialize Eduction, so if you regularly use a grammar file with MatchCase=False, you can optimize the entities for case-insensitive matching at compile-time instead.
To optimize entities for case-insensitive matching, set the option alternativeCaseArcs to true in your compilation configuration file:
{ "alternativeCaseArcs": true }
After compiling a grammar file with this option, you can still use the MatchCase parameter to choose whether matching is case-sensitive.
To obtain the best case-insensitive performance, you should write a grammar file using only upper or lower case and then normalize the input by setting the CaseNormalization parameter. For more information, see Case Normalization. Compiling a grammar file with alternativeCaseArcs set to true is useful if you cannot easily modify your grammar file(s), but only reduces the time required to initialize Eduction (it does not reduce the time required for matching).
To recompile an existing grammar file with alternativeCaseArcs set to true, you could include the existing grammar in a new grammar file as shown in the example below, and then compile the new grammar using edktool.
<?xml version="1.0" encoding="UTF-8"?>
<grammars version="4.0">
   <include path="published/grammar.ecr" type="public"/>
</grammars>
Use the Configuration File
You add a configuration file to your compilation by setting the -c command-line option in the compile command. For more information, see Compile.
When you compile a grammar by using the Eduction SDK, you can specify the path to a compilation configuration file by using one of the following options:
- C API: the EdkLoadResourceFileWithCompileConfig and EdkLoadResourceBufferWithCompileConfig functions.
- Java API: the loadResourceFile, loadResourceFiles, and loadResourceBuffer methods in the TextExtractionEngine interface.
- .NET API: the GetCompiler method on the EDKFactory class.
For more information, refer to the API documentation.