Grammar Files
A grammar file defines one or more entities that you want to extract. There are two kinds of grammar file:
- Standard grammars. Eduction includes a collection of grammar files covering common entities such as names, social security numbers, postal addresses, telephone numbers, and so on. For a complete list of standard grammars, see Standard Grammars.
Standard grammar files are licensed by category and by language, so that you can have a license for any combination of category (for example, sentiment, place, or person) and language.
- User grammars. You can extend the capabilities of Eduction by writing your own grammar files, either from scratch or by referencing existing entities.
NOTE: To reference the standard grammars in your own grammar files, you must have an appropriate license.
You might want to extend or write a grammar if you have specialist entities or values that the standard grammars do not match. These might be new values, or you can create grammars that combine standard entities into more complicated matches.
TIP: Before you extend a grammar, raise the issue with your OpenText support contact. The entity that you want to detect might be supported in an upcoming release, or OpenText might be able to add support in future. Using an official grammar means you do not have to maintain it.
Grammar files are created in XML format, and can be compiled into the proprietary ECR format. Compiling a grammar file into the ECR format makes it much faster to load at runtime.
Most of the standard grammar files are available only in ECR format. However, the Eduction package also includes several XML source grammars so that you can easily extend the standard grammars (see Standard Grammar – Source). You can compile these, and your own user grammars, by using the edktool command-line tool.
NOTE: In Eduction version 12.7 and later, the standard grammars are in compressed ECR format, and edktool compiles grammars to compressed ECR format. You can still use existing uncompressed grammars from previous versions.
NOTE: Eduction can also use XML grammar files directly (that is, without compiling them to ECR files). However, in most cases OpenText recommends that you compile your grammars to improve performance.
There are two main ways to define entities:
- Use a dictionary of possible matches, for example to extract names of people or places.
- Use regular expressions (regex) to specify what a match looks like without having to list each possibility, for example to extract dates and times, or telephone numbers, which conform to a known pattern.
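The difference between the two approaches can be sketched in plain Python. This is an illustration of the idea only, not Eduction grammar syntax or the Eduction API: the names PLACES, DATE_PATTERN, match_dictionary, and match_pattern are hypothetical, and the date format is an assumption chosen for the example.

```python
import re

# Hypothetical dictionary entity: every possible match is listed explicitly.
PLACES = {"London", "Paris", "New York"}

# Hypothetical pattern entity: describes the shape of a match (YYYY-MM-DD)
# without listing every possible date.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def match_dictionary(text, entries):
    """Return the dictionary entries that occur in the text as whole words or phrases."""
    found = []
    for entry in sorted(entries):
        if re.search(r"\b" + re.escape(entry) + r"\b", text):
            found.append(entry)
    return found


def match_pattern(text, pattern):
    """Return every substring of the text that conforms to the pattern."""
    return pattern.findall(text)


text = "The meeting in New York moved from 2024-03-01 to 2024-03-15, then to Paris."
places = match_dictionary(text, PLACES)
dates = match_pattern(text, DATE_PATTERN)
print(places)
print(dates)
```

A dictionary suits a closed set of values that follow no pattern, such as names. A regular expression suits an open set of values that share a known shape, such as dates.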
You can define entities recursively, and rules can refer to entities in other grammar files. This allows you to create more complicated entities that match data such as URLs or postal addresses.
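The idea of building a complicated entity from simpler ones can also be sketched in Python. Again, this is not Eduction grammar syntax: the {name} reference notation, the ENTITIES table, and the expand function are assumptions made for this illustration, and the URL pattern is deliberately simplified.

```python
import re

# Hypothetical building-block entities, each defined once.
# A reference to another entity is written as {name}.
ENTITIES = {
    "scheme": r"https?",
    "label": r"[a-z0-9-]+",
    "host": r"{label}(?:\.{label})+",
    "path": r"(?:/[\w.-]*)*",
    "url": r"{scheme}://{host}{path}",
}


def expand(name, entities):
    """Recursively replace {entity} references with their definitions.

    Cyclic references are not handled in this sketch.
    """
    def substitute(match):
        # Wrap each expansion in a non-capturing group so that
        # quantifiers apply to the whole referenced entity.
        return "(?:" + expand(match.group(1), entities) + ")"

    return re.sub(r"\{([a-z_]+)\}", substitute, entities[name])


url_pattern = re.compile(expand("url", ENTITIES))
text = "See https://example.com/docs/index.html and http://a.b for details."
urls = url_pattern.findall(text)
print(urls)
```

Because each building block is defined in one place, a correction to a low-level entity such as label carries through to every entity that refers to it.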