Assess and Measure Eduction Grammars

There might be times when you want to check the effectiveness and performance of your .ECR or .XML grammars.

For example, you might want to check:

  • how effective a particular grammar file is at extracting the entities you require from your data.
  • how a change to a grammar file affects performance.

The edktool command-line tool has two features, Assess and Measure, that enable you to find out this kind of information easily (for a full reference for these functions, see Assess and Measure).

Assess

This feature takes a list of phrases that you expect to contain matches, and checks whether they do contain a match.

Alternatively, it can take a list of phrases that you do not expect to contain matches, and check that they do not contain a match.
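The logic of this check can be sketched as follows (a minimal illustration only; `find_matches` is a hypothetical stand-in for the Eduction engine, which is not shown here):

```python
def assess(phrases, find_matches, expect_match):
    """Return the phrases that did not behave as expected.

    phrases      -- lines from a valid or invalid input file
    find_matches -- hypothetical stand-in for the Eduction engine;
                    returns the matches found in a phrase
    expect_match -- True for a valid file, False for an invalid file
    """
    failures = []
    for phrase in phrases:
        matched = bool(find_matches(phrase))
        if matched != expect_match:
            failures.append(phrase)
    return failures

# Toy matcher for illustration only: "matches" any phrase containing "Smith".
def toy_matcher(phrase):
    return ["Smith"] if "Smith" in phrase else []

# A valid-file run reports the phrases in which no match was found.
print(assess(["My name is Bob Smith", "Bob up and down"], toy_matcher, True))
# → ['Bob up and down']
```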

You can use this feature to:

  • test the suitability of an Eduction grammar for the task you would like it to do.

  • monitor the accuracy of an Eduction grammar while you develop it.

  • ensure that further development does not introduce problems to an Eduction grammar that already performs well.

You can set up an assessment by using the following three-stage procedure.

Create the Input Files

Create a valid file, in which each line contains an expected match. The expected matches can come from your real data or from artificial sample data. For example:

My mate is Bob Smith
My name is Bob Smith
My mate is Bob Smith
Barbara smith
smith,benjamin
SMYTH, Robert
Dr Bob B. Smith Jr.
Bob SMITH lives here
(etc.)

You can also create a list of valid exact matches (examples of text that must be matched in their entirety).

Alternatively, you can create an invalid file, where each line must contain no match. For example:

Black Smith
Bob up and down
smith, smyth
She is called Barbara. Smith is not her surname.
Benjamins myth
(etc.)

You can also create a list of non-valid exact matches (examples of text that might or might not contain a match, but must not be matched in their entirety).

Alternatively, you can set up your input file so that it refers to matches by all available entities, or only by specific entities (for example, male_name_all).

NOTE: Eduction ignores blank lines in the input file, and lines that start with //.
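As an illustration of that rule, the following sketch filters an input file's lines in the same way (a hypothetical helper, not part of edktool):

```python
def read_assessment_lines(lines):
    """Keep only usable phrases: skip blank lines and // comments."""
    phrases = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        phrases.append(line.rstrip("\n"))
    return phrases

sample = [
    "// names that must match",
    "",
    "My name is Bob Smith",
    "SMYTH, Robert",
]
print(read_assessment_lines(sample))
# → ['My name is Bob Smith', 'SMYTH, Robert']
```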

Update the Configuration File

You can run an extraction from the command line without a configuration file, but in most cases it is easier to use one.

You must add the assessment sections to an Eduction configuration file that would otherwise run a successful extraction. Each assessment section must specify a valid input file, an invalid input file, or both. You must number multiple sections consecutively, starting from 0 (zero).

[assessment0]
Valid=my_valid_1.txt

[assessment1]
Valid=my_valid_2.txt
Invalid=my_invalid_1.txt
Exact=True
Entities=my_entity,my_other_entity

[assessment2]
Invalid=my_filename.txt
Exact=False
Entities=my_other_entity

To require exact matches, set the Exact parameter to True.

To restrict the extraction to a subset of the available entities, you can:

  • set the Entities parameter in the assessment section to a comma-separated list of the entities you want to extract.

  • set the ResourceFiles parameter in the standard CFG configuration file.
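Because the section numbering must be consecutive from 0, a small script can sanity-check a configuration's section names before a run (a hedged sketch; the parsing here is simplified and is not edktool's own validation):

```python
import re

def check_assessment_numbering(section_names):
    """True if the [assessmentN] sections are numbered consecutively
    starting from 0 (zero), as edktool requires."""
    numbers = sorted(
        int(m.group(1))
        for name in section_names
        if (m := re.fullmatch(r"assessment(\d+)", name))
    )
    return numbers == list(range(len(numbers)))

print(check_assessment_numbering(["assessment0", "assessment1", "assessment2"]))  # → True
print(check_assessment_numbering(["assessment0", "assessment2"]))                 # → False
```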

Run the Assessment

To run the assessment, open a command prompt in the edktool directory and type:

edktool a -l <license> -c <config> [-o <output>] [-a]

Alternatively, to run a single assessment section from the command line without a configuration file, type:

edktool a -l <license> -g <grammars> [-e <entities_for_extraction>] [-x] [-o <output>] [-a] [-m <entities_for_matching>] [-v <valid_input> | -w <invalid_input>]

The following table lists the command-line parameters that you can use, and their functions.

Parameter   Function
-x          Equivalent to setting Exact=True in the configuration file.
-m          Equivalent to setting the Entities parameter in the assessment section of the configuration file.
-v          Equivalent to setting the Valid parameter in the configuration file.
-w          Equivalent to setting the Invalid parameter in the configuration file.
-a          Includes all examples in the results, including those that passed the assessment.
-o          Sends the results to the specified file, instead of to the console (the default).

Use the Assessment Results

The assessment results contain:

  • a list of all the examples that did not behave as expected.

  • all relevant statistics (some statistics are relevant only if you specified both a valid file and an invalid file).

If all the tests in the assessment pass successfully, the grammar file is working as expected on the data you have given it. In practice, there are usually some failures.

Often there are common themes running through the failures. Perhaps the grammar file only matches text with certain capitalization, or where the data is written in a certain format. This can provide immediate information on how to fix the issues in cases where you have access to the XML source.

When there is a failure that you do not understand, you might find it useful to expand the valid or invalid input file with a selection of similar examples, perhaps with different words, formats, capitalization, or punctuation. After you rerun the assessment with the expanded data, the cause of the problem might become much clearer.

Statistics

The results include statistics for recall (if a valid file is present) and true negative rate (if an invalid file is present). If both types of file are present in a single assessment section, it also includes statistics for accuracy, precision, and F1 score (see Results Relevance).

The statistics depend on the data provided in the valid or invalid files. In all cases, the statistic falls in the range 0.0000 to 1.0000; a higher score represents a more successful grammar file.
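These statistics follow the standard definitions based on the counts of true and false positives and negatives; the arithmetic can be sketched as follows (an illustration of the standard formulas, not edktool's internal code):

```python
def assessment_statistics(tp, fp, tn, fn):
    """Standard definitions of the reported statistics.

    tp/fn -- correct and missed matches from the valid file
    tn/fp -- correct non-matches and false matches from the invalid file
    """
    recall = tp / (tp + fn)              # needs a valid file
    true_negative_rate = tn / (tn + fp)  # needs an invalid file
    precision = tp / (tp + fp)           # the rest need both files
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "true_negative_rate": true_negative_rate,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

# 8 of 10 valid phrases matched; 9 of 10 invalid phrases correctly unmatched.
stats = assessment_statistics(tp=8, fp=1, tn=9, fn=2)
print(round(stats["recall"], 4), round(stats["f1"], 4))  # → 0.8 0.8421
```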

After your initial assessment run, if you make modifications to the grammar file you can use these statistics to compare the performance of the old and new grammars.

NOTE: If you make improvements to a grammar file based on the results of your assessments, those improvements are targeted at the data you provided. The statistics are therefore likely to be inflated compared to the statistics you would obtain for more general data.

You can use an assessment in which all examples pass as a basis for further development. If the statistics in a later assessment drop below 1.0000, you know that the development has introduced a problem into the grammar file.

Measure

This feature compares the matches found in separate extraction runs on the same input data. You can use it to view the differences between the two extraction runs. The results are in XML format, and include the match metadata.

You can use this feature to:

  • monitor the improvement of a grammar file that originally returned too many false matches.

  • compare the results of extraction by different versions of a grammar file on the same data to test whether modifications are beneficial.

To set up this feature:

  1. Run an extraction using one version of the grammar file. Your input data must be in plain text format.

  2. Save the output.

  3. (Optional) Remove any false matches from the output by deleting the three XML lines corresponding to the false match. This step is appropriate only if you want to form a benchmark output file with the aim of developing the grammar file to produce output very similar to the benchmark.

    You can also modify existing matches by making them shorter or longer, or you can add entirely new matches that should be found in the input text.

    NOTE: You must ensure that you specify the correct offset (in bytes) when you modify or add matches.

  4. Run an extraction on the same plain text input as before, using the newer version of the grammar file. Save the output under a different file name.
  5. At the command line, run the Measure command: 

    edktool m -e <expected_file> -a <actual_file> [-o <output_file>]
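If you hand-edit the expected file (step 3), remember that offsets are counted in bytes, not characters, which matters as soon as the text contains non-ASCII characters. One way to compute a byte offset (a sketch, assuming UTF-8 input):

```python
def byte_offset(text, match):
    """Byte offset of the first occurrence of match, assuming UTF-8.
    Byte and character offsets differ once non-ASCII text appears."""
    char_index = text.index(match)
    return len(text[:char_index].encode("utf-8"))

text = "Señor Bob Smith lives here"
print(byte_offset(text, "Bob Smith"))  # → 7 (not 6: ñ occupies two bytes)
```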

Use the Measure Results

The output is an XML document which lists all the differences between the output of <expected_file> and the output of <actual_file>. You can use the list of differences to monitor how changes to the XML source affect the output on real data. You can then decide which changes are beneficial, and which are not.
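Conceptually, the comparison is a diff of two match sets; the following simplified sketch illustrates the idea (matches represented here as (offset, text) pairs rather than edktool's actual XML output):

```python
def diff_matches(expected, actual):
    """Report matches missing from the actual output, and unexpected
    matches that the benchmark does not contain."""
    expected, actual = set(expected), set(actual)
    return {
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }

benchmark = [(11, "Bob Smith"), (30, "Barbara Smith")]
latest = [(11, "Bob Smith"), (52, "Black Smith")]
print(diff_matches(benchmark, latest))
# → {'missing': [(30, 'Barbara Smith')], 'unexpected': [(52, 'Black Smith')]}
```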

Statistics

The Measure results include statistics for precision and recall, although these are relevant only when the expected file is a benchmark, and the aim is to produce a grammar file whose output is as close as possible to this benchmark.