Example Scripts

Eduction includes the following example post-processing scripts.

Checksum Validation

The checksum_luhn.lua script verifies the checksum digit of each match using the Luhn algorithm, and reduces the score associated with the match if the checksum is wrong. The checksum_luhn_enmasse.lua script performs checksum validation as an en masse processing task, discards incorrect matches, and alters the score of correct matches to equal the proportion of matches that have the correct checksum digit.

You can use these scripts with the number_cc.ecr and number_sin_ca.ecr grammar files to validate most credit card numbers and Canadian Social Insurance Numbers.
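The following Python sketch illustrates the Luhn check and the en masse scoring rule described above. It is not the code in checksum_luhn.lua or checksum_luhn_enmasse.lua; the function names luhn_valid and enmasse_scores are hypothetical and exist only for this illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


def enmasse_scores(matches):
    """Discard invalid matches; score the rest by the proportion that were valid.

    Hypothetical helper: returns a dict that maps each valid match to its score.
    """
    if not matches:
        return {}
    valid = [m for m in matches if luhn_valid(m)]
    proportion = len(valid) / len(matches)
    return {m: proportion for m in valid}
```

For example, luhn_valid("4111 1111 1111 1111") accepts the widely used Visa test number, and a batch in which half of the matches fail the check leaves each surviving match with a score of 0.5.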

Spanish Identity Card Number Validation

You can use the checksum_dni_es.lua script with the number_dni_es.ecr grammar file to validate Spanish Documento Nacional de Identidad (national identity card) numbers.
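As an illustration of the kind of check involved, the following Python sketch validates the DNI control letter: the eight-digit number modulo 23 selects a letter from a fixed 23-letter sequence. This is not the code in checksum_dni_es.lua, and the function name dni_valid is hypothetical.

```python
import re

# Control letters indexed by (number mod 23).
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def dni_valid(dni: str) -> bool:
    """Return True if `dni` is eight digits followed by the correct control letter."""
    cleaned = dni.replace("-", "").replace(" ", "").upper()
    match = re.fullmatch(r"([0-9]{8})([A-Z])", cleaned)
    if match is None:
        return False
    number, letter = match.groups()
    return DNI_LETTERS[int(number) % 23] == letter
```

For example, 12345678 modulo 23 is 14, which selects Z, so 12345678Z passes and 12345678A does not.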

Dutch Citizen Service Number Validation

You can use the checksum_bsn_nl.lua script with the number_bsn_nl.ecr grammar file to validate Dutch Citizen Service Numbers (Burgerservicenummer, or BSNs).
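BSNs are checked with a variant of the "11 test": the nine digits are weighted 9, 8, 7, 6, 5, 4, 3, 2, and -1, and the weighted sum must be divisible by 11. The following Python sketch shows that calculation. It is not the code in checksum_bsn_nl.lua, the function name bsn_valid is hypothetical, and it assumes a nine-digit number with optional spaces or dots as separators.

```python
BSN_WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)


def bsn_valid(bsn: str) -> bool:
    """Return True if `bsn` has nine digits and passes the BSN 11 test."""
    cleaned = bsn.replace(" ", "").replace(".", "")
    if len(cleaned) != 9 or any(ch not in "0123456789" for ch in cleaned):
        return False
    total = sum(weight * int(ch) for weight, ch in zip(BSN_WEIGHTS, cleaned))
    return total % 11 == 0
```

For example, 111222333 gives a weighted sum of 66, which is divisible by 11, so it passes the check.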

Geographical Coordinate Standardization

You can use the lat_long.lua script with the place_lat_long.ecr grammar file to convert and standardize the output of geographical coordinates.
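One common form of standardization is converting degrees, minutes, and seconds to signed decimal degrees. The following Python sketch shows that conversion. The output format of lat_long.lua is not documented here, so the six-decimal "latitude,longitude" string and the function names are assumptions made for illustration only.

```python
def dms_to_decimal(degrees: int, minutes: int, seconds: float, hemisphere: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in the usual signed convention.
    return -value if hemisphere in ("S", "W") else value


def standardize(lat, lon) -> str:
    """Format a (deg, min, sec, hemisphere) pair for latitude and longitude."""
    return f"{dms_to_decimal(*lat):.6f},{dms_to_decimal(*lon):.6f}"
```

For example, standardize((40, 26, 46, "N"), (79, 58, 56, "W")) returns "40.446111,-79.982222".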

Date and Time Standardization

You can use the datetime.lua script with the datetime_advanced_eng.ecr grammar file to convert English dates and times (and ranges) that match in several different formats into a single standard format. For example, you can convert both 23/11/13 and Nov 23 2013 to 2013-11-23.
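The idea of mapping several input formats to one output format can be sketched in Python as follows. This is not how datetime.lua is implemented; the list of candidate formats and the function name normalize_date are assumptions, and the month-name format relies on an English locale.

```python
from datetime import datetime

# Candidate input formats, tried in order (an illustrative subset only).
CANDIDATE_FORMATS = ("%d/%m/%y", "%b %d %Y", "%Y-%m-%d")


def normalize_date(text: str):
    """Return `text` as an ISO YYYY-MM-DD string, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

With these formats, both normalize_date("23/11/13") and normalize_date("Nov 23 2013") return "2013-11-23".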

The datetime_advanced_eng.ecr grammar file can understand English natural language dates, and relative dates such as last Saturday morning. You can optionally provide a reference date for <today> in the Lua script to customize normalization of relative dates into standard formats, by using the following user parameter:

refdate

A date in ISO 8601 YYYY-MM-DD format.

If you do not set refdate, Eduction uses the current date as <today>.
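The effect of a reference date on a relative expression can be sketched in Python as follows. The function name last_weekday is hypothetical, and the sketch assumes that "last Saturday" means the most recent Saturday strictly before the reference date; the script's own interpretation may differ.

```python
from datetime import date, timedelta
from typing import Optional

SATURDAY = 5  # date.weekday() numbers Monday as 0 and Sunday as 6


def last_weekday(weekday: int, refdate: Optional[str] = None) -> str:
    """Resolve "last <weekday>" against refdate (ISO string) or the current date."""
    today = date.fromisoformat(refdate) if refdate else date.today()
    # Days to step back; a result of 0 means "today", so go back a full week.
    days_back = (today.weekday() - weekday) % 7 or 7
    return (today - timedelta(days=days_back)).isoformat()
```

For example, with a refdate of 2013-11-27 (a Wednesday), last_weekday(SATURDAY, "2013-11-27") returns "2013-11-23". Without a refdate, the result depends on the day the code runs.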

For date and time range matches, this script sets the normalized text to <start>/<end>, and additionally adds STARTPOINT and ENDPOINT components that contain the associated dates or times. When there is a multiple date match (for example, 5th and 8th July matches as 5th July and 8th July), the script returns a comma-separated list, with a POINT component for each date.
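The shape of these results can be sketched in Python as follows. The dictionary layout and function names are hypothetical and do not reflect the Eduction API, the comma separator without a following space is an assumption, and the year in the usage example is arbitrary.

```python
def normalize_range(start: str, end: str) -> dict:
    """Build the normalized text and components for a range match."""
    return {
        "text": f"{start}/{end}",
        "components": [("STARTPOINT", start), ("ENDPOINT", end)],
    }


def normalize_multiple(points) -> dict:
    """Build the normalized text and components for a multiple date match."""
    return {
        "text": ",".join(points),
        "components": [("POINT", point) for point in points],
    }
```

For example, normalize_multiple(["2013-07-05", "2013-07-08"]) produces the text "2013-07-05,2013-07-08" with two POINT components.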

For contextual date matches (such as two days after), the script includes the following optional parameters, which allow you to discard matches, or reduce their score, where the closest contextual date is too far away from the match:

contextmaxdistance

Discard contextual matches that occur more than this many characters from the last date match.

contextpenaltydistance

Reduce the score for matches that lie between this distance and the maximum cut-off (contextmaxdistance). This option applies a linear reduction, which scales to zero at the maximum distance.
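The interaction of the two distance parameters can be sketched in Python as follows. The function name contextual_score is hypothetical, and the sketch assumes that the score is unchanged up to contextpenaltydistance and that a match at exactly contextmaxdistance is kept with a score of zero rather than discarded.

```python
def contextual_score(score: float, distance: int,
                     contextmaxdistance: int, contextpenaltydistance: int):
    """Return the adjusted score, or None if the match is discarded."""
    if distance > contextmaxdistance:
        return None  # too far from the last date match
    if distance <= contextpenaltydistance:
        return score  # close enough: no penalty
    # Linear reduction from full score at the penalty distance to zero at the maximum.
    span = contextmaxdistance - contextpenaltydistance
    return score * (contextmaxdistance - distance) / span
```

For example, with contextpenaltydistance set to 100 and contextmaxdistance set to 200, a match 150 characters away keeps half of its score.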

Filter Matches by Case

You can use the case_filter.lua example script to filter out matches by case, for example in personal name grammars.

To use this script, you must set MatchCase to False for the grammar. The script filters out any match that is not one of:

  • an exact match as specified in the grammar.
  • an upper case match (for example, JANE SMITH).
  • a title case match (for example, Jane Smith).

NOTE: You might need to update this script to include case mappings for uncommon non-ASCII characters. The script provides sample mappings for common Latin characters with diacritics.
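The three acceptance rules can be sketched in Python as follows. This is not the code in case_filter.lua; the function names are hypothetical, and title case is assumed to mean that each space-separated word starts with an upper case letter followed by lower case letters. Python's string methods already handle most non-ASCII case mappings, so this sketch needs no explicit mapping table.

```python
def title_case(text: str) -> str:
    """Upper-case the first letter of each space-separated word, lower-case the rest."""
    return " ".join(word[:1].upper() + word[1:].lower() for word in text.split(" "))


def keep_match(matched: str, grammar_entry: str) -> bool:
    """Return True if `matched` is the exact, upper case, or title case form of the entry."""
    accepted = (grammar_entry, grammar_entry.upper(), title_case(grammar_entry))
    return matched in accepted
```

For example, for the grammar entry jane smith, the matches jane smith, JANE SMITH, and Jane Smith are kept, and jAnE sMiTh is filtered out.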