Example Scripts
Eduction includes the following example post-processing scripts.
Checksum Validation
The checksum_luhn.lua script verifies the checksum digit of each match using the Luhn algorithm, and reduces the score associated with the match if the checksum is wrong. The checksum_luhn_enmasse.lua script performs checksum validation as an en masse processing task, discards incorrect matches, and alters the score of correct matches to equal the proportion of matches that have the correct checksum digit.
You can use these scripts with the number_cc.ecr and number_sin_ca.ecr grammar files to validate most credit card numbers.
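The following Python sketch illustrates the Luhn check and the en masse scoring described above. It is not the code of the shipped Lua scripts; the function names luhn_valid and score_en_masse, and the representation of matches as plain strings, are assumptions made for illustration.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit (the check digit) and double every
    # second digit, subtracting 9 when the doubled value exceeds 9.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def score_en_masse(matches: list[str]) -> list[tuple[str, float]]:
    """Discard invalid matches; score the rest by the proportion that were valid."""
    if not matches:
        return []
    valid = [m for m in matches if luhn_valid(m)]
    proportion = len(valid) / len(matches)
    return [(m, proportion) for m in valid]
```

For example, if one of two candidate matches has a correct check digit, the en masse sketch keeps that match and gives it a score of 0.5.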
Spanish Identity Card Number Validation
You can use the checksum_dni_es.lua script with the number_dni_es.ecr grammar file to validate Spanish Documento Nacional de Identidad (national identity card) numbers.
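As an illustration of the kind of check involved, a DNI consists of eight digits followed by a control letter, and the letter is determined by the remainder of the number divided by 23. The Python sketch below shows that rule. It is not the shipped Lua script, and the function name dni_valid and the handling of spaces and hyphens are assumptions.

```python
# Control letters indexed by (number mod 23).
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def dni_valid(dni: str) -> bool:
    """Return True if the control letter matches the eight-digit number."""
    # Assumption: separators such as spaces and hyphens may appear in a match.
    cleaned = dni.replace("-", "").replace(" ", "").upper()
    if len(cleaned) != 9 or not cleaned[:8].isdigit():
        return False
    expected = DNI_LETTERS[int(cleaned[:8]) % 23]
    return cleaned[8] == expected
```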
Dutch Citizen Service Number Validation
You can use the checksum_bsn_nl.lua script with the number_bsn_nl.ecr grammar file to validate Dutch Citizen Service Numbers (Burgerservicenummer, or BSNs).
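BSNs are commonly validated with the "11-test": the nine digits are weighted 9, 8, 7, 6, 5, 4, 3, 2, and -1, and the weighted sum must be divisible by 11. The Python sketch below shows that test under the assumption that the match has already been reduced to nine digits. It is not the shipped Lua script, and the function name bsn_valid is hypothetical.

```python
def bsn_valid(bsn: str) -> bool:
    """Return True if a nine-digit string passes the BSN 11-test."""
    if len(bsn) != 9 or not bsn.isdigit():
        return False
    # The final digit carries a weight of -1 rather than 1.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(weight * int(digit) for weight, digit in zip(weights, bsn))
    return total % 11 == 0
```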
Geographical Coordinate Standardization
You can use the lat_long.lua script with the place_lat_long.ecr grammar file to convert and standardize the output of geographical coordinates.
Date and Time Standardization
You can use the datetime.lua script with the datetime_advanced_eng.ecr grammar file to convert English dates and times (and ranges) into a single standardized format when matches occur in several different formats. For example, you can convert both 23/11/13 and Nov 23 2013 to 2013-11-23.
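The following Python sketch illustrates the idea of mapping several input formats to one ISO output, using the two example inputs above. It is not how the shipped Lua script works internally; the function name normalize_date and the list of candidate formats are assumptions, and a day-first reading of 23/11/13 is assumed.

```python
from datetime import datetime

# Candidate input formats, tried in order (an illustrative, incomplete list).
CANDIDATE_FORMATS = (
    "%d/%m/%y",   # 23/11/13
    "%b %d %Y",   # Nov 23 2013
)


def normalize_date(text: str):
    """Return the date as YYYY-MM-DD, or None if no candidate format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```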
The datetime_advanced_eng.ecr grammar file can understand English natural language dates, and relative dates such as last Saturday morning. You can optionally provide a reference date for <today> in the Lua script to customize normalization of relative dates into standard formats, by using the following user parameter:
refdate
|
A date in ISO YYYY-MM-DD format. If you do not set |
For date and time range matches, this script sets the normalized text to <start>/<end>, and additionally adds STARTPOINT and ENDPOINT components that contain the associated dates or times. When there is a multiple date match (for example, 5th and 8th July matches as 5th July and 8th July), the script returns a comma-separated list, with a POINT component for each date.
For contextual date matches (such as two days after), the script includes the following optional parameters, which allow you to discard matches where the closest contextual date is too far away from the match:
contextmaxdistance
|
Discard contextual matches that occur more than this many characters from the last date match. |
contextpenaltydistance
|
Reduce the score for matches that lie between this distance and the maximum cut-off (contextmaxdistance). This option applies a linear reduction, which scales to zero at the maximum distance. |
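The interaction of the two parameters can be sketched as follows. This Python function is an illustration of the linear reduction described above, not the shipped Lua script; the function name contextual_score and the use of None to represent a discarded match are assumptions.

```python
def contextual_score(score, distance, max_distance, penalty_distance):
    """Apply the context distance rules to a match score.

    Returns None when the match should be discarded.
    """
    if distance > max_distance:
        return None  # further than contextmaxdistance: discard
    if distance <= penalty_distance:
        return score  # closer than contextpenaltydistance: no penalty
    # Linear reduction from the full score at penalty_distance
    # down to zero at max_distance.
    remaining = (max_distance - distance) / (max_distance - penalty_distance)
    return score * remaining
```

For example, with a penalty distance of 100 and a maximum distance of 200, a match 150 characters from the last date match keeps half of its score.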
Filter Matches by Case
You can use the case_filter.lua example script to filter out matches by case, for example in personal name grammars.
To use this script, you must set MatchCase to False for the grammar. The script filters out any match that is not one of:
- an exact match as specified in the grammar.
- an uppercase match (for example, JANE SMITH).
- a title case match (for example, Jane Smith).
NOTE: You might need to update this script to include case mappings for uncommon non-ASCII characters. The script provides sample mappings for common Latin characters with diacritics.
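The filtering rule can be sketched in Python as follows. This is an illustration only, not the shipped Lua script; the function name keep_match and its arguments are hypothetical. Python strings handle Unicode case mapping natively, so this sketch does not need the explicit mapping tables that the note above describes for the Lua script.

```python
def keep_match(matched_text: str, grammar_entry: str) -> bool:
    """Return True if the match is exact, uppercase, or title case."""
    allowed = (
        grammar_entry,          # exact form as written in the grammar
        grammar_entry.upper(),  # JANE SMITH
        grammar_entry.title(),  # Jane Smith
    )
    return matched_text in allowed
```

Note that str.title() capitalizes the letter after every non-letter character, so names that contain apostrophes or similar characters might need special handling in a real filter.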