Data Sources
The IDOL PCI Package contains several kinds of entities that describe payment card information protected by payment card industry (PCI) regulations. The following sections describe how the data behind these entities is compiled.
For all of these types of information, as much test data as possible is acquired to measure the recall of the algorithms. Many millions of examples are run through the grammars to ensure that all patterns in use are covered.
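As an illustration of the recall measurement mentioned above, the following minimal Python sketch computes recall as the fraction of known test examples that a grammar actually finds. The function name and the sample values are assumptions made for this example, and are not part of the package.

```python
def recall(expected, found):
    """Return the fraction of expected matches that were actually found."""
    expected = set(expected)
    if not expected:
        raise ValueError("expected must not be empty")
    return len(expected & set(found)) / len(expected)


# Hypothetical test data: values that should be matched, and values that were.
expected = {"4111 1111 1111 1111", "5500 0000 0000 0004", "3400 0000 0000 009"}
found = {"4111 1111 1111 1111", "5500 0000 0000 0004"}

print(round(recall(expected, found), 3))
```

A recall below 1.0 points to patterns that occur in the test data but are not yet covered by the grammars.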
Names
An international database containing records for over 100 million individuals is analyzed to identify the structure and characteristics of names in each country. In doing so, extensive lists of the frequencies of occurrence of given names and family names are used to generate strong identification grammars for names.
Other sources are also included for some countries, such as census data and lists of popular baby names. The resulting name lists are also checked by performing Eduction over a large corpus of public data to find forenames and surnames that result in too many false positives; these names are added to a name stop list.
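The stop list step can be sketched in Python as follows. This is only an illustration: it assumes that each match from the corpus has already been reviewed and labeled as a real name or not, and the function name, thresholds, and sample data are hypothetical rather than taken from the package.

```python
from collections import Counter


def build_stop_list(matches, min_matches=3, max_false_positive_rate=0.5):
    """Collect names whose matches are mostly false positives.

    matches: iterable of (candidate_name, is_real_name) pairs.
    Both thresholds are illustrative assumptions.
    """
    total = Counter()
    false_positives = Counter()
    for name, is_real_name in matches:
        key = name.casefold()
        total[key] += 1
        if not is_real_name:
            false_positives[key] += 1
    return {
        name
        for name, count in total.items()
        if count >= min_matches
        and false_positives[name] / count > max_false_positive_rate
    }


# Hypothetical reviewed matches from a public corpus.
reviewed = [
    ("May", False), ("May", False), ("May", True), ("May", False),
    ("Alice", True), ("Alice", True), ("Alice", True),
    ("Rose", False),
]
stop_list = build_stop_list(reviewed)
print(sorted(stop_list))
```

In this sample, "May" is stopped because most of its matches are not names, while "Rose" has too few matches to judge.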
In addition, rules are included to handle linguistic information, such as transliteration (for example, from the Cyrillic or Greek alphabets), or the use or removal of diacritic marks.
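The diacritic handling described above can be illustrated with the Python standard library. This sketch shows one common approach (Unicode decomposition followed by removal of combining marks); it is not necessarily how the package grammars work, and the function names are assumptions.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, for example turning 'José' into 'Jose'.

    Letters with no Unicode decomposition (such as 'Ł') are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def name_variants(name):
    """Return the name with and without diacritic marks."""
    return {name, strip_diacritics(name)}


print(sorted(name_variants("Zoë")))
```

Matching against both variants allows a name to be found whether or not the source text preserves the marks.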
Dates
A large corpus of documents from public sources is processed to analyze the occurrence and format of dates. In this way, coverage of both common and less common formats is built up, together with a likelihood measure that indicates the confidence that the identified characters are a payment card date rather than an unrelated date or other alphanumeric string.
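To make the idea of a likelihood measure concrete, the following Python sketch finds strings in a month and year format and assigns a higher score when nearby text suggests a card expiry date. The patterns, the context words, the window size, and the two score values are all assumptions chosen for illustration; the actual grammars and scoring are not described here.

```python
import re

# Month 01-12, a separator, then a two-digit or four-digit year.
EXPIRY = re.compile(r"\b(0[1-9]|1[0-2])\s?[/-]\s?(\d{4}|\d{2})\b")

# Hypothetical context words that suggest a card expiry date.
CONTEXT = re.compile(
    r"\b(exp(?:iry|ires|iration)?|valid\s+(?:thru|through|until))\b",
    re.IGNORECASE,
)


def find_card_dates(text, window=20):
    """Return (match, score) pairs; the scores are illustrative only."""
    results = []
    for m in EXPIRY.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        score = 0.9 if CONTEXT.search(before) else 0.4
        results.append((m.group(0), score))
    return results


sample = "Expires 09/27. Notes from the quarterly planning session dated 03/15."
print(find_card_dates(sample))
```

Both strings have the same format, but only the first is preceded by a context word, so it receives the higher score.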
PCI Numbers
The formats of the PCI number entities are sourced from the PCI Security Standards Council, and from other public sources where appropriate.
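One widely documented property of payment card numbers is the Luhn check digit. The following Python sketch shows that check as an example of format validation; whether and how the package grammars apply it is not stated here, and the function name and the accepted length range of 12 to 19 digits are assumptions.

```python
def luhn_valid(number):
    """Return True if the digits pass the Luhn checksum.

    Spaces and hyphens are ignored; any other non-digit character fails.
    """
    compact = number.replace(" ", "").replace("-", "")
    if not (compact.isascii() and compact.isdigit()):
        return False
    if not 12 <= len(compact) <= 19:  # assumed length range
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, ch in enumerate(reversed(compact)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A checksum of this kind helps to separate candidate card numbers from unrelated digit strings of the same length.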