Data Sources
The IDOL PHI Package contains a variety of different kinds of entities to describe healthcare information that is protected by regulations such as HIPAA. The following sections provide some information about how this information is compiled.
Each entity type is tested extensively to optimize precision and recall. Many millions of examples are run through the package to test the coverage of the patterns and algorithms involved.
Names
An international database containing over 100 million individuals is analyzed to identify the structure and characteristics of names in each country. The resulting lists of given names and family names, with their frequencies of occurrence, are used to generate strong identification grammars for names.
Other sources are also included for some countries, such as census data and lists of popular baby names. The name lists are also checked by performing Eduction over a large corpus of public data. Forenames and surnames that result in too many false positives are added to a name stop list.
In addition, rules are included to handle linguistic information, such as transliteration (for example, from the Cyrillic or Greek alphabets), or the use or removal of diacritic marks.
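As an illustration of the diacritic handling described above, the following minimal sketch generates a diacritic-free variant of a name by using Unicode decomposition. The function names are hypothetical and are not part of the IDOL PHI Package. The sketch does not cover transliteration between alphabets.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks removed (for example, 'José' -> 'Jose')."""
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(name: str) -> set:
    """Hypothetical helper: the name as written, plus its diacritic-free form."""
    return {name, strip_diacritics(name)}
```

Letters that have no Unicode decomposition (such as ø or Ł) are left unchanged by this approach and would need explicit mapping rules.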
Age
The linguistic patterns of usage of both unstructured and semi-structured text are analyzed in all supported languages to determine the range of formats used to refer to a patient's age or age demographic. The resulting grammar establishes a confidence measure to distinguish references to age as opposed to other information, and includes all elements of dates that allow the determination of age.
Dates
A large corpus of documents from public sources is processed to analyze the occurrence and format of dates. In this way, coverage of all common and less-common formats is built up, while enabling a likelihood measure to indicate the confidence that the characters identified are a date of birth, rather than an unrelated date or other alphanumeric string.
Dates of any type that relate to an individual's healthcare, other than a single year, are covered by PHI regulation. The IDOL PHI Package allows determination of all such dates from analysis of linguistic patterns in all supported languages. In addition, the package can identify dates of particular types, such as date of death, and hospital admission and discharge dates.
Postal Codes
For each country, the publications of the national Postal Service are used as the authoritative source for postal code formats.
In addition, testing against widely-gathered examples allows the identification and inclusion of non-standard formats and common errors (such as mixing the letter O with the digit 0), with an appropriately adjusted likelihood measure.
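The idea of accepting a common error with a reduced likelihood can be sketched as follows. This is only an illustration: the five-digit format, the function name, and the score values are assumptions, not the grammar or scores that the package uses.

```python
import re

# Hypothetical five-digit postal code format, for illustration only.
STRICT = re.compile(r"\d{5}")
# Lenient form that also accepts the letter O where a digit is expected.
LENIENT = re.compile(r"[\dOo]{5}")


def score_postal_code(candidate: str):
    """Return (normalized_code, score), or None when nothing matches."""
    if STRICT.fullmatch(candidate):
        return candidate, 1.0
    if LENIENT.fullmatch(candidate):
        # Correct the common O/0 confusion, but report a lower (assumed) score.
        normalized = candidate.replace("O", "0").replace("o", "0")
        return normalized, 0.6
    return None
```

With these assumptions, `score_postal_code("9O210")` returns `("90210", 0.6)`, while the correctly written `"90210"` scores `1.0`.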
For countries where official sources are not available, public sources such as Wikipedia are used to source postal code formats.
Addresses and Locations
The identification of addresses consists of a number of steps, each of which is used as additional evidence that a piece of text represents a postal address. These are:
- The format of the text.
- The house number / street-name portion.
- The village / town / county / region portion.
- The postal code.
These components are not necessarily always present for a particular address, but each is taken as evidence that the text does indeed contain an address, combining to form an overall likelihood.
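One simple way to combine independent pieces of evidence into an overall likelihood is a noisy-OR rule, sketched below. The source does not state which combination method the package uses, so the rule, the component names, and the scores are all assumptions for illustration.

```python
def combine_evidence(scores: dict) -> float:
    """Noisy-OR combination: 1 - product of (1 - score) over present components.

    Components that are absent from the text are simply left out of `scores`,
    so they neither help nor hurt the overall likelihood.
    """
    remaining_doubt = 1.0
    for score in scores.values():
        remaining_doubt *= 1.0 - score
    return 1.0 - remaining_doubt


# Hypothetical per-component scores for a candidate address.
candidate = {"format": 0.5, "street": 0.6, "postal_code": 0.7}
overall = combine_evidence(candidate)
```

Under this rule, each additional component that is found can only raise the overall likelihood, which matches the description of each component acting as additional evidence.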
Few countries have prescribed formats for addresses. Most have conventions defined by the national Postal Service that are generally adhered to, but also frequently ignored.
The IDOL Web Connector is used to gather many millions of web documents to identify candidate addresses in each applicable country. From there, the variety of formats that are used in practice are identified. In addition, any recommendations published by the national Postal Services are also used. The Universal Postal Union and other reputable sources are also used to generate and confirm address formats.
For the street-address portion, the extensive OpenStreetMap project is used, and a database of every named street in each of the supported countries is obtained and analyzed. From this database, rules are derived to allow the identification of the vast majority of street-address strings.
The de facto authority for geographical place names is the GeoNames database, with 11 million locations identified by data including country, population, and type. In particular, the type field is used to generate complete lists of populated settlements and administrative regions (such as county, department, or region) for the countries that frequently use those in addresses. In addition, the names are available in different character sets and transliteration schemes to ensure internationalization.
Other official sources are also used to generate city, town, and region lists.
The patterns derived for matching Postal Codes are also used here (see Postal Codes).
The patterns are tested by performing Eduction on address lists generated from various online sources to ensure that recall is sufficiently high, to provide confidence that each address format is correct. These lists are also used to adjust the address format if required. In addition, the address grammars are tested against other public sources, such as Wikipedia articles, to ensure that the address formats do not return too many false positives.
For locations, the IDOL PHI Package identifies any address portion smaller than a state.
Telephone Number
The general schemes for the creation of telephone numbers and fax numbers are readily available from the appropriate government department of each country. However, the formats of such numbers when written down vary considerably within a country, and even more so when numbers are referred to in a foreign document.
The strategy for creating comprehensive phone number matching grammars is centered on several key methods:
- Knowledge of the national scheme for assigning numbers.
- Databases of international and area codes in each country, obtained from authoritative sources.
- Analysis of many millions of examples of the usage of telephone numbers, obtained from a wide variety of public sources.
This final point is the most important. Only through examination of real-world usage of such numbers is the full range of formats obtained for each country.
The proximity of keywords indicating that the digits represent a telephone or fax number is used to strengthen the likelihood of the match.
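The keyword-proximity idea can be sketched as follows. The digit pattern, keyword list, context window, and score values are hypothetical simplifications chosen for illustration. The real grammars are derived from national numbering schemes and real-world examples.

```python
import re

# Loose, illustrative pattern: an optional '+', then digits with common separators.
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
# Hypothetical landmark keywords that suggest a telephone or fax number.
KEYWORDS = re.compile(r"\b(?:tel|telephone|phone|fax|mobile|cell)\b", re.IGNORECASE)


def find_phone_numbers(text: str, window: int = 20, base: float = 0.5, boost: float = 0.4):
    """Return (match, score) pairs; a keyword shortly before the match raises the score."""
    results = []
    for match in PHONE.finditer(text):
        # Look only at the few characters immediately before the candidate digits.
        context = text[max(0, match.start() - window):match.start()]
        score = base + boost if KEYWORDS.search(context) else base
        results.append((match.group(), score))
    return results
```

In this sketch, the same digit string scores higher in "Tel: +44 20 7946 0958" than it would after an unrelated word such as "Order".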
Email Address
The IETF publications RFC 5321 and RFC 5322 define the standards of validity of email addresses, and so the IDOL PHI Package uses these for this purpose. In addition, it uses metrics of likelihood derived from the analysis of the most common email domains, to allow the grammars to differentiate between likely email addresses and those that are unlikely but still valid (for example, example@example.test).
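The following minimal sketch shows how a domain-based likelihood might separate likely addresses from unlikely but valid ones. The regular expression is a deliberate simplification of what RFC 5321 and RFC 5322 allow, and the domain list and score values are hypothetical.

```python
import re

# Simplified pattern: the RFCs permit many address forms that this does not cover.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")
# Hypothetical list of frequently seen domains, for illustration only.
COMMON_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com"}


def score_emails(text: str):
    """Return (address, score) pairs; common domains receive a higher assumed score."""
    results = []
    for match in EMAIL.finditer(text):
        domain = match.group(1).lower()
        score = 0.9 if domain in COMMON_DOMAINS else 0.5
        results.append((match.group(), score))
    return results
```

With these assumptions, `jane.doe@gmail.com` scores `0.9`, while `example@example.test` is still matched but scores `0.5`.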
IP Address/URL
The formats of IP addresses are defined by the IETF in RFC 791 (IPv4) and RFC 4291 (IPv6), with later modifications. The IDOL PHI Package uses these definitions to locate potentially identifying IP addresses in candidate text.
In the same way, the format of Uniform Resource Locators (URLs) is defined in RFC 1738.
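As a rough illustration of format-based checks for these entities, the sketch below classifies a single token by using the Python standard library. The function name and the accepted URL schemes are assumptions. The package itself uses Eduction grammars rather than this code.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_token(token: str):
    """Return 'ipv4', 'ipv6', 'url', or None for a single candidate token."""
    try:
        # ip_address() raises ValueError for anything that is not a valid address.
        address = ipaddress.ip_address(token)
        return f"ipv{address.version}"
    except ValueError:
        pass
    try:
        parts = urlsplit(token)
    except ValueError:
        return None
    # Assumed scheme list; a URL also needs a network location to count here.
    if parts.scheme in {"http", "https", "ftp"} and parts.netloc:
        return "url"
    return None
```

For example, `classify_token("2001:db8::1")` returns `"ipv6"`, and an out-of-range value such as `"999.1.1.1"` is rejected.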
National Identification Number
Each country has a different scheme for the use of National Identification. For countries with National ID cards, the format of the number is derived from governmental sources. In other countries, the formats of National Health, National Social Security, or National Insurance numbers are obtained from governmental sites, with the exception of a few cases in which other sources are used.
Vehicle Identifiers
Each country has a different scheme for the identification of vehicles by license plates. In each case, the national Vehicle Licensing Authority is used as the authoritative source of such information.
Each scheme often has both standard and non-standard formats, with the former following a prescribed system more tightly. When identifying such plates, a likelihood metric takes these formats into account and gives a confidence that an identifier is actually a vehicle license plate.
Medical
Documents that contain mention of medical procedures or conditions are identified with the Medical categories, available in each of the supported languages. The categories are generated from the Medical Subject Headings (MeSH) taxonomy published by the United States National Library of Medicine using the C hierarchy (diseases and conditions). The medical terms are also extended using labels and aliases from public sources such as Wikipedia.
Profession
Documents that contain mention of an individual's occupation or profession are identified by the IDOL PHI Package in each supported language. The items are generated from an international database of over 60 million items.
Unique Device Identifiers
The IDOL PHI Package identifies Unique Device Identifiers (UDI) for medical devices. The formats match the standard formats issued by the three agencies accredited by the US Food and Drug Administration (FDA): GS1, HIBCC, and ICCBBA.
Health Plan and Medical Record Numbers
The IDOL PHI Package identifies health plan numbers and Medical Record Numbers (MRN).
Health plan numbers have a standard format, which is readily available from governmental sources.
MRN formats differ between healthcare providers. In this case, example formats are used to create as broad an identifier as possible, and landmark text is used to locate likely numbers. A grammar extension also allows you to restrict MRN detection to known formats, when you have specific formats that you want to detect. See Medical Record Numbers.
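The combination of landmark text with an optional restriction to known formats can be sketched as follows. The landmark phrases, the broad candidate pattern, and the `known_formats` parameter are hypothetical. They illustrate the approach and do not reflect the actual grammar extension.

```python
import re

# Hypothetical landmark phrases followed by a deliberately broad candidate pattern.
LANDMARK = re.compile(
    r"(?i:\b(?:MRN|Medical Record (?:Number|No\.?)))\s*[:#]?\s*([A-Z0-9-]{5,12})\b"
)


def find_mrns(text: str, known_formats=None):
    """Return candidates that follow a landmark, optionally limited to known formats."""
    found = []
    for match in LANDMARK.finditer(text):
        candidate = match.group(1)
        # With no known formats supplied, accept any broadly plausible candidate.
        if known_formats is None or any(
            re.fullmatch(pattern, candidate) for pattern in known_formats
        ):
            found.append(candidate)
    return found
```

For example, passing `known_formats=[r"\d{8}"]` keeps only eight-digit candidates, while omitting the argument keeps every candidate that follows a landmark.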
Birth Certificate Numbers
Birth certificate numbers have standard formats, which are readily available from governmental sources, such as the Social Security Administration (SSA).
Laboratory Numbers
The IDOL PHI Package identifies Clinical Laboratory Improvement Amendments (CLIA) laboratory numbers. The formats of these numbers are obtained from governmental sources, such as the Centers for Disease Control and Prevention (CDC).
DEA
US Drug Enforcement Administration (DEA) numbers have a standard format. Official US Department of Justice sources, as well as secondary public sources, are used to define the format in the entities.
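As an illustration of a standard format with a built-in check, the sketch below validates a candidate against the widely published DEA number layout: two leading characters, six digits, and a check digit. The check-digit rule is taken from public descriptions of the format and is not stated in this documentation, so treat it as an assumption. The sketch also accepts any uppercase letter in the first position, which is looser than the real registrant codes.

```python
import re

# Assumed layout: a letter, then a letter or '9', then seven digits.
DEA = re.compile(r"[A-Z][A-Z9]\d{7}")


def is_valid_dea(number: str) -> bool:
    """Check the layout and the publicly described check digit of a DEA number."""
    if not DEA.fullmatch(number):
        return False
    digits = [int(ch) for ch in number[2:]]
    # Sum of digits 1, 3, 5 plus twice the sum of digits 2, 4, 6.
    total = digits[0] + digits[2] + digits[4] + 2 * (digits[1] + digits[3] + digits[5])
    # The last digit of the total must equal the seventh digit.
    return total % 10 == digits[6]
```

A check of this kind lets a grammar raise the score of candidates that satisfy the checksum and lower it for strings that only look similar.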
Accounts
The bank account and SWIFT code formats have been compiled in existing Micro Focus Eduction grammars, and tested extensively against appropriate data.