Regular Expressions

This section describes the regular expression syntax that Eduction supports.

The Eduction engine parser interprets regular expressions in almost the same way as standard UNIX regular expression syntax. It also supports some extensions for matching substrings.

Operators

The following table describes the base regular expression operators available in the Eduction engine, and the pattern that each operator matches.

Operator Matched Pattern
\ Quote the next metacharacter.
^ Match the beginning of a line.
$ Match the end of a line.
. Match any character (except newline).
| Alternation.
() Used for grouping to force operator precedence.
[xy] The character x or y.
[x-z] The range of characters between x and z.
[^z] Any character except z.

NOTE: For performance reasons, Micro Focus recommends that you explicitly list all the characters that you want to match, rather than using this operator.

NOTE: To use negated character classes in case-insensitive entities, you must include letters in both cases, for example [^Zz] rather than [^z].
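
As a minimal sketch of how these operators combine in a grammar file, the following entity uses a character class, grouping, and alternation together. The grammar name examples and the entity name grey_spelling are hypothetical, chosen only for illustration.

```xml
<grammars>
   <grammar name="examples">
      <!-- Hypothetical entity: matches "grey", "gray", "Grey", or "Gray" -->
      <entity name="grey_spelling" type="public">
         <!-- [Gg] lists both cases explicitly; (e|a) groups the alternation -->
         <pattern>[Gg]r(e|a)y</pattern>
      </entity>
   </grammar>
</grammars>
```

The pattern lists the permitted characters explicitly rather than using a negated class, in line with the performance note above.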

Quantifiers

Operator Matched Pattern
* Match 0 or more times.
+ Match 1 or more times.
? Match 0 or 1 times.
{n} Match exactly n times.
{n,} Match at least n times.
{n,m} Match at least n times, but no more than m times.
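
The following sketch shows the quantifiers in a pattern, assuming the same grammar file structure that the other examples in this section use. The grammar name examples and the entity name zip_code are hypothetical.

```xml
<grammars>
   <grammar name="examples">
      <!-- Hypothetical entity: five digits, optionally followed by
           a hyphen and exactly four more digits -->
      <entity name="zip_code" type="public">
         <!-- {5} and {4} match exact counts; ? makes the group optional -->
         <pattern>[0-9]{5}(-[0-9]{4})?</pattern>
      </entity>
   </grammar>
</grammars>
```

With this pattern, both 94105 and 94105-1234 would be candidate matches, because the parenthesized group can occur zero or one times.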

Metacharacters

Operator Matched Pattern
\t Match tab.
\n Match newline.
\r Match return.
\f Match formfeed.
\a Match alarm (bell, beep, and so on).
\e Match escape.
\v Match vertical tab.
\021 Match octal character (in this example, 21 octal).
\xF0 Match hex character (in this example, F0 hex).
\x{263a} Match wide hex character (Unicode).
\w Match word character: [A-Za-z0-9_].
\W Match non-word character: [^A-Za-z0-9_].
\s Match whitespace character. This metacharacter also includes \n and \r: [ \t\n\r].
\S Match non-whitespace character: [^ \t\n\r].
\d Match digit character: [0-9].
\D Match non-digit character: [^0-9].
\b Match word boundary.
\B Match non-word boundary.
\A Match start of string (never match at line breaks).
\Z Match end of string. Never match at line breaks; only match at the end of the final buffer of text submitted for matching.
\p{class} Match any character that belongs to the specified Unicode character class. For example, \p{Sc} matches any currency symbol. You can omit the braces for single-character class names: \p{C} and \pC are equivalent. For a list of supported character classes, see Supported Unicode Character Classes.
\P{class} Match any character that does not belong to the specified Unicode character class. For example, \P{Sc} matches any character that is not a currency symbol. You can omit the braces for single-character class names: \P{C} and \PC are equivalent. For a list of supported character classes, see Supported Unicode Character Classes.

NOTE: For performance reasons, Micro Focus recommends that you avoid using negated character classes where possible.
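
As an illustration of metacharacters in a pattern, the following sketch combines a Unicode character class with the digit metacharacter. The grammar name examples and the entity name money_amount are hypothetical.

```xml
<grammars>
   <grammar name="examples">
      <!-- Hypothetical entity: a currency symbol followed by one or more
           digits and an optional two-digit decimal part -->
      <entity name="money_amount" type="public">
         <!-- \p{Sc} matches any currency symbol; \. quotes the period
              so that it matches a literal "." -->
         <pattern>\p{Sc}\d+(\.\d{2})?</pattern>
      </entity>
   </grammar>
</grammars>
```

The pattern uses the positive class \p{Sc} rather than a negated class, which follows the recommendation in the note above.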

Extensions

Operator Matched Pattern
(?A:entity) Match a previously defined entity, and copy it into the definition of the new entity.

For example:

<include path="number_types_eng.ecr"/>
<entity name="fracpos" type="private">
   <pattern>(?A:number/fracalpha/eng)</pattern>
</entity>

Copying an entity improves pattern execution speed, but increases compilation time and memory usage. Micro Focus recommends that you reference entities by using (?A^entity) in all cases, unless your grammar file is very simple and extraction speed is critical.

(?A^entity) Match a previously defined entity, and reference it in the definition of the new entity.

Referencing an entity minimizes the size and memory usage of the grammar, but can decrease performance. The performance impact depends on the size and structure of the grammar.
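
For comparison, the following sketch rewrites the (?A:entity) example so that it references the entity instead of copying it. It assumes the same included file and entity path as that example; only the operator changes.

```xml
<include path="number_types_eng.ecr"/>
<!-- Same entity as in the copy example, but referenced with (?A^ ) -->
<entity name="fracpos" type="private">
   <pattern>(?A^number/fracalpha/eng)</pattern>
</entity>
```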

(?A!expr) Match the expression expr but exclude its output. This option designates an expression that helps identify an entity, but is not part of it.

For example:

<entity name="age_landmark" type="private">
   <pattern>Age:{0,1}\s*</pattern>
   <pattern>Años:{0,1}\s*</pattern>
</entity>
<entity name="age" type="public">
   <pattern>(?A!(?A^age_landmark))[1-9][0-9]?</pattern>
</entity>

When you use this grammar to search the following text:

   Name: Simon. Age: 32. Address. 12 Fifth Street, Las Vegas.

the grammar returns the text 32 but ignores 12, because 12 does not have the prefix “Age:”. The prefix is matched but excluded from the output.

(?A=component:expr) Define a component in an entity definition. A component is a named part of an entity.

For example, the following grammar defines areacode and main as components:

<grammars>
   <grammar name="number">
      <entity name="phone" type="public">
         <pattern>(?A=areacode:[0-9]{3})-(?A=main:[0-9]{3}-[0-9]{4})</pattern>
      </entity>
   </grammar>
</grammars>

If the data contains the following phrase:

    The phone number is 408-555-1342.

and the following configuration options are set:

    OutputSimpleMatchInfo=false
    EnableComponents=true

then the output displays the areacode value 408 and the main value 555-1342 separately.