Text Match

Inherits From: Value Extractor
Namespace: Grooper.Extract

Serves as the base class for value extractors that use regular expressions to locate and extract data from document text.

Remarks

The Text Match class provides a flexible and powerful foundation for building value extractors that rely on regular expressions.
It enables advanced pattern matching, prefix and suffix constraints, culture-aware extraction, and integration with Grooper's variable and lexicon system.

Overview

Text Match extractors are designed to locate and extract values from document text using regular expressions.
They support a wide range of scenarios, from simple keyword matching to complex, multi-line, and culture-specific patterns.

Key features include:

Prefix and Suffix Patterns:
Configure optional regular expressions that must occur immediately before or after each match, allowing for context-sensitive extraction.
Environment Options:
Inject merge variables and culture-specific lists into your patterns, supporting localization and reuse of regex snippets.
Case Sensitivity:
Control whether matching is case-sensitive, enabling extraction of values where capitalization is meaningful.
Text Preprocessing:
Manipulate control characters (such as line breaks, tabs, and spaces) to improve pattern matching across document structures.
Chunked Processing:
For large documents, break content into page chunks to optimize performance and manage memory usage.
Result Filtering and Output Options:
Apply post-processing to extracted results, including normalization, confidence adjustment, sorting, and filtering.

Configuration Guidance

Define your main regular expression pattern in a derived extractor.
Use 'Prefix Pattern' and 'Suffix Pattern' to add context constraints.
Configure 'Environment Options' to inject variables and control culture settings.
Enable or disable 'Case Sensitive' matching as needed for your scenario.
Adjust 'Chunk Size' for optimal performance on large documents.
Use 'Result Filter' and 'Result Set Options' to shape the output for downstream use.

Usage Scenarios

Extracting labeled values:
Use prefix patterns to require a label (e.g., Invoice No: ?) before the value.
Culture-aware extraction:
Inject culture-specific lists (e.g., day or month names) using merge variables.
Multi-line and tabular data:
Preprocess text to handle line breaks, tabs, and paragraph boundaries for robust extraction.
Performance optimization:
Enable chunked processing for documents with hundreds or thousands of pages.

Diagnostics

When diagnostic logging is enabled, Text Match records detailed information about the extraction process:

Regular Expression Log:
Logs the compiled regular expression used for extraction, including injected variables and culture settings.
Chunking Log:
Records chunk boundaries and extraction steps for large documents.
Result Filtering Log:
Details the application of result filters and output options.
Extraction Summary:
Reports the number of data instances produced and highlights any errors or validation issues.

These artifacts are accessible via the diagnostic interface and can be used to validate, troubleshoot, and tune your Text Match configuration.

Properties

Name Type Description

Matching

Prefix Pattern

String

Defines an optional prefix which must occur immediately before each match.

The 'Prefix Pattern' property allows you to specify a regular expression that must be present immediately before each match found by the Text Match extractor. This enables context-sensitive extraction, ensuring that only values preceded by a specific pattern, label, or structural element are returned.

Purpose

Use this property to restrict matches to those that occur after a particular label, whitespace, line break, or other context. This is especially useful for extracting labeled values, enforcing boundaries, or avoiding false positives in complex documents.

Configuration Guidance

Enter a regular expression that describes the required prefix. The pattern is applied directly before each match.
Common examples include whitespace, document labels, line or page boundaries, or exclusion of certain characters.
Use anchors (such as ^ for start of document, \f for page break, \n for line break) to target structural positions.
Combine multiple options using the | (OR) operator for flexibility.

Examples

Prefix Pattern	Description
`[\s-]`	Must be preceded by a whitespace character or hyphen.
`^|[\s-]`	Same as above, but also allows matches at the beginning of the document.
`Invoice No: ?`	Must be preceded by the label `Invoice No:`.
`[^\w]`	Cannot be preceded by a letter, digit, or underscore.
`^|\f`	Matches must occur at the beginning of a page.
`^|\n`	Matches must occur at the beginning of a line/paragraph.
`\t`	Matches must occur after a large horizontal whitespace gap.
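
The effect of a prefix can be approximated outside Grooper with a short sketch. Python's `re` module is used here only as a stand-in, and the helper `match_with_prefix` is hypothetical; how Grooper itself combines the prefix with the main pattern is not shown in this document and may differ. The sketch consumes the prefix in a non-capturing group and returns only the captured value, because Python lookbehinds require fixed-width patterns and prefixes such as `Invoice No: ?` are variable-width.

```python
import re

def match_with_prefix(text, value_pattern, prefix_pattern):
    """Return values matching value_pattern that directly follow prefix_pattern.

    Hypothetical helper for illustration; not part of Grooper.
    """
    # The non-capturing group keeps alternations such as ^|[\s-] scoped
    # to the prefix. Only the value itself is captured and returned.
    combined = re.compile(rf"(?:{prefix_pattern})({value_pattern})")
    return [m.group(1) for m in combined.finditer(text)]

sample = "Invoice No: 10045\nPO No: 77120\nRef-88231"

# Only the number that follows the label is returned.
labeled = match_with_prefix(sample, r"\d+", r"Invoice No: ?")

# Any number preceded by whitespace or a hyphen is returned.
after_space_or_hyphen = match_with_prefix(sample, r"\d+", r"[\s-]")

# Numbers at the start of the text or directly after a line break.
at_line_start = match_with_prefix("123 abc\n456 def 789", r"\d+", r"^|\n")

print(labeled, after_space_or_hyphen, at_line_start)
```

Because this sketch consumes the prefix characters, two values that share a single separator character would not both be found; a true lookbehind would avoid that limitation.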

Impact

Only matches preceded by the specified pattern are included in the output.
Improves accuracy by filtering out values that do not occur in the desired context.
Can be used in combination with 'Suffix Pattern' for even more precise extraction.

Usage Scenarios

Extracting values labeled with a specific phrase (e.g., Invoice No:).
Restricting matches to the start of a page or line.
Avoiding matches that are embedded within words or unwanted regions.

> Use diagnostics to review which prefix patterns were applied and to troubleshoot extraction boundaries.

Suffix Pattern

String

Defines an optional suffix which must occur immediately after each match.

The 'Suffix Pattern' property allows you to specify a regular expression that must be present immediately after each match found by the Text Match extractor. This enables context-sensitive extraction, ensuring that only values followed by a specific pattern, label, or structural element are returned.

Purpose

Use this property to restrict matches to those that occur before a particular label, whitespace, line break, or other context. This is especially useful for extracting values with trailing units, enforcing boundaries, or avoiding false positives in complex documents.

Configuration Guidance

Enter a regular expression that describes the required suffix. The pattern is applied directly after each match.
Common examples include whitespace, document labels, line or page boundaries, or exclusion of certain characters.
Use anchors (such as $ for end of document, \f for page break, \r for line break) to target structural positions.
Combine multiple options using the | (OR) operator for flexibility.

Examples

Suffix Pattern	Description
`[\s:.-]`	Must be followed by a whitespace character, colon, period, or hyphen.
`$|[\s:.-]`	Same as above, but also allows matches at the end of the document.
`[ ]acres`	Must be followed by a space and the word `acres`.
`[^\w]`	Cannot be followed by a letter, digit, or underscore.
`$|\f`	Matches must occur at the end of a page.
`$|\r`	Matches must occur at the end of a line/paragraph.
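
A suffix constraint behaves much like a regular expression lookahead. The following is a minimal sketch using Python's `re` module as a stand-in; the helper `match_with_suffix` is hypothetical, and Grooper's own handling of the suffix may differ.

```python
import re

def match_with_suffix(text, value_pattern, suffix_pattern):
    """Return values matching value_pattern that are directly followed by suffix_pattern.

    Hypothetical helper for illustration; not part of Grooper.
    """
    # The lookahead checks the suffix without consuming it, so the
    # suffix text is never part of the returned value.
    combined = re.compile(rf"({value_pattern})(?={suffix_pattern})")
    return [m.group(1) for m in combined.finditer(text)]

land = "Parcel A: 12.5 acres, Parcel B: 40 acres, Lot 7"

# Only numbers followed by " acres" are returned; the lot number is skipped.
acreage = match_with_suffix(land, r"\d+(?:\.\d+)?", r"[ ]acres")

codes = "ID 555-0199: ok 42x 7"

# Without $, the trailing 7 has nothing after it and is rejected.
delimited = match_with_suffix(codes, r"\d+", r"[\s:.-]")

# With $ as an alternative, a value at the very end is also accepted.
delimited_or_end = match_with_suffix(codes, r"\d+", r"$|[\s:.-]")

print(acreage, delimited, delimited_or_end)
```

In both calls on `codes`, the value `42` is rejected because it is followed by a letter, which is the "embedded within words" case described under Usage Scenarios.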

Impact

Only matches followed by the specified pattern are included in the output.
Improves accuracy by filtering out values that do not occur in the desired context.
Can be used in combination with 'Prefix Pattern' for even more precise extraction.

Usage Scenarios

Extracting values with trailing units or labels (e.g., acres).
Restricting matches to the end of a page or line.
Avoiding matches that are embedded within words or unwanted regions.

> Use diagnostics to review which suffix patterns were applied and to troubleshoot extraction boundaries.

Environment

Environment Options

Provides configuration for merge variables and culture settings used by regex-based extractors.
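
The idea of injecting a culture-specific list into a pattern can be sketched as simple template expansion. Everything in this example is an assumption made for illustration: the `{MonthNames}` placeholder, the `MONTHS` table, and the `expand` helper are hypothetical and do not reflect Grooper's actual merge variable syntax or culture data, which this page does not show.

```python
import re

# Hypothetical, abbreviated culture lists. A real system would supply
# complete lists from its culture settings.
MONTHS = {
    "en-US": ["January", "February", "March"],
    "de-DE": ["Januar", "Februar", "März"],
}

def expand(pattern_template, culture):
    """Replace the hypothetical {MonthNames} placeholder with an alternation."""
    alternation = "|".join(re.escape(name) for name in MONTHS[culture])
    # A non-capturing group keeps the alternation from leaking into
    # the rest of the pattern.
    return pattern_template.replace("{MonthNames}", f"(?:{alternation})")

# One template, reused for several cultures.
template = r"\d{1,2}\.? {MonthNames} \d{4}"

english = re.findall(expand(template, "en-US"), "Due 14 March 2024")
german = re.findall(expand(template, "de-DE"), "Rechnung vom 3. März 2024")

print(english, german)
```

The design point is reuse: the pattern is written once, and only the injected list changes with the culture.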

Options

Case Sensitive

Boolean

Specifies whether matching should be performed in a case-sensitive manner.

The 'Case Sensitive' property controls whether the regular expression pattern, prefix, and suffix matching performed by the Text Match extractor will distinguish between uppercase and lowercase letters.

Purpose

Enable this property when the capitalization of text is meaningful for your extraction scenario, such as distinguishing between proper names, acronyms, or case-specific labels.

Configuration Guidance

Set to true to require exact case matches (e.g., "Invoice" ≠ "invoice").
Set to false to allow matches regardless of case (e.g., "Invoice", "INVOICE", and "invoice" are all equivalent).
Use case-sensitive matching for scenarios where capitalization conveys meaning, such as extracting section headings, entity names, or codes that are case-dependent.
For most business data, case-insensitive matching (false) is recommended to maximize extraction accuracy.

Impact

When enabled, only text that matches the exact case of the pattern will be extracted.
When disabled, matches will be found regardless of case, increasing the number of potential hits.

Examples

Pattern	Case Sensitive	Matches	Non-Matches
`[A-Z]+`	`true`	`INVOICE`	`invoice`
`[a-z]+`	`true`	`invoice`	`INVOICE`
`[A-Z][a-z]*`	`true`	`Invoice`	`INVOICE`, `invoice`
`[A-Z][A-Za-z]*`	`true`	`Invoice`, `INVOICE`	`invoice`
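
The rows above can be reproduced with a small sketch. Python's `re` module stands in for Grooper's engine here, the helper `find_words` is hypothetical, and the `\b` word boundaries are an addition of this sketch so that each pattern is tested against whole words only.

```python
import re

def find_words(text, pattern, case_sensitive):
    """Return whole words in text that match pattern.

    Hypothetical helper for illustration; not part of Grooper.
    """
    # Case-insensitive matching is switched on with a flag rather than
    # by rewriting the pattern.
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.findall(rf"\b(?:{pattern})\b", text, flags)

words = "INVOICE invoice Invoice"

upper_only = find_words(words, r"[A-Z]+", True)
lower_only = find_words(words, r"[a-z]+", True)
capitalized = find_words(words, r"[A-Z][a-z]*", True)
leading_capital = find_words(words, r"[A-Z][A-Za-z]*", True)

# With case sensitivity disabled, the same pattern matches every variant.
any_case = find_words(words, r"[A-Z]+", False)

print(upper_only, lower_only, capitalized, leading_capital, any_case)
```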

> Use diagnostics to verify which matches were found and to troubleshoot case-related extraction issues.

Preprocessing

Text Preprocessor

Applies configurable text preprocessing to a document's content before regular expression extraction.

The Text Preprocessor enables advanced manipulation of control characters in a document's text, allowing regular expressions to match or ignore structural elements such as line breaks, paragraph boundaries, page breaks, tabs, and spaces.

Overview

Text preprocessing is performed immediately before extraction, transforming the document's text to improve the accuracy and flexibility of pattern matching. This is especially useful when data values span multiple lines, are separated by large whitespace gaps, or are affected by inconsistent formatting.

Key Features

Paragraph Marking:
Detects paragraph boundaries and converts line breaks within paragraphs to spaces, while preserving paragraph-ending breaks. This allows extractors to match values that span multiple lines within a paragraph, without matching across paragraph boundaries. See Paragraph Marker.
Tab Marking:
Replaces large horizontal whitespace gaps with TAB characters, making it possible to distinguish between normal spaces and significant gaps in regular expressions. See Horizontal Tab Marker.
Vertical Tab Marking:
Converts certain line breaks to vertical tab characters based on vertical spacing, enabling recognition of vertical structure in tabular or multi-column layouts. See Vertical Tab Marker.
Control Character Ignoring:
Removes or replaces selected control characters (such as spaces, newlines, form feeds, and carriage returns) according to the 'Ignore Control Characters' setting. This can simplify extraction in documents with inconsistent or excessive whitespace.

Usage Guidance

Configure the desired preprocessing options by enabling or disabling paragraph, tab, and vertical tab marking, and by selecting which control characters to ignore.
Preprocessing is typically used in conjunction with regular expression-based extractors, but can benefit any extraction scenario where document structure affects pattern matching.
For best results, adjust preprocessing settings to match the structure and formatting of your source documents.

Example Scenarios

Extracting values that span multiple lines within a paragraph:
Enable paragraph marking to convert internal line breaks to spaces, allowing regular expressions to match values split across lines.
Distinguishing between normal spaces and large gaps:
Enable tab marking to insert TAB characters at significant horizontal gaps, so extractors can target fields separated by large whitespace.
Cleaning up unwanted whitespace or control characters:
Use the 'Ignore Control Characters' option to remove or replace problematic characters that interfere with extraction.

For more details, see the documentation for Paragraph Marker, Horizontal Tab Marker, and Vertical Tab Marker.

Examples

1. Sample Document

Consider the following sample document.

┌─────────────────────────────────────────────────────────────┐
│                        SAMPLE FORM                          │
├─────────────────────────────────────────────────────────────┤
│ Name:           John Doe                   ID: 12345        │
│ Date of Birth:  01/01/1980                 Status: Active   │
├─────────────────────────────────────────────────────────────┤
│ This is the first paragraph. It explains the purpose of     │
│ the form and the meaning of each field.                     │
│                                                             │
│ Please complete all fields and verify all personal          │
│ information before submitting. Thank you!                   │ 
└─────────────────────────────────────────────────────────────┘

2. Default Control Characters

With no preprocessing options enabled, the document data will look like this. Whitespace gaps, no matter how large, are represented by a single space character. A \r\n pair marks each location where the original document wrapped to the next line.

SAMPLE FORM\r\n
Name: John Doe ID: 12345\r\n
Date of Birth: 01/01/1980 Status: Active\r\n
This is the first paragraph. It explains the purpose of\r\n
the form and the meaning of each field.\r\n
Please complete all fields and verify all personal\r\n
information before submitting. Thank you!\r\n

3. Preprocessed Version

Preprocessing the document with paragraph marking and tab marking will place a tab character '\t' at each large whitespace gap, and replace newline pairs '\r\n' occurring inside a paragraph with a space.

SAMPLE FORM\r\n
Name: John Doe\tID: 12345\r\n
Date of Birth: 01/01/1980\tStatus: Active\r\n
This is the first paragraph. It explains the purpose of the form and the meaning of each field.\r\n
Please complete all fields and verify all personal information before submitting. Thank you!\r\n
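
To see why tab marking matters, compare how one simple label/value pattern behaves on the two form lines before and after preprocessing. This is a minimal sketch in Python; the pattern and the `fields` helper are hypothetical, and the two strings are copied from the samples above.

```python
import re

# The two labeled lines as they appear with default control characters...
default_text = (
    "Name: John Doe ID: 12345\r\n"
    "Date of Birth: 01/01/1980 Status: Active\r\n"
)
# ...and with tab marking enabled.
preprocessed = (
    "Name: John Doe\tID: 12345\r\n"
    "Date of Birth: 01/01/1980\tStatus: Active\r\n"
)

# A label runs up to the first colon; a value runs up to the next tab
# or line break.
FIELD = re.compile(r"([^:\t\r\n]+): ?([^\t\r\n]+)")

def fields(text):
    """Return a dict of label -> value for every label/value pair found."""
    return {m.group(1).strip(): m.group(2).strip() for m in FIELD.finditer(text)}

without_tabs = fields(default_text)
with_tabs = fields(preprocessed)

print(without_tabs)
print(with_tabs)
```

Without tab marking, the second field on each line is absorbed into the first value, because a single space gives the pattern nothing to stop on. With tab marking, all four fields are separated cleanly.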

Chunk Size

Int32

The chunk size, in pages, to use when processing large documents.
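
Chunked processing can be pictured as running the same pattern over a few pages at a time instead of over the whole document. The sketch below is written under assumptions: pages are separated by form feed characters, the helper `extract_in_chunks` is hypothetical, and Grooper's actual chunking logic is not described on this page.

```python
import re

def extract_in_chunks(text, pattern, chunk_size):
    """Run pattern over chunk_size pages at a time.

    Pages are assumed to be separated by form feed characters.
    Returns (first page number of the chunk, matched text) tuples.
    Hypothetical helper for illustration; not part of Grooper.
    """
    pages = text.split("\f")
    results = []
    for start in range(0, len(pages), chunk_size):
        # Only this slice of pages is searched in each pass, which
        # bounds the amount of text the regex engine sees at once.
        chunk = "\f".join(pages[start:start + chunk_size])
        for m in re.finditer(pattern, chunk):
            results.append((start + 1, m.group(0)))
    return results

document = "INV-001\fno match here\fINV-002\fINV-003"

hits = extract_in_chunks(document, r"INV-\d+", chunk_size=2)
print(hits)
```

In this sketch a value that straddles a chunk boundary would not be found, so a smaller chunk size trades completeness across pages for lower memory use per pass.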

Output

Result Filter

Defines rules for filtering the result set produced by extraction operations.

Result Set Options

Configures post-processing options for a set of extracted results, enabling value normalization, confidence adjustment, sorting, filtering, and other result set controls.

Derived Types

There are 5 implementations of Text Match.

Field Match	Matches the value stored in a previously-extracted field or table column.
Label Match	Matches a list of one or more label values, using matching options defined by a Labeling Behavior.
List Match	Extracts values from document text that match any entry in a list of search terms.
Pattern Match	Extracts values from document text that match a specified regular expression pattern.
Word Match	Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.

Used By

Document Type Extract From Data Column Data Field Lexical Rules-Based Spell Corrector Auto Complete Settings Paragraph Marker Metadata Options OCR Layer Line Periodicity Detector Fixed Width Labeled Value Select Page Data Type OCR Reader Divider Anchor Simple
