Word Match

Inherits From: Text Match
Namespace: Grooper.Extract

Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.

Remarks

Word Match is designed to locate and output single words or contiguous multi-word phrases from document text. It is a foundational tool for both data extraction and document classification in Grooper, enabling the identification of context-rich features such as names, titles, and key phrases.

What It's For

The primary purpose of Word Match is to break down document text into meaningful units—words and phrases—that can be used as features for classification, or as extracted values for data fields. By capturing not just individual words, but also sequences of words (N-grams), Word Match provides richer context for downstream processes. For example, the phrase "remittance advice" is more informative than the words "remittance" and "advice" considered separately.

Typical use cases include:

Extracting person names, organization names, or other multi-word entities.
Generating features for machine learning-based classification.
Normalizing and validating extracted values against vocabularies or lists.
Supporting advanced scenarios such as correcting OCR errors or handling multilingual documents.

How It Works

Word Match operates in two main steps:

Word Identification:
The extractor scans the document text and identifies words using a regular expression. This allows for language-specific, length-specific, or error-tolerant matching.
Phrase Assembly:
Adjacent words are grouped into phrases (N-grams) of configurable length. All possible contiguous N-grams are produced, subject to join rules and optional lookups or validation.

The output consists of all valid words and phrases found in the text, ready for use in classification, extraction, or normalization workflows. Phrase extraction is especially valuable for scenarios where context matters, such as distinguishing between "John Doe" and "Doe John", or identifying key phrases for document categorization.
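The two steps above can be sketched in a few lines of Python. This is an illustration only, not Grooper's implementation: the function name `word_match`, its parameters, and the rule that the text between neighbouring words must fully match the join pattern are all assumptions made for the example. Python's `re` module does not support `\p{L}`, so `[^\W\d_]+` stands in for "a run of letters".

```python
import re

def word_match(text, word_pattern=r"[^\W\d_]+", join_pattern=r"[ ]+", phrase_size=2):
    """Hypothetical sketch: find words, then join adjacent words into N-grams."""
    # Step 1: word identification. Keep match objects so positions are available.
    words = list(re.finditer(word_pattern, text))
    phrases = []
    # Step 2: phrase assembly over every window of `phrase_size` adjacent words.
    for i in range(len(words) - phrase_size + 1):
        window = words[i:i + phrase_size]
        # Assumed join rule: the text between neighbours must match join_pattern.
        gaps = [text[a.end():b.start()] for a, b in zip(window, window[1:])]
        if all(re.fullmatch(join_pattern, gap) for gap in gaps):
            phrases.append(" ".join(m.group() for m in window))
    return phrases

print(word_match("Remittance advice for John Doe. Doe John paid."))
```

With this assumed join rule, "John Doe" and "Doe John" are both produced, but no phrase is built across the sentence boundary between them, because the gap there contains a period.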

Integration and Advanced Scenarios

Word Match works within Grooper's extraction and classification system. It supports:

Validation and normalization of extracted values using vocabularies and lookups.
Custom output formatting for phrases.
Handling of OCR errors and multilingual content.
Use in both field extraction and classification activities.

Diagnostics

When diagnostic logging is enabled, Word Match produces artifacts that help validate and troubleshoot extraction:

Logs of matched words and phrases.
Timing and performance metrics for extraction steps.
Summaries of results and any validation issues.

Properties

Name Type Description

Matching

Word Pattern

String

A regular expression pattern used to identify individual words in document text.

The 'Word Pattern' property defines the rule for what constitutes a "word" during extraction. This pattern is applied to the document text to locate sequences of characters that should be treated as words for further processing and phrase assembly.

How It Works

The extractor scans the text and matches all substrings that fit the regular expression provided.
By default, the pattern '\p{L}+' matches any sequence of Unicode letters, supporting words in any language.
You can customize the pattern to restrict matches by language, character set, or length, or to tolerate common OCR errors.

Impact

The choice of pattern directly affects which words are extracted and how phrases are assembled.
A broad pattern may capture unwanted fragments, while a restrictive pattern may miss valid words.
For OCR documents, patterns can be adapted to match common misreadings (e.g., digits in place of letters).

Examples

Pattern	Description
'\p{L}+'	Any word of letters in any language
'[A-Z]+'	Uppercase English words
'[A-Z]{3,16}'	English words, 3-16 characters long
'[A-Z][A-Z0158$]{2,15}'	OCR-tolerant uppercase words, 3-16 characters long (e.g., matches "INV0ICE" misread from "INVOICE")
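To see how the example patterns behave, the following sketch runs each one over a short sample string using Python's `re` module. The sample text and the dictionary keys are made up for the illustration. Because Python's `re` has no `\p{L}`, `[^\W\d_]+` is used as a stand-in for the default pattern; Grooper's own regex engine may differ in details.

```python
import re

# Sample text invented for this illustration; "INV0ICE" simulates an OCR misread.
text = "INVOICE INV0ICE No 42 Straße"

patterns = {
    "any letters": r"[^\W\d_]+",          # stand-in for \p{L}+
    "uppercase": r"[A-Z]+",
    "uppercase 3-16": r"[A-Z]{3,16}",
    "ocr tolerant": r"[A-Z][A-Z0158$]{2,15}",
}

results = {name: re.findall(p, text) for name, p in patterns.items()}
for name, words in results.items():
    print(f"{name}: {words}")
```

Note how the letter-only patterns split "INV0ICE" into the fragments "INV" and "ICE", while the OCR-tolerant pattern keeps it whole.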

Usage Scenarios

Extracting words for classification or feature generation.
Identifying person names, codes, or key terms in structured or unstructured text.
Improving extraction accuracy for OCR-based documents by accounting for common errors.

> Use diagnostics to review which words were matched and to tune the pattern for your data.

Prefix Pattern

String

Defines an optional prefix which must occur immediately before each match.

The 'Prefix Pattern' property allows you to specify a regular expression that must be present immediately before each match found by the Text Match extractor. This enables context-sensitive extraction, ensuring that only values preceded by a specific pattern, label, or structural element are returned.

Purpose

Use this property to restrict matches to those that occur after a particular label, whitespace, line break, or other context. This is especially useful for extracting labeled values, enforcing boundaries, or avoiding false positives in complex documents.

Configuration Guidance

Enter a regular expression that describes the required prefix. The pattern is applied directly before each match.
Common examples include whitespace, document labels, line or page boundaries, or exclusion of certain characters.
Use anchors (such as ^ for start of document, \f for page break, \n for line break) to target structural positions.
Combine multiple options using the | (OR) operator for flexibility.

Examples

Prefix Pattern	Description
`[\s-]`	Must be preceded by a whitespace character or hyphen.
`^|[\s-]`	Same as above, but also allows matches at the beginning of the document.
`Invoice No: ?`	Must be preceded by the label `Invoice No:`.
`[^\w]`	Cannot be preceded by a letter or digit.
`^|\f`	Matches must occur at the beginning of a page.
`^|\n`	Matches must occur at the beginning of a line/paragraph.
`\t`	Matches must occur after a large horizontal whitespace.
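The effect of a prefix can be approximated outside Grooper by placing the prefix in a non-capturing group in front of a captured value. The sketch below uses Python's `re` module; the helper `match_with_prefix` and the sample text are hypothetical, and Grooper's actual handling of prefixes may differ.

```python
import re

def match_with_prefix(text, value_pattern, prefix_pattern):
    """Return values that are immediately preceded by prefix_pattern."""
    # The prefix is consumed but not returned; only the captured value is kept.
    combined = re.compile(f"(?:{prefix_pattern})({value_pattern})")
    return [m.group(1) for m in combined.finditer(text)]

# Sample text invented for this illustration.
text = "Invoice No: 12345\nOrder No: 67890\nRef-555 x999"

print(match_with_prefix(text, r"\d+", r"Invoice No: ?"))
print(match_with_prefix(text, r"\d+", r"^|[\s-]"))
```

In the second call, "999" is left out because it is preceded by the letter "x" rather than by whitespace, a hyphen, or the start of the text.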

Impact

Only matches preceded by the specified pattern are included in the output.
Improves accuracy by filtering out values that do not occur in the desired context.
Can be used in combination with 'Suffix Pattern' for even more precise extraction.

Usage Scenarios

Extracting values labeled with a specific phrase (e.g., Invoice No:).
Restricting matches to the start of a page or line.
Avoiding matches that are embedded within words or unwanted regions.

> Use diagnostics to review which prefix patterns were applied and to troubleshoot extraction boundaries.

Suffix Pattern

String

Defines an optional suffix which must occur immediately after each match.

The 'Suffix Pattern' property allows you to specify a regular expression that must be present immediately after each match found by the Text Match extractor. This enables context-sensitive extraction, ensuring that only values followed by a specific pattern, label, or structural element are returned.

Purpose

Use this property to restrict matches to those that occur before a particular label, whitespace, line break, or other context. This is especially useful for extracting values with trailing units, enforcing boundaries, or avoiding false positives in complex documents.

Configuration Guidance

Enter a regular expression that describes the required suffix. The pattern is applied directly after each match.
Common examples include whitespace, document labels, line or page boundaries, or exclusion of certain characters.
Use anchors (such as $ for end of document, \f for page break, \r for line break) to target structural positions.
Combine multiple options using the | (OR) operator for flexibility.

Examples

Suffix Pattern	Description
`[\s:.-]`	Must be followed by a whitespace character, colon, period, or hyphen.
`$|[\s:.-]`	Same as above, but also allows matches at the end of the document.
`[ ]acres`	Must be followed by `acres`.
`[^\w]`	Cannot be followed by a letter or digit.
`$|\f`	Matches must occur at the end of a page.
`$|\r`	Matches must occur at the end of a line/paragraph.
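A suffix requirement behaves much like a regular expression lookahead: the text after the value is checked but not returned. The following Python sketch shows that idea; `match_with_suffix` and the sample strings are hypothetical and are not part of Grooper.

```python
import re

def match_with_suffix(text, value_pattern, suffix_pattern):
    """Return values that are immediately followed by suffix_pattern."""
    # A lookahead checks the suffix without including it in the match.
    combined = re.compile(f"(?:{value_pattern})(?={suffix_pattern})")
    return [m.group() for m in combined.finditer(text)]

# Sample text invented for this illustration.
lots = "Lot A: 40 acres, Lot B: 12 units, Lot C: 7 acres"
print(match_with_suffix(lots, r"\d+", r"[ ]acres"))

# "abc" is rejected because it is followed by a digit, not a boundary character.
print(match_with_suffix("abc1 def: ghi", r"[a-z]+", r"$|[\s:.-]"))
```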

Impact

Only matches followed by the specified pattern are included in the output.
Improves accuracy by filtering out values that do not occur in the desired context.
Can be used in combination with 'Prefix Pattern' for even more precise extraction.

Usage Scenarios

Extracting values with trailing units or labels (e.g., acres).
Restricting matches to the end of a page or line.
Avoiding matches that are embedded within words or unwanted regions.

> Use diagnostics to review which suffix patterns were applied and to troubleshoot extraction boundaries.

Environment

Environment Options

Provides configuration for merge variables and culture settings used by regex-based extractors.

Options

Case Sensitive

Boolean

Specifies whether matching should be performed in a case-sensitive manner.

The 'Case Sensitive' property controls whether the regular expression pattern, prefix, and suffix matching performed by the Text Match extractor will distinguish between uppercase and lowercase letters.

Purpose

Enable this property when the capitalization of text is meaningful for your extraction scenario, such as distinguishing between proper names, acronyms, or case-specific labels.

Configuration Guidance

Set to true to require exact case matches (e.g., "Invoice" ≠ "invoice").
Set to false to allow matches regardless of case (e.g., "Invoice", "INVOICE", and "invoice" are all equivalent).
Use case-sensitive matching for scenarios where capitalization conveys meaning, such as extracting section headings, entity names, or codes that are case-dependent.
For most business data, case-insensitive matching (false) is recommended to maximize extraction accuracy.

Impact

When enabled, only text that matches the exact case of the pattern will be extracted.
When disabled, matches will be found regardless of case, increasing the number of potential hits.

Examples

Pattern	Case Sensitive	Matches	Non-Matches
`[A-Z]+`	`true`	`INVOICE`	`invoice`
`[a-z]+`	`true`	`invoice`	`INVOICE`
`[A-Z][a-z]*`	`true`	`Invoice`	`INVOICE`, `invoice`
`[A-Z][A-Za-z]*`	`true`	`Invoice`, `INVOICE`	`invoice`
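The rows above can be reproduced with a small Python sketch. The helper `find_words` is hypothetical; it wraps each pattern in `\b` word boundaries so that only whole words count, which is an assumption made for this example rather than documented Grooper behavior. Case-insensitive matching is modelled with `re.IGNORECASE`.

```python
import re

def find_words(text, pattern, case_sensitive):
    """Find whole words matching pattern, optionally ignoring case."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # \b on both sides restricts matches to complete words (an assumption here).
    return re.findall(rf"\b(?:{pattern})\b", text, flags)

text = "Invoice INVOICE invoice"

for pattern in (r"[A-Z]+", r"[a-z]+", r"[A-Z][a-z]*", r"[A-Z][A-Za-z]*"):
    print(pattern, find_words(text, pattern, case_sensitive=True))

# With case sensitivity off, the same pattern accepts all three spellings.
print(find_words(text, r"[A-Z]+", case_sensitive=False))
```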

> Use diagnostics to verify which matches were found and to troubleshoot case-related extraction issues.

Preprocessing

Text Preprocessor

Applies configurable text preprocessing to a document's content before regular expression extraction.

The Text Preprocessor enables advanced manipulation of control characters in a document's text, allowing regular expressions to match or ignore structural elements such as line breaks, paragraph boundaries, page breaks, tabs, and spaces.

Overview

Text preprocessing is performed immediately before extraction, transforming the document's text to improve the accuracy and flexibility of pattern matching. This is especially useful when data values span multiple lines, are separated by large whitespace gaps, or are affected by inconsistent formatting.

Key Features

Paragraph Marking:
Detects paragraph boundaries and converts line breaks within paragraphs to spaces, while preserving paragraph-ending breaks. This allows extractors to match values that span multiple lines within a paragraph, without matching across paragraph boundaries. See Paragraph Marker.
Tab Marking:
Replaces large horizontal whitespace gaps with TAB characters, making it possible to distinguish between normal spaces and significant gaps in regular expressions. See Horizontal Tab Marker.
Vertical Tab Marking:
Converts certain line breaks to vertical tab characters based on vertical spacing, enabling recognition of vertical structure in tabular or multi-column layouts. See Vertical Tab Marker.
Control Character Ignoring:
Removes or replaces selected control characters (such as spaces, newlines, form feeds, and carriage returns) according to the 'Ignore Control Characters' setting. This can simplify extraction in documents with inconsistent or excessive whitespace.

Usage Guidance

Configure the desired preprocessing options by enabling or disabling paragraph, tab, and vertical tab marking, and by selecting which control characters to ignore.
Preprocessing is typically used in conjunction with regular expression-based extractors, but can benefit any extraction scenario where document structure affects pattern matching.
For best results, adjust preprocessing settings to match the structure and formatting of your source documents.

Example Scenarios

Extracting values that span multiple lines within a paragraph:
Enable paragraph marking to convert internal line breaks to spaces, allowing regular expressions to match values split across lines.
Distinguishing between normal spaces and large gaps:
Enable tab marking to insert TAB characters at significant horizontal gaps, so extractors can target fields separated by large whitespace.
Cleaning up unwanted whitespace or control characters:
Use the 'Ignore Control Characters' option to remove or replace problematic characters that interfere with extraction.

For more details, see the documentation for Paragraph Marker, Horizontal Tab Marker, and Vertical Tab Marker.

Examples

1. Sample Document

Consider the following sample document.

┌─────────────────────────────────────────────────────────────┐
│                        SAMPLE FORM                          │
├─────────────────────────────────────────────────────────────┤
│ Name:           John Doe                   ID: 12345        │
│ Date of Birth:  01/01/1980                 Status: Active   │
├─────────────────────────────────────────────────────────────┤
│ This is the first paragraph. It explains the purpose of     │
│ the form and the meaning of each field.                     │
│                                                             │
│ Please complete all fields and verify all personal          │
│ information before submitting. Thank you!                   │ 
└─────────────────────────────────────────────────────────────┘

2. Default Control Characters

With no preprocessing options enabled, the document data will look like this. Whitespace gaps, no matter how large, are represented by a single space character. A \r\n pair marks each location where the original document wrapped to the next line.

SAMPLE FORM\r\n
Name: John Doe ID: 12345\r\n
Date of Birth: 01/01/1980 Status: Active\r\n
This is the first paragraph. It explains the purpose of\r\n
the form and the meaning of each field.\r\n
Please complete all fields and verify all personal\r\n
information before submitting. Thank you!\r\n

3. Preprocessed Version

Preprocessing the document with paragraph marking and tab marking places a tab character '\t' at each large whitespace gap and replaces each newline pair '\r\n' occurring inside a paragraph with a space.

SAMPLE FORM\r\n
Name: John Doe\tID: 12345\r\n
Date of Birth: 01/01/1980\tStatus: Active\r\n
This is the first paragraph. It explains the purpose of the form and the meaning of each field.\r\n
Please complete all fields and verify all personal information before submitting. Thank you!\r\n
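The practical benefit of tab marking can be shown with a short Python sketch that reads labeled values from the first lines of the sample. The regular expression and variable names are assumptions made for this illustration; they are not Grooper settings.

```python
import re

# First lines of the preprocessed sample, with \t at the large gaps.
preprocessed = (
    "SAMPLE FORM\r\n"
    "Name: John Doe\tID: 12345\r\n"
    "Date of Birth: 01/01/1980\tStatus: Active\r\n"
)

# A value runs until a tab (large gap) or the end of the line.
field = re.compile(r"([A-Za-z ]+): ([^\t\r\n]+)")

fields = dict(field.findall(preprocessed))
print(fields)

# Without tab marking the gap is an ordinary space, so the value overruns.
unmarked = dict(field.findall("Name: John Doe ID: 12345\r\n"))
print(unmarked)
```

With the tab in place, "Name" and "ID" are read as separate fields. Without it, the same expression captures "John Doe ID: 12345" as a single value.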

Word Lookup

Value Lookup

An optional lookup used to validate, normalize, or correct individual words after extraction.

Chunk Size

Int32

The chunk size, in pages, to use when processing large documents.

Phrases

Phrase Size

Int32

Specifies the number of words to include in each extracted phrase (N-gram).

Can be one of the following values: 1, 2, 3, 4, or 5.

The 'Phrase Size' property determines how many adjacent words are grouped together to form a phrase, also known as an N-gram. This enables the extractor to capture not only single words, but also meaningful multi-word combinations that provide richer context for classification and data extraction.

How It Works

When set to 1, the extractor outputs single words (unigrams).
When set to 2, it outputs all possible two-word phrases (bigrams).
Higher values (up to 5) produce longer phrases, such as trigrams (3), four-grams (4), and five-grams (5).
All contiguous combinations of the specified size are produced, subject to join rules and validation.

Impact

Larger phrase sizes capture more context, which can improve classification accuracy and enable extraction of multi-word entities (e.g., "John Doe", "remittance advice").
The number of distinct phrases tends to grow with phrase size, especially in longer documents.
Phrase extraction is valuable for identifying key terms, names, or features that span multiple words.

Examples

For the text 'the quick brown fox':

Phrase Size 1: 'the', 'quick', 'brown', 'fox'
Phrase Size 2: 'the quick', 'quick brown', 'brown fox'
Phrase Size 3: 'the quick brown', 'quick brown fox'
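The listing above corresponds to a sliding window over the word list. A minimal Python sketch, using a hypothetical helper named `ngrams` and ignoring join rules and validation:

```python
def ngrams(words, phrase_size):
    """Return every contiguous run of `phrase_size` words, joined by a space."""
    return [
        " ".join(words[i:i + phrase_size])
        for i in range(len(words) - phrase_size + 1)
    ]

words = "the quick brown fox".split()
for n in (1, 2, 3):
    print(f"Phrase Size {n}: {ngrams(words, n)}")
```

A text of W words yields W - N + 1 phrases of size N, which is why the four-word example produces four, three, and two phrases respectively.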

Usage Scenarios

Extracting person names, organization names, or other multi-word entities.
Generating N-gram features for machine learning-based classification.
Identifying key phrases for document categorization or search.

> Use diagnostics to review the number and type of phrases produced, and adjust phrase size for your scenario.

Join Pattern

String

A regular expression pattern that determines whether two words can be joined together to form a phrase.

Term Options

Group Options[]

Configures per-term lookup and normalization options for each word in a phrase (N-gram).

Phrase Lookup

Value Lookup

An optional lookup used to validate, normalize, or correct the entire phrase (N-gram) after assembly.

Minimum Term Hits

Int32

Specifies the minimum number of term lookups that must succeed for a phrase to be considered valid.
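One way to picture this threshold is as a count of per-term lookups that succeed. The Python sketch below is a loose illustration under stated assumptions: each term lookup is modelled as a plain set of lowercase words, one per position in the phrase, and the names `phrase_is_valid`, `first_names`, and `last_names` are hypothetical. Grooper's lookups are configured through Term Options and may behave differently.

```python
def phrase_is_valid(phrase, term_vocabularies, minimum_term_hits):
    """Check whether enough terms in the phrase are found in their vocabulary."""
    terms = phrase.split()
    # Pair each term with the vocabulary for its position and count the hits.
    hits = sum(
        1 for term, vocab in zip(terms, term_vocabularies)
        if term.lower() in vocab
    )
    return hits >= minimum_term_hits

# Vocabularies invented for this illustration.
first_names = {"john", "jane"}
last_names = {"doe", "smith"}
vocabularies = [first_names, last_names]

print(phrase_is_valid("John Doe", vocabularies, 2))
print(phrase_is_valid("John Advice", vocabularies, 2))
print(phrase_is_valid("John Advice", vocabularies, 1))
```

A lower threshold accepts phrases in which only some terms are recognized, which can help when one word of a name is rare or misread.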

Output

Output Format

String

An optional format string that transforms the final output value for each phrase.

Result Filter

Defines rules for filtering the result set produced by extraction operations.

Result Set Options

Configures post-processing options for a set of extracted results, enabling value normalization, confidence adjustment, sorting, filtering, and other result set controls.

Used By

Document Type Extract From Data Column Data Field Lexical Rules-Based Spell Corrector Auto Complete Settings Paragraph Marker Metadata Options OCR Layer Line Periodicity Detector Fixed Width Labeled Value Select Page Data Type OCR Reader Divider Anchor Simple
