Label Match

Inherits From: List Match
Namespace: Grooper.Extract

Matches a list of one or more label values, using matching options defined by a Labeling Behavior.

Remarks

The Label Match extractor is designed to identify field labels, headers, or entity names in document text by matching against a configurable list of terms. It is especially useful for scenarios where labels may appear in many spelling, formatting, or layout variations, and where consistent extraction logic is needed across multiple fields or document types.

How It Works

The extractor uses the 'Vocabulary' property to define the set of label terms to match. These can be entered directly or referenced from external lexicons.
Matching options such as fuzzy matching, vertical wrap, and constrained wrap are inherited from the associated Labeling Behavior, allowing centralized configuration and reuse.
When 'Translate' is enabled and 'Vocabulary' is configured as a lookup lexicon, matched values can be normalized or replaced with standardized forms.
The extractor supports detection of labels split across multiple lines (vertical wrap) or restricted to specific regions (constrained wrap), as configured in the Labeling Behavior.
Matching is case-sensitive by default and uses preprocessing to improve accuracy.

Configuration Guidance

Define all expected label variants in the 'Vocabulary' property, including alternate spellings, abbreviations, and formatting differences.
Use a Labeling Behavior to manage fuzzy matching, vertical wrap, and constrained wrap options centrally. This ensures consistent label extraction across all fields and extractors that reference the behavior.
Enable 'Translate' and configure a lookup lexicon to normalize matched labels to a single output value, improving consistency for downstream processing.
For documents with complex layouts, adjust vertical and constrained wrap settings to capture labels that span multiple lines or regions.

Usage Scenarios

Field Label Extraction:
Extract field labels from forms, tables, or semi-structured documents, even when labels are wrapped across lines or appear with minor OCR errors.
Consistent Labeling Across Fields:
Apply a single Labeling Behavior to multiple Label Match extractors to ensure consistent handling of fuzzy matching and wrapping options throughout a project.
Entity Name Normalization:
Use translation to map multiple label variants (e.g., "International Business Machines", "IBM Corporation") to a single normalized value ("IBM").
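
The normalization scenario above can be pictured as a key-value lookup. The following is a minimal Python sketch, not Grooper's implementation: the `LOOKUP` dictionary and the `translate` function are hypothetical stand-ins for a lookup lexicon and the 'Translate' option.

```python
# Hypothetical lookup lexicon: matched label variant -> normalized output value.
LOOKUP = {
    "International Business Machines": "IBM",
    "IBM Corporation": "IBM",
}

def translate(matched_value, lookup, enabled=True):
    """Return the normalized value when translation is on, else the raw match."""
    if not enabled:
        return matched_value
    # Fall back to the matched text when the lexicon has no replacement for it.
    return lookup.get(matched_value, matched_value)

print(translate("IBM Corporation", LOOKUP))
print(translate("IBM Corporation", LOOKUP, enabled=False))
```

With translation on, both variants collapse to one output value; with it off, the matched text passes through unchanged.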

Advanced Features

Fuzzy Matching:
Tolerates minor OCR or typographical errors, increasing recall in noisy or variable documents.
Vertical and Constrained Wrapping:
Detects labels split across lines or restricted to specific regions, improving extraction in complex layouts.
Case Sensitivity and Preprocessing:
Ensures accurate matching by respecting case and applying text normalization before extraction.

Practical Tips

Regularly review and update the vocabulary to ensure all relevant label variants are included.
Test extraction with representative document samples to verify matching behavior and adjust settings as needed.
Use diagnostic logs to troubleshoot missed or incorrect matches, and refine vocabulary or behavior settings for optimal results.

Properties

Name Type Description

Matching

Local Entries

String

Specifies the local list of search terms to match in the document.

Vocabulary

List Match Entries

Specifies the vocabulary of search terms or key-value pairs used for matching and translation.

Prefix Pattern

String

Defines an optional prefix which must occur immediately before each match.

The 'Prefix Pattern' property allows you to specify a regular expression that must be present immediately before each match found by the Text Match extractor. This enables context-sensitive extraction, ensuring that only values preceded by a specific pattern, label, or structural element are returned.

Purpose

Use this property to restrict matches to those that occur after a particular label, whitespace, line break, or other context. This is especially useful for extracting labeled values, enforcing boundaries, or avoiding false positives in complex documents.

Configuration Guidance

Enter a regular expression that describes the required prefix. The pattern is applied directly before each match.
Common examples include whitespace, document labels, line or page boundaries, or exclusion of certain characters.
Use anchors (such as ^ for start of document, \f for page break, \n for line break) to target structural positions.
Combine multiple options using the | (OR) operator for flexibility.

Examples

Prefix Pattern	Description
`[\s-]`	Must be preceded by a whitespace character or hyphen.
`^|[\s-]`	Same as above, but also allows matches at the beginning.
`Invoice No: ?`	Must be preceded by the label `Invoice No:`.
`[^\w]`	Cannot be preceded by a letter or digit.
`^|\f`	Matches must occur at the beginning of a page.
`^|\n`	Matches must occur at the beginning of a line/paragraph.
`\t`	Matches must occur after a large horizontal whitespace.
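
The effect of a prefix pattern can be approximated outside Grooper with Python's `re` module. This is an illustrative sketch only: `match_with_prefix` is a hypothetical helper, and the way Grooper combines the prefix with the match internally is not documented here. The sketch places the prefix in a non-capturing group and returns only the captured value.

```python
import re

def match_with_prefix(text, pattern, prefix):
    """Return values matching `pattern` that directly follow `prefix`.

    The prefix sits in a non-capturing group, so it must be present
    but is not part of the returned value.
    """
    combined = re.compile(f"(?:{prefix})({pattern})")
    return [m.group(1) for m in combined.finditer(text)]

# Only the number that follows the label is returned.
print(match_with_prefix("Invoice No: 10023\nOrder 55871 shipped", r"\d+", r"Invoice No: ?"))

# '^|[\s-]' accepts a match at the start of the text or after whitespace/hyphen.
print(match_with_prefix("12-34 A56", r"\d+", r"^|[\s-]"))
```

In the second call, `56` is rejected because it is preceded by a letter rather than by whitespace, a hyphen, or the start of the text.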

Impact

Only matches preceded by the specified pattern are included in the output.
Improves accuracy by filtering out values that do not occur in the desired context.
Can be used in combination with 'Suffix Pattern' for even more precise extraction.

Usage Scenarios

Extracting values labeled with a specific phrase (e.g., Invoice No:).
Restricting matches to the start of a page or line.
Avoiding matches that are embedded within words or unwanted regions.

> Use diagnostics to review which prefix patterns were applied and to troubleshoot extraction boundaries.

Suffix Pattern

String

Defines an optional suffix which must occur immediately after each match.

The 'Suffix Pattern' property allows you to specify a regular expression that must be present immediately after each match found by the Text Match extractor. This enables context-sensitive extraction, ensuring that only values followed by a specific pattern, label, or structural element are returned.

Purpose

Use this property to restrict matches to those that occur before a particular label, whitespace, line break, or other context. This is especially useful for extracting values with trailing units, enforcing boundaries, or avoiding false positives in complex documents.

Configuration Guidance

Enter a regular expression that describes the required suffix. The pattern is applied directly after each match.
Common examples include whitespace, document labels, line or page boundaries, or exclusion of certain characters.
Use anchors (such as $ for end of document, \f for page break, \r for line break) to target structural positions.
Combine multiple options using the | (OR) operator for flexibility.

Examples

Suffix Pattern	Description
`[\s:.-]`	Must be followed by a whitespace character, colon, period, or hyphen.
`$|[\s:.-]`	Same as above, but also allows matches at the end.
`[ ]acres`	Must be followed by `acres`.
`[^\w]`	Cannot be followed by a letter or digit.
`$|\f`	Matches must occur at the end of a page.
`$|\r`	Matches must occur at the end of a line/paragraph.
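
A suffix requirement behaves much like a regular expression lookahead. The Python sketch below illustrates the idea under that assumption; `match_with_suffix` is a hypothetical helper and is not part of Grooper.

```python
import re

def match_with_suffix(text, pattern, suffix):
    """Return values matching `pattern` that are directly followed by `suffix`.

    The lookahead checks the suffix without consuming it, so the suffix
    is required but excluded from the returned value.
    """
    combined = re.compile(f"(?:{pattern})(?=(?:{suffix}))")
    return [m.group(0) for m in combined.finditer(text)]

# Only the number followed by ' acres' is returned.
print(match_with_suffix("Tract A: 340 acres; Tract B: 120 hectares", r"\d+", r"[ ]acres"))

# '$|[\s:.-]' accepts a match at the end of the text or before a delimiter.
print(match_with_suffix("ID 77: code 88x 99", r"\d+", r"$|[\s:.-]"))
```

In the second call, `88` is rejected because the character after it is a letter.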

Impact

Only matches followed by the specified pattern are included in the output.
Improves accuracy by filtering out values that do not occur in the desired context.
Can be used in combination with 'Prefix Pattern' for even more precise extraction.

Usage Scenarios

Extracting values with trailing units or labels (e.g., acres).
Restricting matches to the end of a page or line.
Avoiding matches that are embedded within words or unwanted regions.

> Use diagnostics to review which suffix patterns were applied and to troubleshoot extraction boundaries.

Environment

Environment Options

Provides configuration for merge variables and culture settings used by regex-based extractors.

Options

Case Sensitive

Boolean

Specifies whether matching should be performed in a case-sensitive manner.

The 'Case Sensitive' property controls whether the regular expression pattern, prefix, and suffix matching performed by the Text Match extractor will distinguish between uppercase and lowercase letters.

Purpose

Enable this property when the capitalization of text is meaningful for your extraction scenario, such as distinguishing between proper names, acronyms, or case-specific labels.

Configuration Guidance

Set to true to require exact case matches (e.g., "Invoice" ≠ "invoice").
Set to false to allow matches regardless of case (e.g., "Invoice", "INVOICE", and "invoice" are all equivalent).
Use case-sensitive matching for scenarios where capitalization conveys meaning, such as extracting section headings, entity names, or codes that are case-dependent.
For most business data, case-insensitive matching (false) is recommended to maximize extraction accuracy.

Impact

When enabled, only text that matches the exact case of the pattern will be extracted.
When disabled, matches will be found regardless of case, increasing the number of potential hits.

Examples

Pattern	Case Sensitive	Matches	Non-Matches
`[A-Z]+`	`true`	`INVOICE`	`invoice`
`[a-z]+`	`true`	`invoice`	`INVOICE`
`[A-Z][a-z]*`	`true`	`Invoice`	`INVOICE`, `invoice`
`[A-Z][A-Za-z]*`	`true`	`Invoice`, `INVOICE`	`invoice`
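
The rows in the table can be reproduced with Python's `re` module, where the `re.IGNORECASE` flag plays the role of setting 'Case Sensitive' to false. This is an analogy for illustration; `full_match` is a hypothetical helper, not a Grooper function.

```python
import re

def full_match(pattern, text, case_sensitive):
    """Report whether `text` matches `pattern` in its entirety."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(pattern, text, flags) is not None

# Case-sensitive: the character class is taken literally.
print(full_match(r"[A-Z]+", "INVOICE", case_sensitive=True))
print(full_match(r"[A-Z]+", "invoice", case_sensitive=True))

# Case-insensitive: the same pattern now accepts lowercase text.
print(full_match(r"[A-Z]+", "invoice", case_sensitive=False))
```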

> Use diagnostics to verify which matches were found and to troubleshoot case-related extraction issues.

Preprocessing

Text Preprocessor

Applies configurable text preprocessing to a document's content before regular expression extraction.

The Text Preprocessor enables advanced manipulation of control characters in a document's text, allowing regular expressions to match or ignore structural elements such as line breaks, paragraph boundaries, page breaks, tabs, and spaces.

Overview

Text preprocessing is performed immediately before extraction, transforming the document's text to improve the accuracy and flexibility of pattern matching. This is especially useful when data values span multiple lines, are separated by large whitespace gaps, or are affected by inconsistent formatting.

Key Features

Paragraph Marking:
Detects paragraph boundaries and converts line breaks within paragraphs to spaces, while preserving paragraph-ending breaks. This allows extractors to match values that span multiple lines within a paragraph, without matching across paragraph boundaries. See Paragraph Marker.
Tab Marking:
Replaces large horizontal whitespace gaps with TAB characters, making it possible to distinguish between normal spaces and significant gaps in regular expressions. See Horizontal Tab Marker.
Vertical Tab Marking:
Converts certain line breaks to vertical tab characters based on vertical spacing, enabling recognition of vertical structure in tabular or multi-column layouts. See Vertical Tab Marker.
Control Character Ignoring:
Removes or replaces selected control characters (such as spaces, newlines, form feeds, and carriage returns) according to the 'Ignore Control Characters' setting. This can simplify extraction in documents with inconsistent or excessive whitespace.

Usage Guidance

Configure the desired preprocessing options by enabling or disabling paragraph, tab, and vertical tab marking, and by selecting which control characters to ignore.
Preprocessing is typically used in conjunction with regular expression-based extractors, but can benefit any extraction scenario where document structure affects pattern matching.
For best results, adjust preprocessing settings to match the structure and formatting of your source documents.

Example Scenarios

Extracting values that span multiple lines within a paragraph:
Enable paragraph marking to convert internal line breaks to spaces, allowing regular expressions to match values split across lines.
Distinguishing between normal spaces and large gaps:
Enable tab marking to insert TAB characters at significant horizontal gaps, so extractors can target fields separated by large whitespace.
Cleaning up unwanted whitespace or control characters:
Use the 'Ignore Control Characters' option to remove or replace problematic characters that interfere with extraction.

For more details, see the documentation for Paragraph Marker, Horizontal Tab Marker, and Vertical Tab Marker.

Examples

1. Sample Document

Consider the following sample document.

┌─────────────────────────────────────────────────────────────┐
│                        SAMPLE FORM                          │
├─────────────────────────────────────────────────────────────┤
│ Name:           John Doe                   ID: 12345        │
│ Date of Birth:  01/01/1980                 Status: Active   │
├─────────────────────────────────────────────────────────────┤
│ This is the first paragraph. It explains the purpose of     │
│ the form and the meaning of each field.                     │
│                                                             │
│ Please complete all fields and verify all personal          │
│ information before submitting. Thank you!                   │ 
└─────────────────────────────────────────────────────────────┘

2. Default Control Characters

With no preprocessing options enabled, the document data will look like this. Whitespace gaps, no matter how large, are represented by a single space character. A \r\n pair marks each location where the original document wrapped to the next line.

SAMPLE FORM\r\n
Name: John Doe ID: 12345\r\n
Date of Birth: 01/01/1980 Status: Active\r\n
This is the first paragraph. It explains the purpose of\r\n
the form and the meaning of each field.\r\n
Please complete all fields and verify all personal\r\n
information before submitting. Thank you!\r\n

3. Preprocessed Version

Preprocessing the document with paragraph marking and tab marking will place a tab character '\t' at each large whitespace gap, and replace newline pairs '\r\n' occurring inside a paragraph with a space.

SAMPLE FORM\r\n
Name: John Doe\tID: 12345\r\n
Date of Birth: 01/01/1980\tStatus: Active\r\n
This is the first paragraph. It explains the purpose of the form and the meaning of each field.\r\n
Please complete all fields and verify all personal information before submitting. Thank you!\r\n
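
To see why the preprocessed text is easier to extract from, the same patterns can be run against both versions of the sample. The Python sketch below copies part of the two strings from the examples above. It only demonstrates the effect on pattern matching and does not reproduce how the Text Preprocessor detects gaps or paragraphs.

```python
import re

# Text with default control characters (no preprocessing).
default_text = (
    "SAMPLE FORM\r\n"
    "Name: John Doe ID: 12345\r\n"
    "Please complete all fields and verify all personal\r\n"
    "information before submitting. Thank you!\r\n"
)

# Text after paragraph marking and tab marking.
preprocessed_text = (
    "SAMPLE FORM\r\n"
    "Name: John Doe\tID: 12345\r\n"
    "Please complete all fields and verify all personal information "
    "before submitting. Thank you!\r\n"
)

# A value pattern that stops at a tab or line break.
name_pattern = re.compile(r"Name: ([^\t\r\n]+)")
default_name = name_pattern.search(default_text).group(1)
tabbed_name = name_pattern.search(preprocessed_text).group(1)

# A phrase that wraps across two lines in the original layout.
phrase = re.compile(r"personal information")
found_in_default = phrase.search(default_text) is not None
found_in_preprocessed = phrase.search(preprocessed_text) is not None

print(repr(default_name), repr(tabbed_name))
print(found_in_default, found_in_preprocessed)
```

Without tab marking, the value after `Name:` runs on into the neighboring `ID:` field. With tab marking, the tab character gives the pattern a boundary to stop at. Likewise, the wrapped phrase is only found once the line break inside the paragraph has been replaced with a space.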

Fuzzy Matching

FRX Options

Specifies fuzzy matching options for a regular expression.

Can be one of the following types:

Value	Description
Enabled
Disabled

Unlike a normal regular expression, which finds values exactly matching the pattern, a fuzzy regular expression (FRX) finds values which match the pattern to a specific degree of similarity, and automatically repairs the output value whenever possible.
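
The idea of matching to a degree of similarity and repairing the output can be illustrated with Python's standard `difflib` module. This is a stand-in technique chosen for illustration, not the FRX algorithm: `fuzzy_lookup`, the vocabulary, and the 0.8 similarity threshold are all assumptions.

```python
from difflib import get_close_matches

def fuzzy_lookup(candidate, vocabulary, min_similarity=0.8):
    """Return the vocabulary term closest to `candidate`, or None.

    Returning the vocabulary term rather than the input text is what
    "repairs" a value that contains OCR errors.
    """
    hits = get_close_matches(candidate, vocabulary, n=1, cutoff=min_similarity)
    return hits[0] if hits else None

vocabulary = ["Invoice Number", "Purchase Order Number"]

# Two OCR errors ('l' for 'I', 'c' for 'e') are tolerated and repaired.
print(fuzzy_lookup("lnvoice Numbcr", vocabulary))

# Unrelated text falls below the threshold and is rejected.
print(fuzzy_lookup("Total Due", vocabulary))
```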

When using FRX mode, there are a few limitations on regular expression syntax and some performance implications which need to be considered. These are outlined below.

Regular Expression Syntax

Fuzzy regular expressions support most of the syntax and features of standard regular expressions, with a handful of exceptions noted below. The following regular expression features are NOT supported in FuzzyRegEx mode:

Quantifiers: + and *. The lazy ('as few times as possible') construct (e.g. \w*?) is also not supported.
Character Escapes: \a \b \e \nnn \cX \cx \unnnn.
Character Classes: \p{name} and \P{name}.
Grouping Constructs: Only basic named and unnamed group constructs are supported.
Other: Anchors other than ^ and $, Backreference Constructs, and Alternation Constructs are unsupported.

FRX also supports an option which is unavailable in normal regular expressions. (?r) will turn on required mode, and (?-r) will turn it off. At the start of an FRX, required mode always defaults to off. Once turned on, required mode will stay on until it is turned off. This mechanism can be used, for example, to require the start of a new line. The syntax to accomplish this would be (?r)\n(?-r).

Performance Considerations

The processing time for an FRX is considerably longer than for a normal regular expression, particularly for complex patterns. The execution time is proportional to the perplexity of the regular expression, which measures the number of possible permutations in the pattern. For example:

A{1,2}B{1,2} has a perplexity of 2 * 2 = 4 (i.e. it could match AB, AAB, ABB, or AABB).
A{1,2}B{1,2}C{1,2} has a perplexity of 2 * 2 * 2 = 8.
A{1,5}B{1,5}C{1,5} has a perplexity of 5 * 5 * 5 = 125.
[0-9]{4} (miles|kilometers) has a perplexity of 1 * 2 = 2.
[0-9]{1,5} (miles|kilometers) has a perplexity of 5 * 2 = 10.
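
The arithmetic in these examples can be reproduced with a rough estimator. The Python sketch below is an assumption about how the figures are derived, not Grooper's own calculation: it multiplies the number of lengths allowed by each bounded `{m,n}` quantifier by the number of alternatives in each simple group, and ignores everything else in the pattern.

```python
import re
from math import prod

def perplexity(pattern):
    """Estimate perplexity as the product of quantifier and alternation choices."""
    factors = []
    # {m,n} allows n - m + 1 lengths; {n} allows exactly one.
    for q in re.finditer(r"\{(\d+)(?:,(\d+))?\}", pattern):
        low = int(q.group(1))
        high = int(q.group(2)) if q.group(2) else low
        factors.append(high - low + 1)
    # A non-nested group such as (a|b|c) contributes one factor per alternative.
    for g in re.finditer(r"\(([^()]*\|[^()]*)\)", pattern):
        factors.append(g.group(1).count("|") + 1)
    return prod(factors)

for p in ["A{1,2}B{1,2}", "A{1,5}B{1,5}C{1,5}", "[0-9]{1,5} (miles|kilometers)"]:
    print(p, perplexity(p))
```

Widening a single quantifier range multiplies the total, which is why patterns with several wide ranges quickly become impractical for fuzzy matching.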

There is a point at which perplexity gets so high that fuzzy matching is computationally impractical. As such, FRX is not suitable for every extraction task, and should be used with caution.

Constrained Wrap

Constrained Wrap Options

Configures how text extraction handles values that wrap across multiple lines within a bounded region, such as a table cell or box.

Can be one of the following types:

Value	Description
Enabled
Disabled

The Constrained Wrap Options class enables extraction of values that span line breaks inside a defined region, such as a table cell or boxed area. This is useful for scenarios where data (like numbers, dates, or labels) may be split across lines due to formatting or limited space.

For example, enabling this option allows a pattern like \d+ acres to match "340 acres" in the following document, even though the value wraps across two lines:

A tract containing 340
acres situated in Caddo
County, Oklahoma.

Tract Information
TRACT #54784

Table headers also frequently wrap text inside a box, as shown below:

Date of
Service

Procedure
Code

Billed
Amount

Approved
Amount

How It Works

When enabled, this option combines the text content from a region (such as a table cell) into a single string, replacing line breaks with spaces. Extraction patterns are then applied to this combined text, allowing matches that span multiple lines.

You can further constrain which regions are considered by specifying minimum and maximum values for width, height, character count, and line count using the properties below.
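
The combining step can be sketched in a few lines of Python, using the tract example from earlier in this section. The sketch assumes the region's text is already available as a single string with line breaks. Region detection and the width, height, character count, and line count limits are outside its scope.

```python
import re

# Text of one bounded region, with the line breaks found in the document.
region_text = "A tract containing 340\nacres situated in Caddo\nCounty, Oklahoma."

pattern = re.compile(r"\d+ acres")

# Applied to the raw text, the pattern fails because the value wraps.
raw_match = pattern.search(region_text)

# Replacing line breaks with spaces lets the pattern span the wrap.
combined_text = " ".join(region_text.splitlines())
wrapped_match = pattern.search(combined_text)

print(raw_match)
print(wrapped_match.group(0))
```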

Usage Guidance

Use Constrained Wrap Options when extracting data from documents where values may be split across lines within a bounded area.
Adjust the 'Width Range', 'Height Range', 'Character Count', and 'Line Count' properties to target only regions of interest and avoid false positives.
This option is especially useful for extracting data from table headers, boxed fields, or any layout where text wrapping is common.

Vertical Wrap

Vertical Wrap Detection

Configures detection of text segments that wrap vertically, enabling extraction of multi-line labels or values split across lines.

Can be one of the following types:

Value	Description
Enabled
Disabled

The Vertical Wrap Detection class enables extraction of search terms or values that are split across multiple lines in a vertical arrangement. This is especially useful for multi-word labels or values that may be wrapped due to document formatting, such as table headers or stacked field names.

For example, this option allows the extractor to find the search term "Purchase Order Number" in any of the following layouts:

The extractor groups vertically adjacent text segments that meet the alignment and spacing criteria defined by the properties below.
The combined text is compared to the set of search terms or values, allowing matches that span multiple lines.
You can control which segments are grouped by adjusting the maximum line spacing, alignment, and whether horizontal or vertical rules are allowed between lines.
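
The grouping described above can be sketched as follows. Everything in this Python example is an assumption made for illustration: the `Segment` type, the coordinate units, left-edge alignment, and the rule that line spacing is measured as a multiple of line height. Grooper's actual grouping logic and property semantics may differ.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    left: float    # x position of the left edge
    top: float     # y position of the top edge
    height: float  # line height

def wrap_candidates(segments, max_line_spacing=1.0, tolerance=2.0):
    """Return joined text for runs of stacked, left-aligned segments."""
    ordered = sorted(segments, key=lambda s: s.top)
    candidates = []
    for i, first in enumerate(ordered):
        run = [first]
        candidates.append(first.text)
        for nxt in ordered[i + 1:]:
            prev = run[-1]
            gap = nxt.top - (prev.top + prev.height)
            aligned = abs(nxt.left - first.left) <= tolerance
            # Extend the run only with aligned lines that sit closely below.
            if aligned and 0 <= gap <= max_line_spacing * prev.height:
                run.append(nxt)
                candidates.append(" ".join(s.text for s in run))
    return candidates

segments = [
    Segment("Purchase", left=10, top=100, height=10),
    Segment("Order", left=10, top=112, height=10),
    Segment("Number", left=10, top=124, height=10),
    Segment("Date", left=200, top=100, height=10),
]

candidates = wrap_candidates(segments)
print("Purchase Order Number" in candidates)
```

Each candidate string would then be compared with the search terms, so a stacked label such as "Purchase Order Number" can match even though no single line contains it.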

Usage Guidance

Use Vertical Wrap Detection when extracting multi-line labels, table headers, or any field that may be split vertically.
Adjust the 'Maximum Line Spacing', 'Alignment', and 'Alignment Tolerance' properties to fine-tune which lines are grouped.
Enable or disable 'Allow Horizontal Rule' and 'Allow Vertical Rule' to control how lines separated by graphical elements are handled.

Chunk Size

Int32

The chunk size, in pages, to use when processing large documents.

Output

Use List Case

Boolean

Controls whether output values reflect the case of the matched document text or the case of the list entry.

Translate

Boolean

Enables translation of matched values to replacement values specified in the vocabulary.

Result Filter

Defines rules for filtering the result set produced by extraction operations.

Result Set Options

Configures post-processing options for a set of extracted results, enabling value normalization, confidence adjustment, sorting, filtering, and other result set controls.

Used By

Document Type Extract From Data Column Data Field Lexical Rules-Based Spell Corrector Auto Complete Settings Paragraph Marker Metadata Options OCR Layer Line Periodicity Detector Fixed Width Labeled Value Select Page Data Type OCR Reader Divider Anchor Simple
