Grooper Help - Version 25.0
25.0.0017

Text Match

Value Extractor Grooper.Extract

Serves as the base class for value extractors that use regular expressions to locate and extract data from document text.

Remarks

The Text Match class provides a flexible and powerful foundation for building value extractors that rely on regular expressions.
It enables advanced pattern matching, prefix and suffix constraints, culture-aware extraction, and integration with Grooper's variable and lexicon system.

Overview

Text Match extractors are designed to locate and extract values from document text using regular expressions.
They support a wide range of scenarios, from simple keyword matching to complex, multi-line, and culture-specific patterns.

Key features include:

  • Prefix and Suffix Patterns:
    Configure optional regular expressions that must occur immediately before or after each match, allowing for context-sensitive extraction.
  • Environment Options:
    Inject merge variables and culture-specific lists into your patterns, supporting localization and reuse of regex snippets.
  • Case Sensitivity:
    Control whether matching is case-sensitive, enabling extraction of values where capitalization is meaningful.
  • Text Preprocessing:
    Manipulate control characters (such as line breaks, tabs, and spaces) to improve pattern matching across document structures.
  • Chunked Processing:
    For large documents, break content into page chunks to optimize performance and manage memory usage.
  • Result Filtering and Output Options:
    Apply post-processing to extracted results, including normalization, confidence adjustment, sorting, and filtering.
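
The prefix, suffix, and case-sensitivity features can be illustrated outside Grooper with Python's re module. This is only a sketch of the general idea, not Grooper's implementation: the function text_match and its parameters are hypothetical names, and it assumes the prefix and suffix are simply wrapped around the main pattern as non-capturing groups.

```python
import re

def text_match(text, value_pattern, prefix=None, suffix=None, case_sensitive=False):
    """Return every value that matches value_pattern and is bounded by the
    optional prefix and suffix patterns (hypothetical helper, not a Grooper API)."""
    parts = []
    if prefix:
        parts.append(f"(?:{prefix})")             # context required before the value
    parts.append(f"(?P<value>{value_pattern})")   # the value actually returned
    if suffix:
        parts.append(f"(?:{suffix})")             # context required after the value
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile("".join(parts), flags)
    return [m.group("value") for m in regex.finditer(text)]

sample = "Invoice No: 12345\ninvoice no:67890\nOrder No: 55555"

# Case-insensitive: both invoice labels qualify, the order label does not.
print(text_match(sample, r"\d+", prefix=r"Invoice No: ?"))
# -> ['12345', '67890']

# Case-sensitive: only the label with matching capitalization qualifies.
print(text_match(sample, r"\d+", prefix=r"Invoice No: ?", case_sensitive=True))
# -> ['12345']
```

Only the named group is returned, so the label text constrains the match without becoming part of the extracted value.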

Configuration Guidance

  • Define your main regular expression pattern in a derived extractor.
  • Use 'Prefix Pattern' and 'Suffix Pattern' to add context constraints.
  • Configure 'Environment Options' to inject variables and control culture settings.
  • Enable or disable 'Case Sensitive' matching as needed for your scenario.
  • Adjust 'Chunk Size' for optimal performance on large documents.
  • Use 'Result Filter' and 'Result Set Options' to shape the output for downstream use.
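
To make the effect of a chunk size concrete, the following sketch groups pages into fixed-size chunks and runs the pattern on one chunk at a time. It is a simplified illustration under stated assumptions: chunk_pages and extract_in_chunks are hypothetical helpers, pages are plain strings, and how Grooper actually builds and processes chunks is not described in this topic.

```python
import re

def chunk_pages(pages, chunk_size):
    """Yield (first_page_number, text) for consecutive groups of chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(pages), chunk_size):
        yield start + 1, "\n".join(pages[start:start + chunk_size])

def extract_in_chunks(pages, pattern, chunk_size):
    """Run the pattern on one chunk at a time so only that chunk's text is held
    in a single string; each hit records the first page of its chunk."""
    regex = re.compile(pattern)
    results = []
    for first_page, chunk_text in chunk_pages(pages, chunk_size):
        for m in regex.finditer(chunk_text):
            results.append((first_page, m.group(0)))
    return results

pages = ["Total: 10", "Total: 20", "no totals here", "Total: 30", "Total: 40"]
print(extract_in_chunks(pages, r"Total: \d+", chunk_size=2))
# -> [(1, 'Total: 10'), (1, 'Total: 20'), (3, 'Total: 30'), (5, 'Total: 40')]
```

One trade-off is visible in this sketch: a value that spans the boundary between two chunks cannot match, so smaller chunks reduce memory use at the cost of cross-page matches.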

Usage Scenarios

  • Extracting labeled values:
    Use a prefix pattern to require a label before the value (e.g., the pattern Invoice No: ? matches the label "Invoice No:" followed by an optional space).
  • Culture-aware extraction:
    Inject culture-specific lists (e.g., day or month names) using merge variables.
  • Multi-line and tabular data:
    Preprocess text to handle line breaks, tabs, and paragraph boundaries for robust extraction.
  • Performance optimization:
    Enable chunked processing for documents with hundreds or thousands of pages.
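
The culture-aware and multi-line scenarios can be sketched together in Python. Everything named here is an assumption made for illustration: the @MonthNames@ placeholder stands in for a merge variable (Grooper's actual variable syntax is not shown in this topic), MONTH_NAMES is a hand-written stand-in for culture-specific lists, and flatten_line_breaks is one possible way to preprocess line breaks.

```python
import re

# Hand-written stand-in for culture-specific lists (assumption, not Grooper data).
MONTH_NAMES = {
    "en-US": ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"],
    "fr-FR": ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
              "août", "septembre", "octobre", "novembre", "décembre"],
}

def inject_variables(pattern, culture):
    """Replace the hypothetical @MonthNames@ placeholder with an alternation
    built from the month names of the requested culture."""
    names = sorted(MONTH_NAMES[culture], key=len, reverse=True)  # longest first
    alternation = "(?:" + "|".join(re.escape(n) for n in names) + ")"
    return pattern.replace("@MonthNames@", alternation)

def flatten_line_breaks(text):
    """Replace each line break, plus any spaces or tabs around it, with one space."""
    return re.sub(r"[ \t]*\r?\n[ \t]*", " ", text)

template = r"\d{1,2} @MonthNames@ \d{4}"
text = "Date de facture: 3 février\n2024"

regex = re.compile(inject_variables(template, "fr-FR"))
print(regex.findall(text))                       # -> []
print(regex.findall(flatten_line_breaks(text)))  # -> ['3 février 2024']
```

The same template yields an English date pattern when the culture is "en-US", which is the reuse benefit the scenario describes. The line break inside the date defeats the pattern until the text is preprocessed.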

Diagnostics

When diagnostic logging is enabled, Text Match records detailed information about the extraction process:

  • Regular Expression Log:
    Logs the compiled regular expression used for extraction, including injected variables and culture settings.
  • Chunking Log:
    Records chunk boundaries and extraction steps for large documents.
  • Result Filtering Log:
    Details the application of result filters and output options.
  • Extraction Summary:
    Reports the number of data instances produced and highlights any errors or validation issues.

These artifacts are accessible via the diagnostic interface and can be used to validate, troubleshoot, and tune your Text Match configuration.

Properties

Name    Type    Description
Matching
Options
Output

Derived Types

There are 5 implementations of Text Match.

  • Field Match:
    Matches the value stored in a previously extracted field or table column.
  • Label Match:
    Matches a list of one or more label values, using matching options defined by a Labeling Behavior.
  • List Match:
    Extracts values from document text that match any entry in a list of search terms.
  • Pattern Match:
    Extracts values from document text that match a specified regular expression pattern.
  • Word Match:
    Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.

See Also

Used By

Notification