Grooper Help - Version 25.0
25.0.0024

Pattern Match

Text Match Grooper.Extract

Extracts values from document text that match a specified regular expression pattern.

Remarks

The Pattern Match extractor is a highly flexible value extractor designed to identify and extract text segments that match a user-defined regular expression pattern. It is ideal for scenarios where precise pattern-based extraction is required, such as dates, codes, names, or other structured data elements. The extractor supports a rich set of configuration options, enabling robust handling of complex document layouts and variable data formats.

How It Works

  • The extractor uses the 'Value Pattern' property to define the regular expression used for matching. The pattern is written in .NET regular expression syntax and can include named groups and merge variables.
  • All matches found in the document are returned as extraction results. Named groups within the pattern allow for structured output, with each group accessible by name.
  • Fuzzy matching can be enabled to tolerate minor OCR or typographical errors, increasing recall in noisy or variable documents.
  • Preprocessing and postprocessing options allow for normalization, filtering, lookups, and output formatting, supporting a wide range of extraction scenarios.
  • Constrained extraction restricts matching to specific regions or contexts within the document, improving accuracy in structured layouts.
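
The matching behavior described above can be sketched outside Grooper. The following is a minimal Python illustration, not Grooper's implementation: it returns every match in a text and exposes each named group by name. The pattern, the group names (Month, Day, Year), and the extract_all helper are assumptions made for the example. Note that Python writes named groups as (?P<Name>...), whereas the .NET syntax used in 'Value Pattern' writes them as (?<Name>...).

```python
import re

# Hypothetical date pattern with three named groups.
# .NET equivalent: (?<Month>\d{1,2})/(?<Day>\d{1,2})/(?<Year>\d{4})
DATE_PATTERN = re.compile(r"(?P<Month>\d{1,2})/(?P<Day>\d{1,2})/(?P<Year>\d{4})")


def extract_all(text):
    """Return every match as a dict holding the full value and its named groups."""
    results = []
    for match in DATE_PATTERN.finditer(text):
        results.append({"value": match.group(0), "groups": match.groupdict()})
    return results


sample = "Invoice date: 03/14/2024. Due date: 4/1/2024."
matches = extract_all(sample)
for item in matches:
    print(item["value"], item["groups"])
```

Both dates in the sample are returned, each with its own Month, Day, and Year values, which mirrors the idea that all matches become extraction results.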

Configuration Guidance

  • Define the regular expression in 'Value Pattern' to match the desired data format. Use named groups for structured extraction and output formatting.
  • Enable fuzzy matching for documents with variable quality or frequent OCR errors, but be aware of syntax limitations and performance considerations.
  • Use 'Output Format' to reformat extracted values, referencing named groups with {GroupName} placeholders. Typecasting and format specifiers are supported for advanced scenarios.
  • Configure group options and lookups to validate, normalize, or translate extracted values, improving data quality and consistency.
  • Use constrained wrap options to limit extraction to relevant regions, such as table cells or form fields.
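
As a rough analogy for 'Output Format', the sketch below reformats a matched date by filling {GroupName} placeholders from the named groups. It uses Python's str.format, whose placeholder and format-specifier syntax resembles but is not the same as Grooper's; the int cast stands in for typecasting. The pattern, the template, and the format_match helper are assumptions for illustration only.

```python
import re

# Hypothetical pattern; Python's (?P<Name>...) corresponds to .NET's (?<Name>...).
PATTERN = re.compile(r"(?P<Month>\d{1,2})/(?P<Day>\d{1,2})/(?P<Year>\d{4})")


def format_match(match, template):
    """Fill {GroupName} placeholders in template from the match's named groups."""
    # Cast each group to int so numeric format specifiers such as :02d apply.
    values = {name: int(text) for name, text in match.groupdict().items()}
    return template.format(**values)


m = PATTERN.search("Due date: 4/1/2024")
iso_date = format_match(m, "{Year:04d}-{Month:02d}-{Day:02d}")
print(iso_date)  # prints 2024-04-01
```

Casting before formatting is what lets a single-digit month or day be zero-padded, so variable input formats converge on one standardized output.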

Usage Scenarios

  • Date Extraction: Extract dates in various formats and output them in a standardized format using named groups and output formatting.
  • Code or Identifier Extraction: Match structured codes, account numbers, or identifiers with complex patterns and validate them using lookups.
  • Table and Section Extraction: Use named groups to map extracted values into table columns or structured data fields for downstream processing.
  • Entity Extraction: Identify and extract names, addresses, or other entities using flexible regular expressions and normalization options.
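
The code and identifier scenario can be sketched as a pattern match followed by a lookup check. This is an illustrative Python example under stated assumptions, not Grooper's lookup feature: the account number layout, the KNOWN_REGIONS table, and the extract_valid_accounts helper are all hypothetical.

```python
import re

# Hypothetical identifier layout: ACCT-<two-letter region>-<six digits>.
ACCOUNT_PATTERN = re.compile(r"\bACCT-(?P<Region>[A-Z]{2})-(?P<Number>\d{6})\b")

# Hypothetical lookup table used to validate and translate the Region group.
KNOWN_REGIONS = {"OK": "Oklahoma", "TX": "Texas"}


def extract_valid_accounts(text):
    """Return (identifier, region name) pairs whose Region group is in the lookup."""
    valid = []
    for match in ACCOUNT_PATTERN.finditer(text):
        region_name = KNOWN_REGIONS.get(match.group("Region"))
        if region_name is not None:  # drop matches that fail the lookup
            valid.append((match.group(0), region_name))
    return valid


accounts = extract_valid_accounts(
    "Refs: ACCT-OK-004512, ACCT-ZZ-999999, ACCT-TX-120077"
)
print(accounts)
```

The identifier with region ZZ matches the pattern but is discarded because it is absent from the lookup, which is the kind of false positive a lookup is meant to filter out.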

Advanced Features

  • Merge Variables: Inject external values or patterns into the regular expression using @VariableName syntax.
  • Named Groups: Structure output by assigning names to subexpressions, enabling direct mapping to child elements or output fields.
  • Fuzzy Matching: Allow approximate matches to correct minor errors, increasing recall in variable or degraded documents.
  • Output Formatting: Reformat extracted values using group placeholders, typecasting, and .NET format specifiers for normalization and presentation.
  • Constrained Extraction: Restrict matching to specific regions or contexts, improving accuracy in structured or multi-column layouts.
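
To make the merge variable idea concrete, the sketch below substitutes @VariableName tokens in a pattern before matching. It is a simplified Python illustration: the merge_variables helper, the VendorName variable, and the choice to escape injected values so they match literally are assumptions. Grooper may inject values differently, for example by allowing the variable to carry a pattern rather than a literal.

```python
import re


def merge_variables(pattern, variables):
    """Replace each @Name token in pattern with the escaped literal from variables."""
    def substitute(token):
        name = token.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for @{name}")
        # Escape so characters such as '.' match literally (an assumption here).
        return re.escape(variables[name])

    return re.sub(r"@([A-Za-z_]\w*)", substitute, pattern)


merged = merge_variables(
    r"@VendorName\s+Invoice\s+(?P<Number>\d+)",
    {"VendorName": "Acme Co."},
)
m = re.search(merged, "Acme Co. Invoice 10045")
print(m.group("Number"))  # prints 10045
```

Because the injected value is escaped in this sketch, the period in the vendor name matches only a literal period rather than any character.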

Diagnostics

When enabled, diagnostic logging captures information about extraction results, matched patterns, group values, and any errors encountered during processing. This can be used to troubleshoot missed matches, validate regular expression logic, and optimize extractor configuration.

Practical Tips

  • Test regular expressions in the built-in editor, or in an online tester set to the .NET flavor, to confirm correct syntax and expected matches.
  • Use named groups for structured extraction and output formatting, especially when mapping results to tables or fields.
  • Enable fuzzy matching judiciously, as it may impact performance on complex patterns.
  • Regularly review extraction results and diagnostic logs to refine patterns, group options, and normalization settings.
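
The performance caution about fuzzy matching can be illustrated with a naive approximate search. The sketch below is not Grooper's fuzzy algorithm: it slides a window across the text and scores each window against a target string with Python's difflib.SequenceMatcher. The fuzzy_find helper, the 0.8 threshold, and the sample strings are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_find(text, target, threshold=0.8):
    """Return (start, window, score) for the best-scoring window, or None."""
    best = None
    size = len(target)
    # One similarity comparison per starting offset: this loop is the cost driver.
    for start in range(0, len(text) - size + 1):
        window = text[start:start + size]
        score = SequenceMatcher(None, window.lower(), target.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (start, window, score)
    return best


# OCR-style errors: 'l' read for 'I' and 'c' read for 'e'.
hit = fuzzy_find("lnvoice Numbcr: 88231", "Invoice Number")
print(hit)
```

The degraded label is still found even though two characters are wrong, but every window requires its own comparison, so the work grows with both text length and target length. That trade-off is why fuzzy matching is best reserved for patterns and documents that need it.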

Properties

Name    Type    Description
Matching
Options
Lookup
Output
