Grooper Help - Version 25.0
25.0.0017 2,127

List Match

Text Match Grooper.Extract

Extracts values from document text that match any entry in a list of search terms.

Remarks

The List Match extractor is designed to identify and extract text segments that correspond to a set of defined terms, such as field labels, headers, entity names, or classification features. It is ideal for scenarios where the same concept may be represented by multiple spelling, formatting, or layout variations across documents.

How It Works

  • The extractor uses the 'Vocabulary' property to define the set of terms to match. These can be entered directly as local entries or referenced from external lexicons.
  • Each entry in the vocabulary is treated as a distinct search term. When a match is found in the document, the value is extracted and included in the output.
  • If the vocabulary is configured as a lookup lexicon and 'Translate' is enabled, matched values are replaced with their normalized or abbreviated forms, supporting consistent output for downstream processing.
  • Matching can be enhanced with fuzzy matching, allowing approximate matches to correct minor OCR or typographical errors.
  • Advanced options such as vertical wrapping and constrained wrapping enable detection of terms split across multiple lines or restricted to specific regions, improving extraction in complex layouts.
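
The core matching step described above can be illustrated outside of Grooper with a short Python sketch. This is not Grooper's implementation or API: the function name list_match, the case-insensitive comparison, and the longest-term-first ordering are all assumptions made for illustration.

```python
import re

def list_match(text, vocabulary):
    """Return (start, end, matched_text) for every vocabulary term found in text.

    Hypothetical helper for illustration only; not part of Grooper.
    """
    # Sort longest-first so a longer entry wins over a shorter one that
    # starts at the same position (e.g. "Invoice Number" over "Invoice No").
    terms = sorted(set(vocabulary), key=len, reverse=True)
    # Assumption: matching is case-insensitive and terms must not be
    # embedded inside a longer word.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

vocabulary = ["Invoice Number", "Invoice No", "Inv #"]
text = "INVOICE NO: 10045\nInv # 10046"
hits = list_match(text, vocabulary)
for start, end, value in hits:
    print(start, end, value)
```

Each vocabulary entry acts as its own search term, and every occurrence in the text becomes one result, which mirrors the behavior described in the second bullet.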

Configuration Guidance

  • Define all expected term variants in the 'Vocabulary' property, including alternate spellings, abbreviations, and formatting differences.
  • Use local entries for field-specific lists, or reference external lexicons for shared or large lists.
  • Enable 'Translate' and configure key-value pairs in the vocabulary to normalize output values. For example, the entry "International Business Machines=IBM" outputs "IBM" when "International Business Machines" is matched.
  • Adjust fuzzy matching settings to tolerate OCR errors or minor spelling differences, especially in noisy documents.
  • Use vertical and constrained wrap options to capture terms that span multiple lines or are confined to specific regions.
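
To make the key-value translation concrete, here is a minimal Python sketch of how entries written as key=value could be turned into a lookup table. The helper names parse_vocabulary and translate are hypothetical, and the choices to compare keys case-insensitively and to pass unmatched text through unchanged are assumptions; Grooper's actual lexicon handling may differ.

```python
def parse_vocabulary(entries):
    """Split 'key=value' entries into a lookup dict.

    Hypothetical helper. Entries without '=' map to themselves.
    """
    lookup = {}
    for entry in entries:
        key, sep, value = entry.partition("=")
        key = key.strip()
        # Assumption: keys are compared case-insensitively.
        lookup[key.lower()] = value.strip() if sep else key
    return lookup

def translate(matched_text, lookup):
    # Fall back to the matched text when no entry is defined for it.
    return lookup.get(matched_text.lower(), matched_text)

entries = [
    "International Business Machines=IBM",
    "IBM Corporation=IBM",
    "Acme",
]
lookup = parse_vocabulary(entries)
print(translate("International Business Machines", lookup))
print(translate("IBM Corporation", lookup))
```

Both variants resolve to the same normalized value, which is the effect the 'Translate' guidance above is aiming for.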

Usage Scenarios

  • Field Label Extraction:
    Extract field labels or headers from forms, tables, or semi-structured documents, even when labels are wrapped across lines or appear with minor variations.
  • Entity Name Normalization:
    Map multiple label variants (e.g., "International Business Machines", "IBM Corporation") to a single normalized value ("IBM") for consistent classification or export.
  • Classification Feature Extraction:
    Identify and extract features used for document classification, supporting robust recognition of document types with variable terminology.

Advanced Features

  • Fuzzy Matching:
    Allows approximate matches to correct minor errors, increasing recall in variable or degraded documents.
  • Vertical Wrapping:
    Detects terms split across multiple lines, such as column headers in tabular data.
  • Constrained Wrapping:
    Restricts extraction to specific areas of the document, improving accuracy in structured layouts.
  • Case Handling:
    The 'Use List Case' property controls whether output values reflect the case of the matched document text or the case of the list entry.
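
The interaction between fuzzy matching and case handling can be sketched in Python as follows. This is an illustration under stated assumptions, not Grooper's algorithm: it uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure, compares word windows of the same length as each term, and uses a hypothetical threshold of 0.85. The use_list_case parameter only mimics the idea behind the 'Use List Case' property.

```python
from difflib import SequenceMatcher

def fuzzy_list_match(text, vocabulary, threshold=0.85, use_list_case=False):
    """Return (value, score) pairs for word windows that approximately match a term.

    Hypothetical helper; the similarity measure and threshold are assumptions.
    """
    words = text.split()
    results = []
    for term in vocabulary:
        n = len(term.split())
        # Slide a window with the same number of words as the term.
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            score = SequenceMatcher(None, candidate.lower(), term.lower()).ratio()
            if score >= threshold:
                # Output either the list entry or the text as it appears in the document.
                value = term if use_list_case else candidate
                results.append((value, score))
    return results

text = "lnvoice Number 10045"  # OCR misread "I" as "l"
vocabulary = ["Invoice Number"]
print(fuzzy_list_match(text, vocabulary))
print(fuzzy_list_match(text, vocabulary, use_list_case=True))
```

With use_list_case=False the result keeps the document text, including the OCR error; with use_list_case=True the clean list entry is returned instead.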

Practical Tips

  • Regularly review and update the vocabulary to ensure all relevant term variants are included.
  • Test extraction with representative document samples to verify matching behavior and adjust settings as needed.
  • Use diagnostic logs to troubleshoot missed or incorrect matches, and refine vocabulary or matching options for optimal results.
  • For translation scenarios, ensure all key-value pairs are correctly defined in the vocabulary to avoid unexpected output.

For more details, see the documentation for each property and the List Match wiki page.

Properties

Name    Type    Description
Matching
Options
Output

Derived Types

There is 1 implementation of List Match.

Label Match: Matches a list of one or more label values, using matching options defined by a Labeling Behavior.

See Also

Used By

Notification