Grooper Help - Version 25.0
25.0.0017 2,127

Word Match

Text Match Grooper.Extract

Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.

Remarks

Word Match is designed to locate and output single words or contiguous multi-word phrases from document text. It is a foundational tool for both data extraction and document classification in Grooper, enabling the identification of context-rich features such as names, titles, and key phrases.

What It's For

The primary purpose of Word Match is to break document text into meaningful units (words and phrases) that can be used as features for classification or as extracted values for data fields. By capturing sequences of words (N-grams) as well as individual words, Word Match provides richer context for downstream processes. For example, the phrase "remittance advice" is more informative than the words "remittance" and "advice" considered separately.
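
The "remittance advice" point can be illustrated with a small Python sketch. This is not Grooper code: the `ngram_features` helper, the word pattern, and both sample texts are assumptions made up for the illustration. It shows two texts that share the individual words but not the phrase.

```python
import re

def ngram_features(text, n):
    """Return the set of contiguous n-word phrases in text (hypothetical helper)."""
    # Assumed word pattern: runs of ASCII letters, compared case-insensitively.
    words = re.findall(r"[a-z]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

doc_a = "Remittance advice attached"   # sample text, not from Grooper
doc_b = "Legal advice on remittance"   # sample text, not from Grooper

# Single words cannot tell the two texts apart on these terms...
shared_words = ngram_features(doc_a, 1) & ngram_features(doc_b, 1)
# ...but two-word phrases can: only doc_a contains "remittance advice".
shared_phrases = ngram_features(doc_a, 2) & ngram_features(doc_b, 2)

print(sorted(shared_words))   # ['advice', 'remittance']
print(shared_phrases)         # set()
```

Both texts contain "remittance" and "advice", so word-level features overlap, while the phrase-level features separate them.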

Typical use cases include:

  • Extracting person names, organization names, or other multi-word entities.
  • Generating features for machine learning-based classification.
  • Normalizing and validating extracted values against vocabularies or lists.
  • Supporting advanced scenarios such as correcting OCR errors or handling multilingual documents.

How It Works

Word Match operates in two main steps:

  1. Word Identification:
    The extractor scans the document text and identifies words using a regular expression, which allows the word pattern to be language-specific, length-specific, or error-tolerant.
  2. Phrase Assembly:
    Adjacent words are grouped into phrases (N-grams) of configurable length. All possible contiguous N-grams are produced, subject to join rules and optional lookups or validation.
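
The two steps above can be sketched in Python. This is a minimal illustration of the general technique, not Grooper's implementation: the function names, the default word pattern, and the `max_words` and `separator` parameters are assumptions, and the sketch leaves out join rules, lookups, and validation.

```python
import re

def find_words(text, word_pattern=r"[A-Za-z]+"):
    """Step 1 (sketch): identify words with a regular expression."""
    return re.findall(word_pattern, text)

def build_phrases(words, max_words=3, separator=" "):
    """Step 2 (sketch): emit every contiguous N-gram of 1..max_words words."""
    phrases = []
    for start in range(len(words)):
        for n in range(1, max_words + 1):
            if start + n > len(words):
                break  # not enough words left for a longer phrase
            phrases.append(separator.join(words[start:start + n]))
    return phrases

words = find_words("Remittance advice enclosed")
phrases = build_phrases(words, max_words=2)
print(phrases)
# ['Remittance', 'Remittance advice', 'advice', 'advice enclosed', 'enclosed']
```

Changing `word_pattern` in this sketch (for example to `r"[A-Za-z]{3,}"` to skip short words) shows how a different regular expression changes which words reach the phrase step.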

The output consists of all valid words and phrases found in the text, ready for use in classification, extraction, or normalization workflows. Phrase extraction is especially valuable for scenarios where context matters, such as distinguishing between "John Doe" and "Doe John", or identifying key phrases for document categorization.

Integration and Advanced Scenarios

Word Match works within Grooper's extraction and classification system. It supports:

  • Validation and normalization of extracted values using vocabularies and lookups.
  • Custom output formatting for phrases.
  • Handling of OCR errors and multilingual content.
  • Use in both field extraction and classification activities.
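
As a rough illustration of validating and normalizing a phrase against a vocabulary while tolerating OCR errors, here is a Python sketch built on the standard library's `difflib`. It is not how Grooper performs lookups: the `VOCABULARY` list, the `normalize` function, and the `cutoff` value are all assumptions chosen for the example.

```python
import difflib

# Hypothetical vocabulary of canonical phrases (not taken from Grooper).
VOCABULARY = ["remittance advice", "invoice", "purchase order"]

def normalize(phrase, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the canonical vocabulary entry for phrase, or None if none is close.

    Exact matches (ignoring case and surrounding whitespace) are accepted
    first; otherwise the closest entry is used if its similarity ratio is
    at least cutoff.
    """
    candidate = phrase.strip().lower()
    if candidate in vocabulary:
        return candidate
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize("Remittance Advice"))   # remittance advice
print(normalize("rernittance advice"))  # remittance advice ("m" misread as "rn")
print(normalize("hello world"))         # None
```

A higher `cutoff` rejects more near-misses and a lower one accepts more, so in this sketch the threshold is the trade-off between missing OCR-damaged values and accepting wrong ones.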

Diagnostics

When diagnostic logging is enabled, Word Match produces artifacts that help validate and troubleshoot extraction:

  • Logs of matched words and phrases.
  • Timing and performance metrics for extraction steps.
  • Summaries of results and any validation issues.

Properties

Name | Type | Description
Matching
Options
Phrases
Output

See Also

Used By

Notification