Grooper Help - Version 25.0
25.0.0017

Fuzzy Regular Expression

Grooper.Core

Represents a fuzzy regular expression, enabling pattern matching with tolerance for errors, variations, or misspellings.

Remarks

Overview

The Fuzzy Regular Expression class provides advanced pattern matching capabilities that allow for inexact matches between a pattern and input text. Unlike standard regular expressions, which require exact matches, a fuzzy regular expression (FRX) can identify values that are similar to the pattern, even if they contain typos, substitutions, insertions, or deletions. This is especially useful for extracting data from noisy, inconsistent, or error-prone sources such as OCR output, scanned documents, or user-entered text.

Fuzzy regular expressions are a core component of Grooper's data extraction engine, enabling robust recognition of structured and semi-structured data where perfect accuracy cannot be guaranteed.

Usage and Configuration

  • Define a fuzzy pattern using FRX syntax, which resembles standard regular expression syntax but omits some features (see Syntax and Limitations).
  • Configure matching options such as case sensitivity, required mode, and custom cost maps to control how errors are penalized.
  • Use the fuzzy matching engine to search for all occurrences of the pattern in a given input, specifying a minimum similarity threshold to filter results.
  • FRX supports both single-threaded and multi-threaded execution for efficient processing of complex patterns or large input sets.
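The search-with-threshold step can be pictured with a generic approximate substring search. The sketch below is not Grooper's API: fuzzy_find, its parameters, and the scoring formula (one minus edit cost divided by pattern length) are assumptions made for illustration, and every error costs 1 instead of coming from a cost map.

```python
from typing import List, Tuple


def fuzzy_find(pattern: str, text: str, min_similarity: float) -> List[Tuple[int, float]]:
    """Illustrative only; not Grooper's API.

    Return (end_index, similarity) for every position in text where an
    approximate occurrence of pattern ends with similarity >= min_similarity.
    end_index is exclusive, so text[:end_index] ends with the occurrence.
    """
    m = len(pattern)
    if m == 0:
        return []
    # prev[j] = cheapest cost of aligning pattern[:i] with some substring of
    # text that ends at j. Row 0 is all zeros: a match may start anywhere.
    prev = [0] * (len(text) + 1)
    for i in range(1, m + 1):
        curr = [i]  # pattern[:i] against empty text costs i deletions
        for j in range(1, len(text) + 1):
            sub = 0 if pattern[i - 1] == text[j - 1] else 1
            curr.append(min(prev[j] + 1,         # pattern char missing from text
                            curr[j - 1] + 1,     # extra char in text
                            prev[j - 1] + sub))  # match or substitution
        prev = curr
    hits = []
    for end in range(1, len(text) + 1):
        score = 1.0 - prev[end] / m  # assumed similarity formula
        if score >= min_similarity:
            hits.append((end, score))
    return hits


hits = fuzzy_find("NET PAYMENT", "TOTAL NFT PAYMENT DUE", min_similarity=0.9)
print(hits)
```

With a threshold of 0.9, only the position where "NFT PAYMENT" ends qualifies, because a single substitution in an 11-character pattern scores 10/11. Lowering the threshold admits neighbouring end positions as additional, weaker hits.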

How It Works

  • The pattern is parsed into a set of runtime expressions, representing all valid permutations of the fuzzy pattern.
  • For each input value, the engine computes the optimal alignment between the pattern and the text, allowing for insertions, deletions, and substitutions as defined by the cost map.
  • Each match is scored based on its similarity to the pattern, and only those above the specified threshold are returned.
  • Grouping and named capture are supported, allowing extraction of subfields or components from the matched text.
  • The matching process can be tuned for best value (longest match) or least cost (highest confidence), depending on the extraction scenario.
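The alignment and scoring steps above can be sketched as a weighted edit distance. This is a minimal illustration under stated assumptions, not the engine's implementation: COST_MAP, its values, the uniform insertion/deletion cost, and the similarity formula are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical cost map: substitutions that OCR commonly confuses are cheaper
# than the default cost of 1.0. These values are illustrative only.
COST_MAP: Dict[Tuple[str, str], float] = {
    ("E", "F"): 0.5,
    ("O", "0"): 0.25,
    ("I", "1"): 0.25,
}


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing pattern character a with input character b."""
    if a == b:
        return 0.0
    return COST_MAP.get((a, b), COST_MAP.get((b, a), 1.0))


def alignment_cost(pattern: str, text: str, indel: float = 1.0) -> float:
    """Minimum total cost of aligning pattern with text (weighted edit distance)."""
    prev = [j * indel for j in range(len(text) + 1)]
    for i, p in enumerate(pattern, start=1):
        curr = [i * indel]
        for j, t in enumerate(text, start=1):
            curr.append(min(
                prev[j] + indel,                        # pattern char missing from text
                curr[j - 1] + indel,                    # extra char in text
                prev[j - 1] + substitution_cost(p, t),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def similarity(pattern: str, text: str) -> float:
    """Assumed score in [0, 1]; 1.0 means an exact match."""
    if not pattern:
        return 1.0 if not text else 0.0
    return max(0.0, 1.0 - alignment_cost(pattern, text) / len(pattern))


print(similarity("PAYMENT", "PAYMFNT"))  # cheap E/F confusion
print(similarity("PAYMENT", "PAYMXNT"))  # full-cost substitution
```

Because the E/F confusion costs less than an arbitrary substitution, "PAYMFNT" scores higher than "PAYMXNT" even though both differ from the pattern by one character. That is the effect a cost map is meant to have.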

Syntax and Limitations

Fuzzy regular expressions support most standard regex features, but with some restrictions:

  • Variable-length quantifiers (such as * and +) and certain anchors are not allowed.
  • Only basic named and unnamed group constructs are supported.
  • Some character escapes and advanced constructs are not available in FRX mode.
  • Required mode can be toggled within the pattern using the (?r) and (?-r) syntax.

For a full list of supported and unsupported features, see the documentation for FRX syntax.

Performance Considerations

Fuzzy matching is computationally intensive, especially for patterns with high perplexity (many possible permutations). Reserve FRX for scenarios where inexact matching is essential, and test performance on representative data sets. For simple, exact matches, use standard regular expressions instead.
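To see why permutations add up quickly, the following sketch counts the literal strings a pattern could expand to. It assumes a simplified model in which a pattern is a sequence of elements, each with a list of alternatives; the function name, the model, and the second pattern are hypothetical, and the engine may count or prune permutations differently.

```python
import math
from typing import List


def permutation_count(elements: List[List[str]]) -> int:
    """Number of literal strings produced by picking one alternative per element."""
    return math.prod(len(alternatives) for alternatives in elements)


# (NET )?PAYMENT: one optional group followed by a literal.
simple = [["NET ", ""], ["PAYMENT"]]

# A hypothetical pattern with three optional groups around the same literal.
three_optional = [["NET ", ""], ["PAYMENT"], [" DUE", ""], [" DATE", ""]]

print(permutation_count(simple))          # 2
print(permutation_count(three_optional))  # 8
```

Under this model each optional group doubles the count, so a pattern with n independent optional groups expands to 2^n candidates to align.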

Example

The following FRX pattern matches the phrase "NET PAYMENT", with the leading "NET " optional, while tolerating common OCR errors: (NET )?PAYMENT

This pattern matches "NET PAYMENT", "NFT PAYMENT", or "PAYMENT". The confidence score of each match depends on how closely the input resembles the pattern.
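The example can be approximated outside Grooper with a small sketch that expands the optional group into its permutations and scores an input against each one. The helper names, the unweighted edit distance, and the similarity formula are assumptions for illustration. The scores it prints are not the confidence values Grooper would report.

```python
import itertools
import re
from typing import List, Tuple


def expand(pattern: str) -> List[str]:
    """Expand literal text with non-nested optional groups, e.g. '(NET )?PAYMENT'."""
    parts = re.split(r"\(([^()]*)\)\?", pattern)
    # With one capture group, re.split alternates: literal, group, literal, ...
    choices = [[p] if i % 2 == 0 else [p, ""] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance; every error costs 1 (no cost map)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def best_match(pattern: str, text: str) -> Tuple[str, float]:
    """Return the closest expansion of pattern and its assumed similarity to text."""
    scored = []
    for variant in expand(pattern):
        longest = max(len(variant), len(text), 1)
        scored.append((variant, 1.0 - edit_distance(variant, text) / longest))
    return max(scored, key=lambda item: item[1])


for candidate in ["NET PAYMENT", "NFT PAYMENT", "PAYMENT"]:
    variant, score = best_match("(NET )?PAYMENT", candidate)
    print(f"{candidate!r} -> {variant!r} ({score:.2f})")
```

Under this scoring, "NET PAYMENT" and "PAYMENT" each match one of the two permutations exactly, while "NFT PAYMENT" aligns with "NET PAYMENT" at one substitution out of 11 characters, or about 0.91.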

Best Practices

  • Use clear, targeted patterns to minimize unnecessary permutations.
  • Adjust the minimum similarity threshold to balance recall and precision.
  • Review match results and confidence scores to ensure data quality.
  • Avoid using FRX for patterns with excessive complexity unless absolutely necessary.

Diagnostics

During matching, diagnostic information such as match confidence, alignment paths, and group results can be collected and reviewed to troubleshoot extraction issues or optimize pattern design.

For more information, see the documentation for fuzzy regular expressions, cost maps, and Grooper's data extraction engine.
