Grooper Help - Version 25.0
25.0.0017 2,127

Single - Fuzzy Match Cost Map

String Single Grooper.Core

Defines the cost and weighting rules for character substitutions, insertions, and deletions in fuzzy regular expressions.

Remarks

Overview

The Fuzzy Match Cost Map class configures how Grooper evaluates the "cost" of character-level differences between a fuzzy regular expression pattern and input text. It determines the penalty for substitutions (swaps), insertions, and deletions, directly impacting which matches are considered valid and their similarity scores. This class is typically populated from a Fuzzy Match Weightings lexicon, allowing fine-grained control over fuzzy matching behavior.

Entry Format for Fuzzy Match Weightings Lexicon

Each entry in a Fuzzy Match Weightings lexicon defines a specific cost or rule. Entries are key-value pairs: the key specifies the operation or character(s), and the value is a numeric cost (float), except for immutable entries, whose value is the list of characters to protect. The following formats are supported:

1. Global Operation Costs

  • swap = value
    Sets the base cost for any character substitution.
  • insert = value
    Sets the base cost for inserting any character.
  • delete = value
    Sets the base cost for deleting any character.

2. Character-Specific Costs

  • swap([A]) = value
    Sets the cost for substituting character A with any other character.
  • insert([A]) = value
    Sets the cost for inserting character A.
  • delete([A]) = value
    Sets the cost for deleting character A.
  • cost([A]) = value
    Sets the same cost for swap, insert, and delete operations for character A.

3. Pairwise Substitution Costs

  • AB = value
    Sets the cost for substituting character A with B.
  • swap(AB) = value
    Equivalent to the above; sets the cost for substituting A with B.
  • swap([[AEIOU]], [[aeiou]]) = value
    Sets the cost for substituting any character in the first set with any character in the second set (here, any uppercase vowel with any lowercase vowel).

4. Immutable Characters

  • immutable = [[chars]]
    Marks the listed characters as immutable, meaning they cannot be substituted, inserted, or deleted.
    Example: immutable = \r\n\t\f makes the carriage return, line feed, tab, and form feed characters immutable.

5. Invalid or Duplicate Entries

  • Entries with invalid keys, missing values, or duplicates are ignored and logged as validation errors.
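
The formats above can be illustrated with a small parser. This is a minimal sketch, not Grooper's implementation: the function name parse_entries, the regular expressions, and the tuple-based rule keys are all hypothetical. It handles only global, single-character, and pairwise entries (character sets, ranges, and immutable entries are left out), and it assumes that the first of two duplicate entries is the one kept, which the source does not specify.

```python
import re

# Hypothetical patterns covering a subset of the documented key formats.
GLOBAL = re.compile(r"^(swap|insert|delete)$")
PER_CHAR = re.compile(r"^(swap|insert|delete|cost)\(\[(.)\]\)$")
PAIR = re.compile(r"^(?:swap\((.)(.)\)|(.)(.))$")


def parse_entries(lines):
    """Return (rules, errors). rules maps (operation, char, target) to a float cost."""
    rules, errors = {}, []
    for line in lines:
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not value:
            errors.append(f"missing value: {line!r}")
            continue
        try:
            cost = float(value)
        except ValueError:
            errors.append(f"non-numeric value: {line!r}")
            continue
        if m := GLOBAL.match(key):
            parsed = [(m.group(1), None, None)]
        elif m := PER_CHAR.match(key):
            # cost([A]) expands to all three operations for character A.
            ops = ("swap", "insert", "delete") if m.group(1) == "cost" else (m.group(1),)
            parsed = [(op, m.group(2), None) for op in ops]
        elif m := PAIR.match(key):
            a, b = (m.group(1), m.group(2)) if m.group(1) else (m.group(3), m.group(4))
            parsed = [("swap", a, b)]
        else:
            errors.append(f"invalid key: {line!r}")
            continue
        for rule in parsed:
            if rule in rules:
                # Assumption: the earlier entry wins and the later one is reported.
                errors.append(f"duplicate entry: {line!r}")
            else:
                rules[rule] = cost
    return rules, errors


rules, errors = parse_entries([
    "swap = 1.0",
    "swap([O]) = 0.5",
    "AB = 0.3",
    "swap(AB) = 0.4",  # same rule as the line above, so it is reported
    "insert =",        # missing value
    "replace = 1.0",   # not a recognized key
])
```

In this sketch, AB and swap(AB) normalize to the same rule key, which is how the duplicate is detected.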

Examples

swap = 1.0
insert = 1.2
delete = 1.2
swap([O]) = 0.5
delete([0-9]) = 2.0
AB = 0.3
swap([[AEIOU]], [[aeiou]]) = 0.2
immutable = \r\n\t\f

Impact of Weighting Entries

  • Lower costs make a particular operation (swap, insert, delete) more "acceptable" during matching, increasing the likelihood that matches with those differences will be found.
  • Higher costs penalize certain changes, making matches less likely unless the input is very similar to the pattern.
  • Pairwise costs let you make specific substitutions (e.g., O for 0, or l for 1) cheaper or more expensive, reflecting how often they occur as OCR or data entry errors.
  • Immutable characters require exact matches for certain characters, ensuring that critical document boundaries such as \t and \n can be matched reliably.
  • Global operation costs set the default penalty for all characters unless overridden by a more specific entry.
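
To make the effect of a lowered pairwise cost concrete, the following sketch computes a weighted edit distance with the classic Wagner-Fischer dynamic program. It is only an illustration of the general idea: the source does not describe how Grooper's fuzzy regular expression engine computes costs or similarity scores, and the names weighted_distance, uniform, and ocr_aware are hypothetical.

```python
def weighted_distance(pattern, text, swap_cost, insert_cost=1.0, delete_cost=1.0):
    """Minimum total edit cost to turn pattern into text (weighted Levenshtein)."""
    rows, cols = len(pattern) + 1, len(text) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + delete_cost   # delete every pattern character
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + insert_cost   # insert every text character
    for i in range(1, rows):
        for j in range(1, cols):
            p, t = pattern[i - 1], text[j - 1]
            sub = 0.0 if p == t else swap_cost(p, t)
            d[i][j] = min(
                d[i - 1][j - 1] + sub,          # match or substitution
                d[i - 1][j] + delete_cost,      # deletion
                d[i][j - 1] + insert_cost,      # insertion
            )
    return d[-1][-1]


def uniform(p, t):
    """Every substitution costs the same."""
    return 1.0


def ocr_aware(p, t):
    """Hypothetical weighting: O/0 confusion is cheap, everything else is full price."""
    return 0.25 if (p, t) in {("O", "0"), ("0", "O")} else 1.0


print(weighted_distance("ORDER", "0RDER", uniform))    # 1.0
print(weighted_distance("ORDER", "0RDER", ocr_aware))  # 0.25
print(weighted_distance("ORDER", "XRDER", ocr_aware))  # 1.0
```

With the cheaper O/0 substitution, the OCR-style misread accumulates a quarter of the penalty of an arbitrary substitution, so it is more likely to stay within whatever cost threshold a match must meet.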

How Entries Are Applied

  1. When evaluating a fuzzy match, the engine checks for the most specific applicable cost:
    • Pairwise substitution (e.g., AB)
    • Character-specific operation (e.g., swap([A]))
    • Global operation (e.g., swap)
  2. If no specific entry is found, the default base cost is used (typically 1.0).
  3. Immutable characters are never substituted, inserted, or deleted; attempts to do so are assigned the maximum cost.
  4. All costs are multiplied as needed by the base operation cost to produce the final penalty for each edit.
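
The lookup order in steps 1 to 3 can be sketched as a single function for the swap operation. This is an assumption-laden illustration: the function swap_cost and its parameters are hypothetical, math.inf stands in for the "maximum cost" that the source does not quantify, and the multiplication described in step 4 is not modeled because its details are not given.

```python
import math


def swap_cost(a, b, pairwise, per_char, global_swap=None,
              immutable=frozenset(), default=1.0):
    """Resolve the cost of substituting a with b, most specific rule first."""
    if a in immutable or b in immutable:
        return math.inf              # stand-in for the maximum cost
    if (a, b) in pairwise:
        return pairwise[(a, b)]      # pairwise entry, e.g. AB = 0.3
    if a in per_char:
        return per_char[a]           # character-specific, e.g. swap([O]) = 0.5
    if global_swap is not None:
        return global_swap           # global entry, e.g. swap = 0.75
    return default                   # no entry at all


pairwise = {("A", "B"): 0.3}
per_char = {"O": 0.5}

print(swap_cost("A", "B", pairwise, per_char, global_swap=0.75))  # 0.3
print(swap_cost("O", "X", pairwise, per_char, global_swap=0.75))  # 0.5
print(swap_cost("X", "Y", pairwise, per_char, global_swap=0.75))  # 0.75
print(swap_cost("X", "Y", pairwise, per_char))                    # 1.0
print(swap_cost("\t", "X", pairwise, per_char, immutable={"\t"}))  # inf
```

Checking immutability first reflects step 3: no cost entry can make an immutable character editable.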

Best Practices

  • Use lower costs for common, acceptable errors (e.g., a pairwise entry O0 for the letter O being misread as the digit 0 in OCR).
  • Use higher costs or immutability for critical fields (e.g., account numbers).
  • Avoid excessive complexity; only override defaults where necessary for your data.
  • Review validation errors to ensure all entries are correctly formatted and applied.
