Grooper Help - Version 25.0
25.0.0017 2,127

Fuzzy Match Weightings

Embedded Lexicon Grooper.Core

Defines custom character-level edit costs for fuzzy matching, enabling precise control over how swaps, insertions, and deletions are evaluated during data extraction.

Remarks

The Fuzzy Match Weightings class overrides the default cost of individual character-level edits (swaps, insertions, and deletions) used by fuzzy matching. It is intended for OCR, data extraction, and document processing scenarios where specific character misrecognitions or typographical errors are expected.

Custom weightings tell the fuzzy matching engine to treat certain character edits as more or less significant, which improves both the likelihood of correct matches and the quality of automatic corrections. For example, you can reduce the cost of swapping visually similar characters (such as 'O' and '0', or 'B' and '8') to account for common OCR confusion, or make certain characters immutable so that they are never altered during matching.

How Weightings Work

Fuzzy matching operates by calculating the minimum "edit distance" between a document value and a pattern, where each edit operation (swap, insert, delete) has an associated cost. The total cost is then used to determine the match percentage. By default, all edit operations have a cost of 1.0, but Fuzzy Match Weightings allows you to override these defaults for specific characters or character pairs.

  • Swap: The cost to substitute one character for another.
  • Insert: The cost to insert a character present in the pattern but missing from the document.
  • Delete: The cost to remove a character present in the document but not in the pattern.
  • Immutable: Characters that cannot be edited (cost is infinite).
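
The cost model above can be illustrated with a weighted edit-distance calculation. The following is a minimal Python sketch, not Grooper's implementation: the function name `weighted_edit_cost` and its parameters are hypothetical, the pattern is treated as a literal string rather than a regular expression, swap weightings are assumed to apply in both directions, and immutable characters are assumed to block inserts as well as swaps and deletes.

```python
import math

def weighted_edit_cost(document, pattern, swap_costs=None, immutable="",
                       base_swap=1.0, base_insert=1.0, base_delete=1.0):
    """Minimum total cost of edits that turn `document` into `pattern`."""
    swap_costs = swap_costs or {}

    def delete_cost(ch):   # character in the document but not in the pattern
        return math.inf if ch in immutable else base_delete

    def insert_cost(ch):   # character in the pattern but missing from the document
        return math.inf if ch in immutable else base_insert

    def swap_cost(a, b):
        if a == b:
            return 0.0     # identical characters need no edit
        if a in immutable or b in immutable:
            return math.inf
        # Assumption: a pair weighting applies regardless of direction.
        return swap_costs.get(frozenset((a, b)), base_swap)

    rows, cols = len(document) + 1, len(pattern) + 1
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = cost[i - 1][0] + delete_cost(document[i - 1])
    for j in range(1, cols):
        cost[0][j] = cost[0][j - 1] + insert_cost(pattern[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d, p = document[i - 1], pattern[j - 1]
            cost[i][j] = min(cost[i - 1][j - 1] + swap_cost(d, p),
                             cost[i - 1][j] + delete_cost(d),
                             cost[i][j - 1] + insert_cost(p))
    return cost[-1][-1]

print(weighted_edit_cost("245B", "2458"))                            # 1.0
print(weighted_edit_cost("245B", "2458", {frozenset("B8"): 0.5}))    # 0.5
```

With no overrides, every edit costs 1.0, which corresponds to the default behavior described above. Supplying a lower cost for the pair 'B' and '8' halves the total cost of the same correction.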

Syntax for Defining Weightings

Weightings are defined using a simple lexicon syntax, with each entry specifying an operation, the target character(s), and the associated cost. See Fuzzy Match Cost Map for more details on the format of entries. The following general forms are supported:

  • Immutable=CharSet
    Marks all characters in CharSet as immutable. Example: Immutable=\r\n\t\f.

  • Swap(CharacterPair)=Cost
    Sets the cost for swapping a specific character pair. Example: Swap(S5)=0.25 or S5=0.25.

  • Swap([FromSet],[ToSet])=Cost
    Sets the cost for swapping any character in FromSet with any in ToSet. Example: Swap([;:'"=-],[\d])=2.

  • Swap(CharSet)=Cost
    Sets the cost for swapping any character in CharSet for a different character. Example: Swap(0-9)=1.5.

  • Delete(CharSet)=Cost
    Sets the cost for deleting any character in CharSet. Example: Delete(')=0.1.

  • Insert(CharSet)=Cost
    Sets the cost for inserting any character in CharSet. Example: Insert(.,)=0.50.

  • Swap=BaseCost, Delete=BaseCost, Insert=BaseCost
    Sets the default cost for all swap, delete, or insert operations. Default is 1.0 for each.
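
To make the entry forms concrete, here is a hedged Python sketch that reads a small subset of them into a dictionary. It is not Grooper's parser: the function `parse_weightings` and the returned structure are hypothetical, entries are assumed to be written one per line, and only base costs, `Immutable=CharSet` with the four escapes shown above, and the two-character pair shorthand (such as `S5=0.25`) are handled. Character-set forms such as `Swap(0-9)=1.5` are deliberately rejected because their exact semantics are defined in Fuzzy Match Cost Map.

```python
# Escapes that appear in the Immutable example; any others are left as-is.
ESCAPES = {"\\r": "\r", "\\n": "\n", "\\t": "\t", "\\f": "\f"}

def parse_weightings(text):
    """Parse a subset of weighting entries (one per line) into a dict."""
    weights = {
        "base": {"Swap": 1.0, "Delete": 1.0, "Insert": 1.0},  # documented defaults
        "immutable": set(),
        "pairs": {},
    }
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' so that '=' may appear on the left-hand side.
        key, sep, value = line.rpartition("=")
        if not sep:
            raise ValueError(f"missing '=' in entry: {line!r}")
        if key == "Immutable":
            for escape, char in ESCAPES.items():
                value = value.replace(escape, char)
            weights["immutable"].update(value)
        elif key in weights["base"]:
            weights["base"][key] = float(value)
        elif len(key) == 2:
            weights["pairs"][(key[0], key[1])] = float(value)
        else:
            raise ValueError(f"entry form not covered by this sketch: {line!r}")
    return weights

sample = "Immutable=\\t\\r\\n\\f\nSwap=1.5\nO0=0.25\nB8=0.5\n"
parsed = parse_weightings(sample)
print(parsed["base"]["Swap"])       # 1.5
print(parsed["pairs"][("B", "8")])  # 0.5
```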

Practical Guidance

  • Use lower costs for character pairs that are frequently confused (e.g., O0=0.25 for 'O' and '0').
  • Make control or formatting characters immutable to prevent accidental edits (e.g., Immutable=\r\n\t\f).
  • Increase costs for edits that should be discouraged or are unlikely in your data.
  • If no weightings are defined, all edit operations default to a cost of 1.0.

Example Scenario

Suppose a document contains the value 245B instead of 2458 due to an OCR error. By default, the single swap of 'B' for '8' costs 1.0, so the value matches a pattern like \d{4} at 75%. If you add the weighting B8=0.5, the same swap costs 0.5, the match percentage increases to 87.5%, and the fuzzy matching engine can automatically correct the value to 2458.
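
The percentages in this scenario are consistent with dividing the total edit cost by the number of characters matched and subtracting the result from 100%. The short Python sketch below reproduces the two figures under that assumption. The formula is inferred from the numbers above, and the function name `match_percentage` is hypothetical; Grooper's actual normalization may differ for other inputs.

```python
def match_percentage(total_cost, length):
    """Assumed scoring: each unit of edit cost removes 1/length of the score."""
    return 100.0 * (1.0 - total_cost / length)

# One swap at the default cost of 1.0 across four characters.
print(match_percentage(1.0, 4))  # 75.0

# The same swap weighted as B8=0.5.
print(match_percentage(0.5, 4))  # 87.5
```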

Sample Weightings Lexicon

Immutable=\t\r\n\f
Delete(')=0.10
Insert(\d)=2
Swap(['".,;:-],[\d])=2
O0=0.25
o0=0.25
Q0=0.25
D0=0.25
C0=0.25
c0=0.5
I1=0.25
i1=0.25
]1=0.25
l1=0.25
t1=0.25
L1=0.75
il=0.25
!1=0.25
?2=0.75
?7=0.8
Z2=0.5
B3=0.5
A4=0.5
S5=0.25
s5=0.25
G6=0.5
B8=0.25
g9=0.25

Best Practices

  • Review common OCR or data entry errors in your documents and adjust weightings accordingly.
  • Test your configuration with real-world samples to ensure optimal extraction accuracy.
