Grooper Help - Version 25.0
25.0.0017 2,127

OCR Cleanup

IP Command Grooper.IP

Performs temporary image cleanup to improve OCR results by removing non-text features.

Remarks

The OCR Cleanup command preprocesses images for OCR by removing content that typically interferes with text recognition, such as lines, barcodes, halftone patterns, specks, and checkboxes. It is designed for OCR workflows, where the cleanup is temporary and its only goal is to maximize text recognition accuracy.

The cleanup pipeline executes multiple stages:

  1. Halftone Removal: Detects and removes halftone patterns that can confuse OCR engines.
  2. Hole Punch Removal: Removes circular artifacts from hole punches.
  3. Small Speck Removal: Eliminates small isolated dots and noise.
  4. Barcode and Box Removal: Detects and removes barcodes and box-like features.
  5. Large Speck Detection: Identifies and removes larger specks using both morphological and watershed techniques.
  6. Text Segment Detection: Segments and preserves regions likely to contain text, protecting them from dropout.
  7. Line Removal: Removes horizontal and vertical lines that may interfere with text recognition.
  8. Checkbox Detection (optional): If enabled, detects and removes checkboxes, which are common in forms.
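
To make one of the stages above concrete, the following is a minimal Python sketch of small speck removal using connected components. It is an illustration only, not Grooper's implementation: the function name `remove_small_specks`, the `max_area` parameter, and the representation of a binarized image as a list of lists (1 = ink, 0 = background) are all assumptions made for this example.

```python
from collections import deque


def remove_small_specks(image, max_area=2):
    """Return a copy of a binary image with tiny ink blobs cleared.

    Hypothetical helper: a blob is a 4-connected group of ink pixels (1),
    and any blob with at most `max_area` pixels is treated as noise.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected component starting at (y, x).
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small components are assumed to be specks and are erased.
            if len(component) <= max_area:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# A 6-pixel "character" plus two isolated single-pixel specks.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
cleaned = remove_small_specks(page, max_area=2)
```

In this sketch the 6-pixel block survives while both single-pixel specks are cleared. A real pipeline would also need to protect small but meaningful marks such as periods and the dots on "i" and "j", which is presumably one reason the command segments and preserves text regions as a separate stage.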

The command applies a combination of binarization, dropout, and feature detection to isolate text. It supports separate binarization strategies for low-contrast images, automatically switching to more aggressive or adaptive thresholding when needed.
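
The idea of switching binarization strategy for low-contrast images can be sketched as follows. This is a hedged illustration and not the command's actual algorithm: it assumes a global Otsu threshold as the default strategy, a local-mean threshold as the low-contrast fallback, and a percentile spread as the contrast measure. The names `binarize`, `min_contrast`, `radius`, and `offset` are hypothetical and do not correspond to Grooper properties.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a list of 8-bit grayscale values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += histogram[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * histogram[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


def percentile_spread(pixels, low=0.05, high=0.95):
    """Contrast measure (assumed): spread between two percentiles."""
    ordered = sorted(pixels)
    last = len(ordered) - 1
    return ordered[int(high * last)] - ordered[int(low * last)]


def local_mean_binarize(image, radius=1, offset=5):
    """Mark a pixel as ink when it is darker than its neighborhood mean."""
    height, width = len(image), len(image[0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        mask.append(row)
    return mask


def binarize(image, min_contrast=50):
    """Return (ink mask, strategy name), switching strategy on low contrast."""
    pixels = [value for row in image for value in row]
    if percentile_spread(pixels) < min_contrast:
        return local_mean_binarize(image), "adaptive"
    threshold = otsu_threshold(pixels)
    return [[1 if v <= threshold else 0 for v in row] for row in image], "global"


high_contrast = [
    [230, 230, 230, 230],
    [230, 20, 20, 230],
    [230, 230, 230, 230],
]
low_contrast = [
    [200, 200, 200, 200],
    [200, 185, 185, 200],
    [200, 200, 200, 200],
]
high_mask, high_method = binarize(high_contrast)
low_mask, low_method = binarize(low_contrast)
```

Here the first image is binarized with the global threshold and the second with the local-mean fallback. The cutoff of 50 gray levels is arbitrary; in practice the equivalent decision is governed by the 'Binarization' and 'Low Contrast Binarization' settings.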

Configuration and Usage

  • Use 'Dropout Method' to control which non-text features are removed and how dropout regions are filled.
  • Configure 'Binarization' and 'Low Contrast Binarization' to optimize text preservation and feature removal for different image types.
  • Enable 'Detect Check Boxes' to remove checkboxes from forms and surveys.
  • Always review the cleaned image to ensure important content is not lost.

Supported Pixel Formats

  • Pixel8bppGrayscale
  • Pixel24bppBgr
  • Pixel1bppIndexed

Images are automatically converted as needed for processing.

Diagnostics

When run in diagnostic mode, OCR Cleanup generates diagnostic images and logs for each stage of the cleanup pipeline:

  • Binarized: Shows the effect of thresholding and preprocessing.
  • After Negated Region Removal: Displays the image after negative region cleanup.
  • Dropout Mask: Visualizes the regions that will be removed.
  • Log Messages: Reports feature detection, processing steps, and timing.

Use these diagnostics to fine-tune detection parameters and ensure that only the intended features are removed.
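
As a rough mental model of what a dropout mask represents, the sketch below builds a mask for long horizontal lines in a binarized image and then applies it. It is a simplified assumption about how such a mask could be computed, not a description of Grooper's internals; `horizontal_line_mask`, `apply_dropout`, `min_length`, and `fill` are hypothetical names.

```python
def horizontal_line_mask(image, min_length):
    """Return a mask (1 = drop out) covering horizontal ink runs of at
    least `min_length` pixels in a binary image (1 = ink, 0 = background)."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for y, row in enumerate(image):
        run_start = None
        # Iterate one step past the row end so a trailing run is closed.
        for x in range(width + 1):
            is_ink = x < width and row[x] == 1
            if is_ink and run_start is None:
                run_start = x
            elif not is_ink and run_start is not None:
                if x - run_start >= min_length:
                    for mx in range(run_start, x):
                        mask[y][mx] = 1
                run_start = None
    return mask


def apply_dropout(image, mask, fill=0):
    """Replace every masked pixel with the fill value (background here)."""
    return [
        [fill if masked else pixel for pixel, masked in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Three short vertical strokes sitting on a full-width form line.
form = [
    [0, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
mask = horizontal_line_mask(form, min_length=6)
cleaned = apply_dropout(form, mask)
```

In this example only the full-width line in the third row is masked and removed; the shorter runs above it are untouched. The sketch also shows why reviewing the mask matters: a naive mask removes the pixels where strokes touch the line, which is the kind of over-removal the diagnostic images help to catch.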

Notes

  • This command is intended for temporary cleanup in OCR workflows; it should not be used for archival image cleanup.
  • Use this command as an alternative to building a custom IP Profile for OCR preprocessing.
  • The cleanup pipeline is designed to maximize OCR accuracy by removing only those features that are likely to interfere with text extraction.
  • OCR Cleanup does not generate classification features directly, but the results can impact downstream OCR and extraction.

Properties

Name    Type    Description
General
Command Info

See Also

Used By

Notification