Grooper Help - Version 25.0
25.0.0017 2,127

Text Preprocessor

Embedded Object Grooper.Core

Applies configurable text preprocessing to a document's content before regular expression extraction.

Remarks

The Text Preprocessor enables advanced manipulation of control characters in a document's text, allowing regular expressions to match or ignore structural elements such as line breaks, paragraph boundaries, page breaks, tabs, and spaces.

Overview

Text preprocessing is performed immediately before extraction, transforming the document's text to improve the accuracy and flexibility of pattern matching. This is especially useful when data values span multiple lines, are separated by large whitespace gaps, or are affected by inconsistent formatting.

Key Features

  • Paragraph Marking:
    Detects paragraph boundaries and converts line breaks within paragraphs to spaces, while preserving paragraph-ending breaks. This allows extractors to match values that span multiple lines within a paragraph, without matching across paragraph boundaries. See Paragraph Marker.

  • Tab Marking:
    Replaces large horizontal whitespace gaps with TAB characters, making it possible to distinguish between normal spaces and significant gaps in regular expressions. See Horizontal Tab Marker.

  • Vertical Tab Marking:
    Converts certain line breaks to vertical tab characters based on vertical spacing, enabling recognition of vertical structure in tabular or multi-column layouts. See Vertical Tab Marker.

  • Control Character Ignoring:
    Removes or replaces selected control characters (such as spaces, newlines, form feeds, and carriage returns) according to the 'Ignore Control Characters' setting. This can simplify extraction in documents with inconsistent or excessive whitespace.
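
As a rough illustration of the Tab Marking idea described above, the following Python sketch collapses narrow runs of spaces to a single space and replaces wide runs with a TAB character. It is not Grooper's implementation: the function name mark_tabs and the min_gap threshold (counted in characters of a fixed-width rendering) are assumptions made for this example only. How Grooper decides that a gap is "large" is governed by the Horizontal Tab Marker settings and is not reproduced here.

```python
import re


def mark_tabs(line, min_gap=12):
    """Replace runs of at least min_gap spaces with a tab; collapse shorter runs.

    min_gap is a hypothetical threshold chosen for this example only.
    """

    def replace_gap(match):
        return "\t" if len(match.group()) >= min_gap else " "

    # Runs of two or more spaces are candidate gaps; single spaces are left alone.
    return re.sub(r" {2,}", replace_gap, line.strip())


# A fixed-width rendering of one form row: an 11-space gap and a 19-space gap.
row = "Name:" + " " * 11 + "John Doe" + " " * 19 + "ID: 12345"
marked = mark_tabs(row)
print(repr(marked))
```

With the assumed threshold of 12, only the wider gap becomes a TAB, so a regular expression can treat \t as a field separator while ordinary spaces remain part of the values.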

Usage Guidance

  • Configure the desired preprocessing options by enabling or disabling paragraph, tab, and vertical tab marking, and by selecting which control characters to ignore.
  • Preprocessing is typically used in conjunction with regular expression-based extractors, but can benefit any extraction scenario where document structure affects pattern matching.
  • For best results, adjust preprocessing settings to match the structure and formatting of your source documents.

Example Scenarios

  • Extracting values that span multiple lines within a paragraph:
    Enable paragraph marking to convert internal line breaks to spaces, allowing regular expressions to match values split across lines.

  • Distinguishing between normal spaces and large gaps:
    Enable tab marking to insert TAB characters at significant horizontal gaps, so extractors can target fields separated by large whitespace.

  • Cleaning up unwanted whitespace or control characters:
    Use the 'Ignore Control Characters' option to remove or replace problematic characters that interfere with extraction.
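
The effect of ignoring control characters can be approximated outside Grooper with a few lines of Python. This is a sketch under stated assumptions, not the product's behavior: the helper ignore_characters, its parameters, and the sample text are hypothetical, and whether a given character is removed or replaced in Grooper depends on the 'Ignore Control Characters' setting.

```python
import re


def ignore_characters(text, ignored, replacement=""):
    """Replace every character found in `ignored` with `replacement`.

    An empty replacement removes the character. Hypothetical helper for
    illustration only.
    """
    return "".join(replacement if ch in ignored else ch for ch in text)


# An ID value broken up by stray spaces and followed by a line break.
noisy = "ID:  12 345\r\n"

# Ignore spaces, carriage returns, and newlines before matching.
cleaned = ignore_characters(noisy, ignored=" \r\n")
match = re.search(r"ID:(\d+)", cleaned)
print(repr(cleaned), match.group(1))
```

Once the stray whitespace is gone, a simple digits-only pattern is enough to capture the whole value.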

For more details, see the documentation for Paragraph Marker, Horizontal Tab Marker, and Vertical Tab Marker.

Examples

1. Sample Document

Consider the following sample document.

┌─────────────────────────────────────────────────────────────┐
│                        SAMPLE FORM                          │
├─────────────────────────────────────────────────────────────┤
│ Name:           John Doe                   ID: 12345        │
│ Date of Birth:  01/01/1980                 Status: Active   │
├─────────────────────────────────────────────────────────────┤
│ This is the first paragraph. It explains the purpose of     │
│ the form and the meaning of each field.                     │
│                                                             │
│ Please complete all fields and verify all personal          │
│ information before submitting. Thank you!                   │ 
└─────────────────────────────────────────────────────────────┘

2. Default Control Characters

With no preprocessing options enabled, the document text looks like this. Every whitespace gap, however wide, is represented by a single space character, and a \r\n pair marks each point where the original document breaks to a new line.

SAMPLE FORM\r\n
Name: John Doe ID: 12345\r\n
Date of Birth: 01/01/1980 Status: Active\r\n
This is the first paragraph. It explains the purpose of\r\n
the form and the meaning of each field.\r\n
Please complete all fields and verify all personal\r\n
information before submitting. Thank you!\r\n
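
The sketch below shows why this default form can be awkward for pattern matching. It uses Python's re module purely as a stand-in for a regular expression extractor, and the patterns are illustrative assumptions rather than recommended Grooper extractors.

```python
import re

# The unprocessed text from the listing above, with explicit \r\n line endings.
raw = (
    "SAMPLE FORM\r\n"
    "Name: John Doe ID: 12345\r\n"
    "Date of Birth: 01/01/1980 Status: Active\r\n"
    "This is the first paragraph. It explains the purpose of\r\n"
    "the form and the meaning of each field.\r\n"
    "Please complete all fields and verify all personal\r\n"
    "information before submitting. Thank you!\r\n"
)

# Only single spaces separate the fields, so a "rest of the line" capture
# for Name also swallows the ID field.
name = re.search(r"Name: ([^\r\n]+)", raw).group(1)

# The phrase is split by a line break, so a pattern with a literal space misses it.
phrase = re.search(r"personal information", raw)

print(repr(name))
print(phrase)
```

The Name capture runs on into the ID field, and the phrase search finds nothing, because the text carries no marker for the large gap and the line break sits in the middle of the phrase.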

3. Preprocessed Version

Preprocessing the document with paragraph marking and tab marking places a tab character '\t' at each large whitespace gap and replaces each newline pair '\r\n' occurring inside a paragraph with a space.

SAMPLE FORM\r\n
Name: John Doe\tID: 12345\r\n
Date of Birth: 01/01/1980\tStatus: Active\r\n
This is the first paragraph. It explains the purpose of the form and the meaning of each field.\r\n
Please complete all fields and verify all personal information before submitting. Thank you!\r\n
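
With the markers in place, simple patterns can isolate individual fields and match text that originally wrapped across lines. The following Python sketch works on selected lines of the preprocessed text; as before, the re module stands in for a regular expression extractor, and the patterns are assumptions chosen for illustration.

```python
import re

# Selected lines of the preprocessed text: \t marks a large gap, and the
# line break inside the second paragraph has become a space.
processed = (
    "SAMPLE FORM\r\n"
    "Name: John Doe\tID: 12345\r\n"
    "Date of Birth: 01/01/1980\tStatus: Active\r\n"
    "Please complete all fields and verify all personal information "
    "before submitting. Thank you!\r\n"
)

# Stopping at a tab or line break keeps the capture inside a single field.
name = re.search(r"Name: ([^\t\r\n]+)", processed).group(1)

# A leading \t anchors the pattern to a field that follows a large gap.
status = re.search(r"\tStatus: ([^\t\r\n]+)", processed).group(1)

# The phrase now sits on one line, so a literal space matches it.
phrase = re.search(r"personal information", processed)

print(repr(name), repr(status), phrase is not None)
```

Because each remaining \r\n ends either a form row or a whole paragraph, a character class that excludes \r and \n keeps a match from crossing those boundaries.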

Properties

Name    Type    Description

See Also

Used By

Notification