Grooper Help - Version 25.0
25.0.0017 2,127

Horizontal Tab Marker

Embedded Object Grooper.Core

Detects and inserts tab characters into text based on whitespace gaps, font size changes, or document layout features such as vertical lines and underlines.

Remarks

The Horizontal Tab Marker class is used to identify locations in text where a tab character (\t) should be inserted, typically to represent columnar or tabular structure in extracted document content.

Overview

Horizontal Tab Marker analyzes the spacing between words and other layout cues to determine where tabs should be placed. It is commonly used in text preprocessing to convert visually separated columns or fields into a tab-delimited format, making downstream data extraction and parsing more reliable.

How It Works

The algorithm measures the gap between each pair of adjacent words and decides, gap by gap, whether to insert a tab character:

  • The text is split into word instances.
  • For each pair of adjacent words, the gap between them is measured.
  • A set of rules is evaluated to determine if the gap qualifies for tab insertion. These rules include:
    • Whitespace Gaps: If the space between two words meets or exceeds the configured 'Minimum Tab Width', it is replaced with a tab character.
    • Relative to Text Height: Optionally, gaps can be evaluated as a percentage of the average character height using the 'Character Size Ratio' property.
    • Font Size Changes: If the font size changes between adjacent words by more than the 'Font Size Threshold', a tab may be inserted.
    • Vertical Lines: When enabled via 'Detection Options', vertical lines in the document layout can trigger tab insertion at their intersection with text.
    • Underlines: When underline detection is enabled, tabs are suppressed for whitespace gaps that are underlined, supporting fill-in-the-blank scenarios.
  • If the gap meets any of the insertion criteria and is not suppressed by underline detection, the whitespace is replaced with a tab character.

The result is text in which visually separated columns or fields are delimited by tabs, which makes downstream data extraction and parsing more reliable.
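The whitespace-gap rule can be sketched in a few lines of Python. This is an illustration only, not Grooper's implementation: the `Word` tuple, its inch-based `left` and `right` coordinates, the sample positions, and the `mark_tabs` function are all hypothetical, and only the 'Minimum Tab Width' rule is modelled.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A word with horizontal bounds (hypothetical structure, units assumed to be inches)."""
    text: str
    left: float
    right: float


def mark_tabs(words: list[Word], minimum_tab_width: float) -> str:
    """Join the words of one line, using a tab wherever the gap is wide enough."""
    if not words:
        return ""
    parts = [words[0].text]
    for prev, curr in zip(words, words[1:]):
        gap = curr.left - prev.right  # horizontal whitespace between the two words
        parts.append("\t" if gap >= minimum_tab_width else " ")
        parts.append(curr.text)
    return "".join(parts)


# Made-up word positions for a line with one large gap after "DOE".
line = [
    Word("PATIENT", 0.50, 1.10),
    Word("NAME:", 1.18, 1.60),
    Word("JOHN", 1.68, 2.05),
    Word("DOE", 2.13, 2.40),
    Word("INTAKE", 4.00, 4.55),
    Word("DATE:", 4.63, 5.05),
    Word("01/01/2019", 5.13, 5.95),
]
marked = mark_tabs(line, minimum_tab_width=0.25)
```

With these sample positions, only the 1.6 inch gap between DOE and INTAKE meets the 0.25 inch threshold, so it is the only gap that becomes a tab.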

Configuration Guidance

  • Set 'Minimum Tab Width' to control the minimum gap size (in inches) that qualifies for tab insertion.
  • Use 'Character Size Ratio' to enable gap detection relative to text height, which is useful for documents with variable font sizes.
  • Adjust 'Font Size Threshold' to trigger tabs on significant font size changes, helping to separate fields with different formatting.
  • Use 'Detection Options' to enable or disable vertical line and underline detection as needed for your document layout.
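To see why a height-relative threshold helps with variable font sizes, here is a minimal Python sketch. It assumes that the threshold is simply the ratio multiplied by the text height; the function name `is_tab_gap`, its parameters, and the sample values are hypothetical, and Grooper's actual calculation may differ.

```python
def is_tab_gap(gap: float, char_height: float, character_size_ratio: float) -> bool:
    """True when the gap is at least `character_size_ratio` times the text height.

    All measurements are assumed to share one unit (inches here).
    """
    return gap >= character_size_ratio * char_height


# The same 0.30 inch gap judged next to small and large text.
small_text = is_tab_gap(0.30, char_height=0.12, character_size_ratio=2.0)  # threshold 0.24
large_text = is_tab_gap(0.30, char_height=0.40, character_size_ratio=2.0)  # threshold 0.80
```

Under this assumption the gap counts as a column break beside small body text but as ordinary word spacing inside a large heading, which a single fixed width cannot express.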

Example 1: Field Extraction with Large Whitespace Gap

For example, consider a document region containing two field values with a large whitespace gap in between, like this:

PATIENT NAME: JOHN DOE                    INTAKE DATE: 01/01/2019

When text is extracted without tab marking, the large gap is represented as a single space, making it difficult to determine where one field ends and the next begins. If you use an extractor with a pattern like PATIENT NAME: [A-Z ]+, it will overmatch and return "JOHN DOE INTAKE DATE" instead of just "JOHN DOE", because the input data looks like this:

PATIENT NAME: JOHN DOE INTAKE DATE: 01/01/2019

By enabling tab marking, the large gap is replaced with a tab character. Now an extractor looking for PATIENT NAME: [A-Z ]+ will match only JOHN DOE, because the character class [A-Z ] does not include the tab character and the match stops there:

PATIENT NAME: JOHN DOE\tINTAKE DATE: 01/01/2019
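The difference can be reproduced with Python's `re` module. This is a sketch for illustration: Grooper extractors use their own regular expression engine, and the capture group around `[A-Z ]+` is added here only to isolate the name.

```python
import re

# Same pattern as in the example, with a capture group added around the name.
pattern = re.compile(r"PATIENT NAME: ([A-Z ]+)")

unmarked = "PATIENT NAME: JOHN DOE INTAKE DATE: 01/01/2019"
marked = "PATIENT NAME: JOHN DOE\tINTAKE DATE: 01/01/2019"

without_tabs = pattern.search(unmarked).group(1)  # runs on until the ':' after DATE
with_tabs = pattern.search(marked).group(1)       # stops at the tab character
```

Here `without_tabs` is "JOHN DOE INTAKE DATE" and `with_tabs` is "JOHN DOE".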

Example 2: Table Row with Multiple Columns

Consider a table row in a document where columns are separated by large whitespace gaps:

Name                    State                    Age

Without tab marking, the extracted text may look like:

Name State Age

This makes it difficult to reliably extract each column value. With tab marking enabled, the output will be:

Name\tState\tAge

Now, each value is clearly separated by a tab character, making column-based extraction straightforward and robust.
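As a rough illustration of column-based extraction, the tab-marked row can be split on the tab character. The data row below is hypothetical and is included to show that multi-word values stay intact, which splitting on any whitespace would not guarantee.

```python
header = "Name\tState\tAge"
data_row = "Jane Doe\tNew York\t42"  # hypothetical row, not from the source document

columns = header.split("\t")
values = data_row.split("\t")

# Splitting on any whitespace instead would break the multi-word values apart.
naive_values = data_row.split()

record = dict(zip(columns, values))
```

Splitting on tabs yields three values that line up with the three column names, while the whitespace split yields five fragments.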

Notes

  • Horizontal Tab Marker is typically used as part of a text preprocessing pipeline before data extraction.
  • Proper configuration of tab detection options is essential for accurate column and field separation, especially in documents with complex layouts.
  • For more information on related concepts, see Data Instance, Document Instance, and TabOptions.

Properties

Name    Type    Description

Used By

Notification