Grooper Help - Version 25.0
25.0.0017 2,127

Paragraph Marker

Embedded Object Grooper.Core

Detects and marks paragraph boundaries in natural language documents to improve data extraction from paragraph flow text.

Remarks

The Paragraph Marker is a text preprocessing component used to identify paragraph boundaries in documents, especially those containing natural language text. By marking paragraphs, it enables more accurate extraction of data that may span multiple lines within a paragraph, while preserving true paragraph breaks.

Purpose

Paragraphs in documents often wrap across multiple lines, causing data values to be split by line breaks (CR/LF). This can make it difficult for extractors to match values that span lines, as standard extraction logic may not account for embedded line breaks within paragraphs.

The Paragraph Marker solves this by detecting paragraph boundaries and converting line breaks inside paragraphs to spaces, while leaving the line break at the end of each paragraph intact. This produces a normalized text flow, making it easier to extract values that span lines.

How It Works

The Paragraph Marker processes the text of a document by analyzing each line and determining whether it should be joined with the previous line or treated as the start of a new paragraph.

The main algorithm works as follows:

  • The text is split into lines.
  • For each line, a set of rules is evaluated to determine if it is the start of a new paragraph. These rules include:
    • Line width (absolute and relative to the widest line)
    • Presence of large horizontal or vertical gaps
    • Indentation changes
    • Custom bullet or pattern matches (using 'Paragraph Break Rule')
    • Detection options such as bullets, double spacing, and underlines
  • If a line is determined to be a paragraph start, the previous paragraph is finalized, and a new paragraph begins.
  • Line breaks within paragraphs are replaced with spaces, while true paragraph breaks are preserved as CR/LF pairs.

This approach ensures that wrapped lines within a paragraph are merged for extraction, while true paragraph boundaries are maintained for downstream processing.
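
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not Grooper's implementation. It works on plain text only, so it approximates "line width relative to the widest line" by character count, treats a blank line as a vertical gap, and ignores indentation, horizontal gaps, and underlines. The names mark_paragraphs, wrap_threshold, and BULLET are hypothetical and do not correspond to Grooper properties.

```python
import re

# Hypothetical bullet pattern, standing in for a custom paragraph break rule.
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")

def mark_paragraphs(text, wrap_threshold=0.8):
    """Join wrapped lines with spaces; separate paragraphs with one CR/LF."""
    lines = text.splitlines()
    widest = max((len(line.rstrip()) for line in lines), default=0)
    paragraphs = []     # finished paragraphs
    current = []        # lines of the paragraph being built
    prev_short = False  # did the previous line end well before the margin?

    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line acts as a vertical gap and ends the paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            prev_short = False
            continue
        # A new paragraph starts after a short line or at a bullet.
        is_bullet = BULLET.match(line) is not None
        if current and (prev_short or is_bullet):
            paragraphs.append(" ".join(current))
            current = []
        current.append(stripped)
        prev_short = len(line.rstrip()) < wrap_threshold * widest

    if current:
        paragraphs.append(" ".join(current))
    return "\r\n".join(paragraphs)
```

A line that stops well short of the widest line is assumed to end its paragraph, which is why the next line starts a new one. Real documents need the additional layout cues listed above, since character counts alone cannot distinguish a short final line from a narrow column.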

Example

Consider the following paragraph, where the effective date is split across two lines:

This agreement, to be effective February
1, 1988, is executed on January 15, 1988.

Without paragraph marking, an extractor searching for "February 1, 1988" would not find a match, because a line break (\r\n) rather than a space follows 'February'. With paragraph marking enabled, the text is normalized to:

This agreement, to be effective February 1, 1988, is executed on January 15, 1988.

Now, extractors can reliably match values that span lines within a paragraph, without overmatching across true paragraph boundaries.
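
The effect can be reproduced outside Grooper with an ordinary regular expression. The sketch below uses Python's re module and a hypothetical date pattern. The single str.replace call stands in for paragraph marking on this one-paragraph sample and is not how Grooper performs the normalization.

```python
import re

raw = ("This agreement, to be effective February\r\n"
       "1, 1988, is executed on January 15, 1988.")

# Stand-in for paragraph marking: the sample is a single paragraph,
# so every line break in it is a wrap and becomes a space.
normalized = raw.replace("\r\n", " ")

# Hypothetical extractor pattern: month name, space, day, comma, year.
date_pattern = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")

raw_matches = date_pattern.findall(raw)
normalized_matches = date_pattern.findall(normalized)

print(raw_matches)         # ['January 15, 1988']
print(normalized_matches)  # ['February 1, 1988', 'January 15, 1988']
```

The pattern expects a literal space between the month and the day, so the wrapped date is missed in the raw text and found once the line break has become a space.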

Configuration Guidance

  • Use the 'Minimum Line Width' and 'Line Wrap Threshold' properties to control how paragraph boundaries are detected based on line length.
  • Adjust 'Maximum Horizontal Gap' and 'Line Spacing Limit' to fine-tune detection for documents with variable spacing or formatting.
  • Enable detection options such as bullets, double spacing, or underlines to handle specialized paragraph structures.
  • Use the 'Paragraph Break Rule' property to define custom logic for identifying the start of new paragraphs, such as custom bullet formats.
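
As an illustration of the kind of pattern a custom break rule might encode, the sketch below flags lines that begin with lettered or roman-numeral markers or a "Section n.n" label. It is written as a Python regular expression purely for demonstration. This topic does not show how the 'Paragraph Break Rule' property is actually configured, so the pattern and the names CUSTOM_BREAK and is_paragraph_start are assumptions.

```python
import re

# Hypothetical custom markers: "(a)", "(iv)", or "Section 4.1" at line start.
CUSTOM_BREAK = re.compile(
    r"^\s*(?:\([a-z]\)|\([ivx]+\)|Section \d+(?:\.\d+)*)\s"
)

def is_paragraph_start(line):
    """Return True if the line begins with one of the custom markers."""
    return CUSTOM_BREAK.match(line) is not None

lines = [
    "(a) The Lessee shall pay rent monthly,",
    "in advance, on the first day.",
    "(ii) Late payments accrue interest.",
    "Section 4.1 Termination rights.",
]
starts = [is_paragraph_start(line) for line in lines]
print(starts)  # [True, False, True, True]
```

Only the second line is treated as a continuation, so it would be joined to the "(a)" line while the remaining lines each open a new paragraph.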

Usage Notes

  • Paragraph Marker is typically used as part of a text preprocessing pipeline before data extraction.
  • Detection accuracy depends on configuration. Documents with complex or inconsistent formatting usually need the line width, gap, and spacing settings tuned before paragraphs are detected reliably.
  • For more information on related concepts, see Data Instance, Document Instance, and Value Extractor.
