Grooper Help - Version 25.0
25.0.0017 2,127

Tabular Layout

Table Extract Method Grooper.Extract

Detects the layout of a table automatically using header labels and value extractors.

Remarks

The Tabular Layout extract method is designed to extract tabular data from documents by automatically detecting table headers, rows, and footers using a combination of label sets, value extractors, and layout analysis. This method is ideal for structured tables where headers and rows are clearly defined, such as invoices, statements, and reports.

Overview

  • Tabular Layout uses Label Sets to identify table headers and footers, increasing accuracy in header/row detection.
  • It supports both single-line and multi-line headers, as well as tables with or without explicit footer rows.
  • The method is robust to variations in table formatting, including merged header cells and stacked or wrapped row content.
  • Tabular Layout can be configured to extract data from the footer row if needed, or to use the footer as a strict boundary.

How It Works

  1. Header Detection:

    • The 'Header Detection' property defines how table headers are found, using label sets and/or value extractors.
    • For best results, include a header label that covers all header lines, unless the table uses vertically centered multi-line labels that overlap on the Y axis.
    • Example of a suitable header:
      Quantity   Description   Unit     Extended
                               Price    Price
    • Avoid using a header label if header cells overlap vertically, as this can reduce detection accuracy.
  2. Row Detection:

    • The 'Row Detection' property controls how table rows are identified, using value extractors and layout analysis.
    • Rows are detected based on the presence of values in required columns and their alignment with header cells.
    • Multi-line rows are supported via the 'Multiline Rows' property, which can be enabled to capture wrapped or stacked content.
  3. Footer Detection:

    • The 'Footer Detection' property (or a footer label in the label set) is used to identify the end of the table.
    • When a footer is detected, extraction stops at the line above the footer, preventing false positives beyond the table's end.
    • The 'Capture Footer Row' property allows you to include the footer row as data if needed.
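
The three steps above can be pictured with a small Python sketch. It is a conceptual illustration only, not Grooper's implementation: the function name extract_rows, the (x, text) token representation, and the rule that a column spans from its label to the next label are all assumptions made for the example.

```python
def extract_rows(lines, header_labels, footer_label, required):
    """Conceptual sketch. `lines` is a list of lines, top to bottom;
    each line is a list of (x, text) tokens (hypothetical representation)."""
    # 1. Header detection: the first line that contains every header label.
    header_idx = next(
        (i for i, line in enumerate(lines)
         if set(header_labels) <= {text for _, text in line}),
        None,
    )
    if header_idx is None:
        return []

    # Assumed column geometry: each column runs from its label's x position
    # to the x position of the next label to the right.
    starts = sorted((x, text) for x, text in lines[header_idx]
                    if text in header_labels)
    bounds = [
        (name, x, starts[i + 1][0] if i + 1 < len(starts) else float("inf"))
        for i, (x, name) in enumerate(starts)
    ]

    rows = []
    for line in lines[header_idx + 1:]:
        # 3. Footer detection: stop at the line above the footer.
        if any(text == footer_label for _, text in line):
            break
        # 2. Row detection: align each token with a header column.
        row = {}
        for x, text in line:
            for name, lo, hi in bounds:
                if lo <= x < hi:
                    row[name] = (row.get(name, "") + " " + text).strip()
        # Keep the line only if every required column has a value.
        if all(row.get(col) for col in required):
            rows.append(row)
    return rows


page = [
    [(0, "Invoice"), (10, "1001")],
    [(0, "Qty"), (10, "Description"), (40, "Price")],
    [(0, "2"), (10, "Widget"), (40, "9.50")],
    [(10, "Gadget"), (17, "Pro"), (40, "3.00")],          # no Qty value
    [(0, "1"), (10, "Gadget"), (17, "Pro"), (40, "3.00")],
    [(30, "Total"), (40, "22.00")],                        # footer line
    [(0, "Thank"), (6, "you")],
]
rows = extract_rows(page, ["Qty", "Description", "Price"], "Total", ["Qty"])
```

In this sample, the line without a Qty value is skipped, and nothing at or below the "Total" line is read, which mirrors the footer acting as a strict boundary.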

Tabular Layout Options Extension

The Tabular Layout Options extension enables per-column configuration for extraction and validation.
Use this extension to:

  • Mark columns as required, ensuring that only rows with data in those columns are extracted.
  • Override default extraction behavior for specific columns, such as using OCR or OMR for certain data types.
  • Fine-tune validation and completeness rules for each Data Column.

For example, you might mark the "Amount" column as required to filter out incomplete rows, or set a column to use OMR for checkboxes.
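
The effect of a required column can be illustrated in Python. This is a sketch under stated assumptions, not the extension's actual API: the ColumnOptions class and the filter_complete_rows function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class ColumnOptions:
    """Hypothetical stand-in for per-column settings."""
    required: bool = False


def filter_complete_rows(rows, options):
    """Keep only rows with a non-empty value in every required column."""
    required = [name for name, opt in options.items() if opt.required]
    return [
        row for row in rows
        if all(row.get(name, "").strip() for name in required)
    ]


options = {
    "Description": ColumnOptions(),
    "Amount": ColumnOptions(required=True),
}
extracted = [
    {"Description": "Consulting", "Amount": "1,200.00"},
    {"Description": "Continued on next page", "Amount": ""},
    {"Description": "Travel", "Amount": "310.45"},
]
complete = filter_complete_rows(extracted, options)
```

Here the "Continued on next page" line is dropped because its Amount cell is empty, while the two rows with amounts are kept.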

Configuration Guidance

  • Always test header and row detection on multiple sample documents to ensure robust extraction.
  • Use label sets to improve header and footer detection, but avoid a header label when header cells overlap vertically.
  • Enable 'Multiline Rows' for tables with wrapped or stacked content.
  • Use the Tabular Layout Options extension to enforce data completeness and handle special column types.
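
To show what capturing wrapped content means in practice, the Python sketch below folds continuation lines into the row above them. The heuristic (a line with no value in an anchor column continues the previous row) and the name merge_wrapped_rows are assumptions made for illustration; the source does not describe how 'Multiline Rows' is implemented.

```python
def merge_wrapped_rows(rows, anchor):
    """Fold lines that lack a value in `anchor` into the preceding row.

    Assumed heuristic for illustration only.
    """
    merged = []
    for row in rows:
        if row.get(anchor) or not merged:
            # A value in the anchor column starts a new row.
            merged.append(dict(row))
        else:
            # Otherwise treat the line as wrapped content of the previous row.
            prev = merged[-1]
            for col, text in row.items():
                if text:
                    prev[col] = (prev.get(col, "") + " " + text).strip()
    return merged


lines = [
    {"Qty": "2", "Description": "Widget, large"},
    {"Description": "blue finish"},            # wrapped description text
    {"Qty": "1", "Description": "Gadget"},
]
table = merge_wrapped_rows(lines, anchor="Qty")
```

With this sample, the second line has no Qty value, so its text is appended to the Description of the first row and the result has two rows.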

Best Practices & Troubleshooting

  • If extraction includes unwanted rows or misses data, review header/row detection settings and label set configuration.
  • For tables with complex headers, experiment with different header label selections and value extractors.
  • Use the diagnostics output to review detected headers, rows, and footers for tuning.

The Tabular Layout method is highly flexible and can be adapted to a wide range of tabular document layouts.
For more information, see the documentation for Data Table, Data Column, and Tabular Layout Options.
