Grooper Help - Version 25.0
25.0.0017

Value Extractor

Embedded Object Grooper.Core

Value extractors are primitive operators which read data values from the text or visual content of a document.

Remarks

Value Extractor is the abstract base class for all primitive data extraction operators in Grooper. Value extractors form the foundation of Grooper's data extraction subsystem, enabling the recognition and extraction of basic elements such as dates, numbers, entity names, barcodes, checkboxes, labels, paragraphs, and more.

Overview

A Value Extractor operates on the content of a document (or a subset of it) and produces a list of one or more extracted values. In practice, it takes a Data Instance as input and returns a list of Data Instances as output. Because the input and output share the same type, the output of one extractor can be used as the input to another, which supports recursive and hierarchical extraction strategies, down to individual fields and table cells.
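
The input/output contract described above can be illustrated with a small conceptual sketch. This is not Grooper's API: `DataInstance`, `pattern_extractor`, and `chain` are hypothetical names, and a plain regular expression stands in for whatever recognition an actual extractor performs. The sketch only shows why an operator that maps one instance to a list of instances composes naturally with another.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DataInstance:
    """Hypothetical stand-in for a Data Instance: a span of text and its offset."""
    text: str
    start: int = 0


# Assumption: an extractor is modeled as a function from one instance
# to a list of instances.
Extractor = Callable[[DataInstance], List[DataInstance]]


def pattern_extractor(pattern: str) -> Extractor:
    """Build an extractor that returns one instance per regex match."""
    regex = re.compile(pattern)

    def extract(source: DataInstance) -> List[DataInstance]:
        # Offsets are kept relative to the original document text.
        return [
            DataInstance(m.group(0), source.start + m.start())
            for m in regex.finditer(source.text)
        ]

    return extract


def chain(outer: Extractor, inner: Extractor) -> Extractor:
    """Run `inner` on every instance produced by `outer`."""

    def extract(source: DataInstance) -> List[DataInstance]:
        results: List[DataInstance] = []
        for parent in outer(source):
            results.extend(inner(parent))
        return results

    return extract


document = DataInstance("Invoice Date: 2024-03-15\nDue Date: 2024-04-14")
lines = pattern_extractor(r"[^\n]+")            # one instance per line
dates = pattern_extractor(r"\d{4}-\d{2}-\d{2}")  # ISO-style dates
hits = chain(lines, dates)(document)

for hit in hits:
    print(hit.start, hit.text)
```

Because each result carries its offset into the original text, a chained extractor can still report where in the document a value was found, even after several levels of narrowing.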

Value extractors are consumed by higher-level objects such as Data Elements, Data Types, or Field Classes. These higher-level objects may leverage multiple value extractors to recognize complex entities such as label-value pairs, address blocks, table rows, sections, and hierarchies.

Value extractors also play a role in document classification, where they are used to extract lexical features such as words, bi-grams, trigrams, titles, and data types.
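
As a rough illustration of the lexical features mentioned above, the following sketch produces words, bi-grams, and trigrams from a line of text. The function name `word_ngrams` and the tokenization rule are assumptions made for this example; they do not describe how Grooper tokenizes text or builds classification features.

```python
import re
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Return all runs of n consecutive words, lowercased and space-joined."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "Purchase Order Number"
unigrams = word_ngrams(sample, 1)
bigrams = word_ngrams(sample, 2)
trigrams = word_ngrams(sample, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```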

Key Features

  • Primitive Extraction:
    Provides the building blocks for extracting simple and complex data elements from document content.
  • Composable and Reusable:
    Value extractors can be chained or referenced by other extractors and data elements, supporting modular extraction logic.
  • Integration with Data Model:
    Value extractors are referenced by Data Fields, Data Columns, and other data elements to define how values are read during extraction.

Usage Examples

  • Direct Assignment:
    Assign a Pattern Match or List Match extractor directly to a Data Field to extract values using regular expressions or value lists.
  • Reference Extraction:
    Use a Reference extractor to reuse extraction logic across multiple fields or data elements.
  • OMR and Barcode Extraction:
    Use Labeled OMR to read checkboxes, or Read Barcode to read barcodes, from document images.
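
The first two usage patterns above (matching by regular expression or by value list, and reusing one definition from several fields) can be sketched conceptually as follows. None of these names come from Grooper: `pattern_match`, `list_match`, `reference`, and the `shared` registry are hypothetical, and the sketch works on plain strings rather than Data Instances.

```python
import re
from typing import Callable, Dict, List

# Assumption: an extractor is modeled as a function from text to matched values.
Extractor = Callable[[str], List[str]]


def pattern_match(pattern: str) -> Extractor:
    """Return every substring that matches a regular expression."""
    regex = re.compile(pattern)
    return lambda text: [m.group(0) for m in regex.finditer(text)]


def list_match(terms: List[str]) -> Extractor:
    """Return every whole-word occurrence of any term in a value list."""
    # Longer terms are tried first so a multi-word term is not cut short.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    regex = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return lambda text: [m.group(0) for m in regex.finditer(text)]


# Extraction logic defined once, in one place.
shared: Dict[str, Extractor] = {
    "us_date": pattern_match(r"\b\d{2}/\d{2}/\d{4}\b"),
    "state": list_match(["Oklahoma", "Texas", "New Mexico"]),
}


def reference(name: str) -> Extractor:
    """Delegate to a shared extractor, looked up each time it runs."""
    return lambda text: shared[name](text)


# Two fields reuse the same date logic; a third uses the value list.
invoice_date = reference("us_date")
ship_date = reference("us_date")
ship_state = reference("state")

text = "Shipped 03/15/2024 from Tulsa, Oklahoma to Santa Fe, New Mexico."
print(invoice_date(text))
print(ship_state(text))
```

Looking up the shared definition at call time, rather than copying it, means a change to the shared extractor reaches every field that references it.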

Notes

  • Value extractors are typically configured as properties of Data Fields, Data Columns, or other data elements.
  • The output of a Value Extractor is always a collection of Data Instances, which may be further processed or validated by the data model.
  • For more details, see the documentation for each derived extractor type.

Derived Types

There are 24 implementations of Value Extractor.

  • AI Column Extractor:
    Extracts structured content from documents with two-column layouts.
  • Ask AI:
    Executes a completion using a large language model (LLM) and returns one hit for each choice in the response.
  • Detect Signature:
    Detects a signature within a specified region of a document page by measuring the percentage of the area that is filled.
  • Entity Recognition:
    Identifies and categorizes entities such as people, organizations, locations, and quantities in unstructured text.
  • Field Match:
    Matches the value stored in a previously extracted field or table column.
  • Find Barcode:
    Searches for barcode values in document Layout Data previously detected during image processing.
  • Highlight Zone:
    Defines a region of a document to be visually highlighted, without extracting any data values.
  • Key Phrase Extraction:
    Identifies key concepts and topics in text using Azure AI Language key phrase extraction.
  • Label Match:
    Matches a list of one or more label values, using matching options defined by a Labeling Behavior.
  • Labeled OMR:
    Reads a group of one or more checkboxes located near text labels.
  • Labeled Value:
    Extracts a field presented as a label-value pair within a document, associating labels and values based on their spatial relationship.
  • List Match:
    Extracts values from document text that match any entry in a list of search terms.
  • Ordered OMR:
    Reads one or more checkboxes with a consistent order of appearance inside a rectangular region.
  • Pattern Match:
    Extracts values from document text that match a specified regular expression pattern.
  • Pii Entity Recognition:
    Identifies, categorizes, and redacts sensitive information (PII) in unstructured text using Azure AI Language Services.
  • Query HTML:
    Extracts values from an HTML document using a CSS or XPath selector.
  • Query XML:
    Extracts values from XML documents using XPath queries.
  • Read Barcode:
    Extracts barcode values from document images using configurable barcode recognition.
  • Read Metadata:
    Reads a metadata value from a document by accessing a property on an attachment or content link.
  • Read Zone:
    Extracts text content from a specified rectangular region (zone) of a document.
  • Reference:
    Delegates extraction to another configured extractor, enabling reuse and centralization of extraction logic.
  • Select Page:
    Selects and outputs the full content of one or more pages from a document, based on page number and/or content criteria.
  • Word Match:
    Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.
  • Zonal OMR:
    Reads one or more checkboxes using manually configured zones.
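
The Detect Signature entry above describes a fill-percentage test. The sketch below shows one plausible reading of that idea on a tiny binary image. It is an assumption-laden illustration, not Grooper's implementation: the names `fill_percentage` and `has_signature`, the region convention, and the `min_fill` threshold are all hypothetical.

```python
from typing import List, Tuple

# Assumption: region is (left, top, right, bottom) in pixels,
# with right and bottom exclusive.
Region = Tuple[int, int, int, int]


def fill_percentage(pixels: List[List[int]], region: Region) -> float:
    """Percentage of pixels in the region that are ink (1) rather than background (0)."""
    left, top, right, bottom = region
    area = (right - left) * (bottom - top)
    if area <= 0:
        raise ValueError("region must have positive area")
    filled = sum(
        pixels[y][x]
        for y in range(top, bottom)
        for x in range(left, right)
    )
    return 100.0 * filled / area


def has_signature(pixels: List[List[int]], region: Region, min_fill: float = 5.0) -> bool:
    """Treat the region as signed when enough of it is filled (hypothetical threshold)."""
    return fill_percentage(pixels, region) >= min_fill


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
signature_box = (1, 1, 5, 3)  # 4 pixels wide, 2 pixels tall
blank_box = (0, 3, 6, 4)      # bottom row, no ink

print(fill_percentage(page, signature_box))
print(has_signature(page, blank_box))
```

A single threshold cannot tell a signature from any other mark in the region, so a test like this answers "is something there" rather than "who signed".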

Used By

Notification