Grooper Help - Version 25.0
25.0.0023 2,165
  • Overview
  • Help Status

AI Schema Extractor

Value Extractor Grooper.GPT

Extracts structured data from documents using a large language model (LLM) guided by a user-defined JSON schema.

Remarks

The AI Schema Extractor enables advanced, schema-driven data extraction from unstructured or semi-structured documents by leveraging generative AI. It is designed for scenarios where precise, reliable, and repeatable extraction of structured data is required, such as tables, line items, or multi-field records.

This extractor interacts with a configured Data Generator to send document content and extraction instructions to an LLM. The LLM is prompted to return data that conforms to a user-supplied JSON schema, ensuring the output matches the required structure for downstream processing in Data Tables, Data Sections, or other data models.

How It Works

  1. Prompt Construction:
    The extractor builds a prompt for the LLM, including a quote from the document, the JSON schema, and any custom extraction instructions.
  2. LLM Invocation:
    The Data Generator sends the prompt to the LLM, requesting a response that matches the schema.
  3. Response Parsing:
    The extractor parses the returned JSON, optionally using a selector to isolate relevant data, and maps it into a hierarchy of Data Instances.
  4. Validation:
    The schema is validated at runtime. If the response does not match the schema, or if required fields are missing, extraction errors are reported.

Usage and Configuration

  • Use this extractor when you need highly structured output from AI, such as extracting tables, lists, or objects with multiple fields.
  • Configure the 'Generator' property to select the LLM model and output mode.
  • Provide a valid JSON schema in the 'JSON Schema' property to define the expected output structure.
  • Optionally, supply extraction instructions and a selector to fine-tune the extraction process.
  • Enable 'Parse JSON Response' to map the LLM's output into Data Instances for use in Data Tables or Data Sections.

Diagnostics and Logging

When diagnostics are enabled, the AI Schema Extractor logs key artifacts for each extraction operation:

  • Chat Log.jsonl:
    The full conversation with the LLM, including prompts and responses.
  • JSON Schema.json:
    The schema provided to the LLM for the current extraction.
  • Response Data.json:
    The raw JSON returned by the LLM before parsing.
  • LLM Completion Operation Timer:
    Timing information for the LLM request.
  • Error Logs:
    Details of any errors, retries, or parsing failures.

These diagnostics are accessible through Grooper's diagnostic tools and are essential for troubleshooting, prompt engineering, and validating extraction results.

When to Use

  • Extracting structured data from complex or variable documents.
  • Enforcing strict output formats for downstream automation.
  • Reducing post-processing by ensuring the LLM returns only the required fields.
  • Supporting advanced review, validation, and audit scenarios in Grooper.

For best results, provide a well-defined JSON schema and clear extraction instructions. Review diagnostic artifacts to refine prompts and troubleshoot extraction issues.

Properties

NameTypeDescription
General
Response

See Also

Used By

Notification