Grooper Help - Version 25.0

AI Collection Reader

Inherits from: AI Section Reader    Namespace: Grooper.GPT

Extracts a Section Instance Collection from a document using generative AI.

Remarks

The AI Collection Reader extends the capabilities of AI Section Reader to multi-instance Data Sections, which represent repeating records inside a document.

Note that it is also possible to extract multi-instance Data Sections using the AI Extract fill method. The main difference is that AI Collection Reader is optimized for processing large multi-page documents, which must be processed in chunks to avoid exceeding the context length of large language models (LLMs).
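The context-length motivation above can be illustrated with some back-of-the-envelope arithmetic. The 4-characters-per-token heuristic and the greedy page-packing below are illustrative assumptions for this sketch, not Grooper's actual chunking algorithm:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text.
    (A heuristic assumed for this sketch, not a Grooper setting.)"""
    return len(text) // 4

def plan_chunks(page_texts: list[str], token_budget: int) -> list[tuple[int, int]]:
    """Greedily group consecutive pages into chunks whose estimated token
    count stays under `token_budget`. Returns (start, end) page ranges."""
    chunks, start, used = [], 0, 0
    for i, page in enumerate(page_texts):
        t = estimate_tokens(page)
        if used + t > token_budget and i > start:
            chunks.append((start, i))   # close the current chunk
            start, used = i, 0
        used += t
    chunks.append((start, len(page_texts)))
    return chunks
```

With ten 400-character pages (about 100 tokens each) and a 250-token budget, this packs two pages per chunk, producing five chunks.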

How It Works

The extraction workflow for AI Collection Reader consists of several coordinated steps:

  1. Chunking Large Documents:
    If the target section is a collection and chunking is enabled, the document is divided into smaller segments (chunks) based on the configured chunk size (in pages). Each chunk is processed independently to ensure that the quoted content and prompt remain within the LLM's context window.

  2. Parallel Processing:
    Chunks are processed in parallel, up to the specified maximum degree of parallelism. This allows for efficient extraction from very large documents, reducing overall processing time and leveraging available system resources.

  3. Prompt Construction and LLM Completion:
    For each chunk, a prompt is constructed using the configured quoting method, extraction schema, and instructions. The prompt is sent to the LLM, which returns a JSON array of extracted section instances.

  4. Data Mapping:
    The returned JSON array is parsed and mapped to individual Section Instances within the collection. Each instance is imported and associated with its corresponding chunk of document content.

  5. Error Handling and Diagnostics:
    Any errors encountered during chunk processing (such as LLM failures or schema mismatches) are logged and reported. Diagnostic artifacts are generated for each chunk and for the overall extraction operation.

This approach enables reliable extraction from documents that would otherwise exceed LLM token limits, supports high-throughput processing, and ensures that multi-instance sections are accurately captured.
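The workflow above can be sketched as follows. This is a minimal illustration, not Grooper's implementation: `call_llm` is a hypothetical stand-in for the real LLM completion call, and the record shape it returns is invented so the flow can run end to end.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str, pages: list[str]) -> str:
    # Stub: a real call would send the prompt plus quoted page content
    # to the model and receive back a JSON array of section instances.
    return json.dumps([{"source_page": p} for p in pages])

def extract_chunk(chunk_pages: list[str]) -> list[dict]:
    # Step 3: build a prompt for this chunk and parse the JSON array reply.
    prompt = "Extract every repeating record as a JSON array."
    return json.loads(call_llm(prompt, chunk_pages))

def extract_collection(pages: list[str], chunk_size: int = 2,
                       max_parallelism: int = 4) -> list[dict]:
    # Step 1: divide the document into fixed-size page chunks.
    chunks = [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]
    # Step 2: process chunks in parallel, bounded by max_parallelism.
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        results = pool.map(extract_chunk, chunks)  # preserves chunk order
    # Step 4: map each returned record back to its originating chunk.
    instances = []
    for chunk_index, records in enumerate(results):
        for record in records:
            record["chunk"] = chunk_index
            instances.append(record)
    return instances
```

Because `ThreadPoolExecutor.map` yields results in submission order, the merged collection preserves document order even though chunks complete at different times.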

Configuration Guidance

  • Chunk Size:
    Set the chunk size to control how many pages are included in each extraction segment. Use smaller chunk sizes for very large documents or when LLM context limits are a concern.

  • Max Degree of Parallelism:
    Adjust this value to control how many chunks are processed simultaneously. Higher values increase throughput but may consume more system resources.

  • Section Type:
AI Collection Reader is used only for sections configured as collections. For single-instance sections, AI Section Reader is used automatically.
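As a rough way to reason about chunk size and parallelism together, the sketch below (an illustration, not Grooper code) estimates how many chunks a document produces and how many sequential "waves" of LLM calls result at a given degree of parallelism:

```python
import math

def processing_waves(page_count: int, chunk_size: int,
                     max_parallelism: int) -> tuple[int, int]:
    """Return (chunk_count, wave_count): how many chunks a document splits
    into, and how many sequential batches of LLM calls are needed when
    chunks run `max_parallelism` at a time."""
    chunk_count = math.ceil(page_count / chunk_size)
    wave_count = math.ceil(chunk_count / max_parallelism)
    return chunk_count, wave_count
```

For example, a 500-page document with a chunk size of 20 yields 25 chunks; at a parallelism of 4, those run in 7 waves. Smaller chunks lower the per-call context usage but raise the chunk count, so latency and throughput trade off against context headroom.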

Diagnostics and Logging

The following diagnostic artifacts are generated during extraction and can be reviewed for troubleshooting, validation, and optimization:

  • Schema.json: The JSON schema provided to the LLM for each extraction operation.
  • Response Data.json: The raw JSON response returned by the LLM for each chunk.
  • Chat Log.jsonl: The complete chat conversation for each chunk, including prompts and responses.
  • Operation Log Entries: Chronological logs of key steps, chunk counts, and errors.
  • Error Messages: Details of any errors encountered during chunk processing or data mapping.
  • Performance Timers: Timing data for chunk processing and overall extraction.

These diagnostics provide transparency into the extraction process and support prompt engineering, troubleshooting, and performance tuning.
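Because Chat Log.jsonl is line-delimited JSON, it is straightforward to post-process with ordinary tooling. The field names below (`role`, `content`) are an assumption about the log's shape, used only to illustrate line-per-object parsing:

```python
import json

def summarize_chat_log(jsonl_text: str) -> dict[str, int]:
    """Count entries per role in a JSONL chat log. Field names are an
    assumed shape for illustration; inspect your actual Chat Log.jsonl
    to see the real structure."""
    counts: dict[str, int] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        entry = json.loads(line)
        role = entry.get("role", "unknown")
        counts[role] = counts.get(role, 0) + 1
    return counts
```

A summary like this can quickly show, for example, whether every chunk's prompt received a response during a large extraction run.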

Usage Scenarios

  • Extracting line items from invoices, transaction logs, or repeating records from large documents.
  • Processing multi-page tables or collections that exceed LLM context limits.
  • Accelerating extraction for high-volume or resource-intensive workflows.

LLM Connector Requirement

This extractor requires a properly configured LLM Connector on the repository Root to communicate with the LLM service. Ensure the connector is set up in your environment before using this extractor.

Properties

Name | Type | Description
General
Chunking
Options

See Also

Used By

Notification