Grooper Help - Version 25.0
25.0.0047

DI Analyze

Code Activity Grooper.Cloud

Analyzes document pages using Azure Document Intelligence to extract text, layout, style, and semantic elements.

Remarks

The DI Analyze activity leverages Azure AI Document Intelligence to recognize text, layout, style, and semantic elements on pages of a document and generate data describing the structure and content. The generated data is saved for use in downstream OCR and data extraction steps.

Role and Usage

DI Analyze is configured as a step in a Batch Process to automate document analysis. It submits pages or attachments to Azure Document Intelligence, retrieves structured results, and stores them for further processing. Users can select the Azure model, features, and content format to match their document types and extraction needs.

  • Supports both page-level and folder-level analysis, including attachments (PDF, TIFF, JPEG).
  • Results are saved as JSON files, enabling review and integration with Grooper's extraction pipeline.
  • Orientation correction can be enabled to automatically rotate pages based on detected layout, improving accuracy.
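
As an illustration of how orientation correction could be derived from a layout result, the following minimal sketch snaps a detected page angle to the nearest quarter turn. It assumes each page entry in the result carries `pageNumber` and `angle` fields, with `angle` measured clockwise in degrees, as in Azure Document Intelligence layout output. The function names are hypothetical and are not part of Grooper.

```python
def rotation_to_upright(angle_degrees: float) -> int:
    """Return the clockwise rotation (0, 90, 180 or 270) that uprights a page.

    Assumption: angle_degrees is the clockwise content angle reported for
    the page, in the range (-180, 180].
    """
    # Snap the detected angle to the nearest quarter turn.
    quarter_turns = round(angle_degrees / 90) % 4
    # Rotating by the opposite amount brings the content upright.
    return (-quarter_turns * 90) % 360


def page_rotations(analyze_result: dict) -> dict[int, int]:
    """Map each page number to the rotation that would upright it."""
    return {
        page["pageNumber"]: rotation_to_upright(page.get("angle", 0.0))
        for page in analyze_result.get("pages", [])
    }
```

Small skew angles (a degree or two) snap to zero in this sketch, so only pages that are rotated by roughly a quarter turn or more would be adjusted.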

Configuration Guidance

  • Choose the Azure model and features appropriate for your documents (e.g., prebuilt-layout for general use).
  • Set the content format to match your document set and extraction requirements.
  • Enable 'Correct Orientation' to adjust page rotation based on layout analysis.
  • Use 'Overwrite' to control whether previous results are replaced.
  • Prefer attachments for documents where the original file is the best source for extraction.
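
To show how the model, features, and content format choices relate to a request, here is a minimal sketch that assembles an analyze URL in the shape of the public Azure Document Intelligence REST API. The helper names, the default API version, and the `should_submit` rule for 'Overwrite' are assumptions made for illustration. Grooper builds and sends its own requests, and this is not its implementation.

```python
from urllib.parse import quote, urlencode


def build_analyze_url(endpoint: str,
                      model_id: str = "prebuilt-layout",
                      features: tuple = (),
                      content_format: str = "text",
                      api_version: str = "2024-11-30") -> str:
    """Assemble an analyze request URL (hypothetical helper)."""
    params = {"api-version": api_version}
    if features:
        # Optional add-on features are passed as one comma-separated value.
        params["features"] = ",".join(features)
    if content_format != "text":
        # Only sent when a non-default content format is requested.
        params["outputContentFormat"] = content_format
    base = endpoint.rstrip("/")
    model = quote(model_id, safe="")
    return f"{base}/documentintelligence/documentModels/{model}:analyze?{urlencode(params)}"


def should_submit(has_existing_result: bool, overwrite: bool) -> bool:
    """Assumed 'Overwrite' rule: skip objects that already have results."""
    return overwrite or not has_existing_result
```

For example, `build_analyze_url("https://example.cognitiveservices.azure.com", content_format="markdown")` yields a URL that targets `prebuilt-layout` and requests Markdown content. Check the parameter names against the Azure documentation for the API version you use.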

Running at the folder vs. page level

When determining which scope to run DI Analyze at, consider the following:

  • Page level: processing efficiency. Running at the page level lets a multithreaded Activity Processing service submit pages to the Document Intelligence service concurrently, instead of sending the entire document as a single task. This can significantly speed up processing of large, multipage documents.
  • Folder level: page-spanning structure awareness. When text structures such as tables or paragraphs span multiple pages, folder-level processing lets DI Layout-based operations (such as AI Extract) treat them as a single structure. If Data Table or Data Section extraction is breaking at page boundaries, running DI Analyze at the folder level may produce better results.
  • Hard limitation: separation is a page-level operation. To use DI Layout for AI Separate, you must run DI Analyze at the page level.
  • If unsure, we recommend starting at the page level. All DI Layout-based operations are available when DI data is present at the page level, giving you a reliable baseline. From there, you can test folder-level processing to weigh the tradeoffs between processing speed and extraction quality.
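
The tradeoff above can be sketched as follows. This is a simplified illustration, not Grooper's implementation: `analyze_stub` is a hypothetical stand-in for one call to the analysis service, and `choose_scope` only encodes the guidance in this section.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_stub(pages):
    """Hypothetical stand-in for a single request to the analysis service."""
    return {"pages": list(pages)}


def analyze_page_level(pages, max_workers=4):
    # One request per page, issued concurrently. map() returns results
    # in the same order as the input pages.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda page: analyze_stub([page]), pages))


def analyze_folder_level(pages):
    # One request for the whole document, so structures that span
    # pages are analyzed together.
    return analyze_stub(pages)


def choose_scope(uses_ai_separate: bool, structures_span_pages: bool) -> str:
    """Encode the guidance above; page level is the default baseline."""
    if uses_ai_separate:
        return "page"  # separation requires page-level DI data
    if structures_span_pages:
        return "folder"
    return "page"
```

Page-level analysis produces one result per page, while folder-level analysis produces a single result covering every page of the document.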

Diagnostics

Diagnostic artifacts generated by this activity include:

  • JSON result files for each analyzed object.
  • Markdown files containing extracted content.
  • HTML files for visual review.
  • Diagnostic images for lines, words, and paragraphs.

These artifacts support troubleshooting, review, and validation of extraction results.
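
When reviewing a JSON result file, a quick count of the recognized elements can help confirm that analysis produced what you expect. The sketch below assumes the file follows the Azure Document Intelligence result shape (`pages` with `lines` and `words`, plus a top-level `paragraphs` list, optionally wrapped in `analyzeResult`). The layout of the files Grooper saves may differ, and `summarize_result` is a hypothetical helper.

```python
import json


def summarize_result(result_json: str) -> dict:
    """Count pages, lines, words and paragraphs in a layout result."""
    doc = json.loads(result_json)
    # Accept either the full response or just its analyzeResult body.
    result = doc.get("analyzeResult", doc)
    pages = result.get("pages", [])
    return {
        "pages": len(pages),
        "lines": sum(len(page.get("lines", [])) for page in pages),
        "words": sum(len(page.get("words", [])) for page in pages),
        "paragraphs": len(result.get("paragraphs", [])),
    }
```

A page that reports zero words or lines is a reasonable first place to look when extraction results are missing.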

Properties

Name    Type    Description
Parameters
Options
Processing Options

Used By

Notification