Grooper Help - Version 25.0
25.0.0017 2,127

Detect Language

Code Activity Grooper.GPT

Performs AI-based language detection using an LLM.

Remarks

The Detect Language activity uses a large language model (LLM) to analyze document content and determine the most appropriate ISO language or locale code. This lets Grooper automatically identify the language of documents or pages, which supports downstream activities such as classification, extraction, or export that may depend on language-specific configurations.

Usage

  • Use this activity when handling multilingual or regionally diverse document sets.
  • Run this activity after Recognize and before Extract in a Batch Process.
  • Running it ensures that currency, numeric, and date/time values are interpreted and displayed correctly.

How It Works

  • The activity operates on either Batch Folders or Batch Pages, depending on the processing scope.
  • For each item, it extracts text from both the beginning and end of the document, as specified by the 'Page Depth' property, to ensure a representative language sample is analyzed.
  • The extracted text is sent to the configured LLM, which is prompted to return either a two-letter ISO 639-1 language code (e.g., en, fr) or a full locale code (e.g., en-US, fr-FR), depending on whether 'Detect Locale' is enabled.
  • The detected code is stored in the CultureCode property of the processed node, making it available for later activities.
  • If the LLM does not return valid JSON or a code cannot be determined, an error is raised and logged.
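
The sampling and validation steps above can be sketched in Python. This is only an illustration under stated assumptions, not Grooper's implementation: the function names sample_pages and parse_reply, the reply shape {"code": "..."}, and the reading of 'Page Depth' as "this many pages from each end" are all hypothetical.

```python
import json
import re

# Assumed shapes of a valid code ("en" or "en-US"); not Grooper's actual rules.
LANGUAGE_RE = re.compile(r"[a-z]{2}")
LOCALE_RE = re.compile(r"[a-z]{2}-[A-Z]{2}")


def sample_pages(pages, page_depth):
    """Return the first and last `page_depth` pages without duplicating any page."""
    if page_depth <= 0:
        return []
    if len(pages) <= 2 * page_depth:
        # Short document: the head and tail windows overlap, so use every page once.
        return list(pages)
    return list(pages[:page_depth]) + list(pages[-page_depth:])


def parse_reply(reply, detect_locale):
    """Parse a JSON reply such as {"code": "fr-FR"} and validate the code."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError("LLM did not return valid JSON") from exc
    code = data.get("code") if isinstance(data, dict) else None
    pattern = LOCALE_RE if detect_locale else LANGUAGE_RE
    if not isinstance(code, str) or not pattern.fullmatch(code):
        raise ValueError("no language code could be determined")
    return code
```

Raising an error for both malformed JSON and an unusable code mirrors the behavior described in the last bullet, where the item fails instead of receiving a guessed value.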

Language and Locale Impact on Value Interpretation

Correct detection of a document's language and locale is essential for the accurate interpretation and display of extracted values, especially for numeric and date/time fields. Many Data Fields, Data Types, and extractors in Grooper use the detected CultureCode to determine how to parse, validate, and format values.

  • Numeric Values:
    The interpretation of decimal separators, thousands separators, currency symbols, and number formatting depends on the culture. For example, 1,234.56 in en-US is equivalent to 1.234,56 in de-DE. If the wrong culture is applied, numbers may be misread or rejected.

  • Date and Time Values:
    Date formats vary widely by locale (e.g., MM/dd/yyyy in en-US vs. dd/MM/yyyy in en-GB). The Storage Type DateTime and related extraction logic use the detected culture to parse and display dates correctly. Incorrect culture settings can result in invalid or misinterpreted dates.

  • Extraction and Validation:
    Text Match extractors and Data Types can use culture-aware regular expressions, merge variables, and input filters to match values in the appropriate language or format. The CultureCode ensures that extraction logic is applied only to relevant documents and that values are interpreted as intended.

  • Result Normalization and Output:
    Result Set Options and Storage Types use the culture to format output values for display, export, or downstream processing. This includes applying the correct date, time, and number formats, as well as language-specific normalization or validation rules.
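
To make the effect of culture on parsing concrete, here is a minimal Python sketch. It does not use Grooper's parsing logic; the CULTURES table and the parse_number and parse_date helpers are hypothetical and cover only the three cultures mentioned above.

```python
from datetime import datetime
from decimal import Decimal

# Illustrative conventions only; real culture data covers far more cases.
CULTURES = {
    "en-US": {"group": ",", "decimal": ".", "date": "%m/%d/%Y"},
    "en-GB": {"group": ",", "decimal": ".", "date": "%d/%m/%Y"},
    "de-DE": {"group": ".", "decimal": ",", "date": "%d.%m.%Y"},
}


def parse_number(text, culture_code):
    """Strip the culture's grouping separator, then normalize its decimal separator."""
    fmt = CULTURES[culture_code]
    normalized = text.replace(fmt["group"], "").replace(fmt["decimal"], ".")
    return Decimal(normalized)


def parse_date(text, culture_code):
    """Parse a date string using the culture's day/month ordering."""
    return datetime.strptime(text, CULTURES[culture_code]["date"]).date()


print(parse_number("1,234.56", "en-US"))  # 1234.56
print(parse_number("1.234,56", "de-DE"))  # 1234.56
print(parse_number("1.234,56", "en-US"))  # 1.23456 (silently misread)
print(parse_date("03/04/2025", "en-US"))  # 2025-03-04
print(parse_date("03/04/2025", "en-GB"))  # 2025-04-03
```

The third call shows why a wrong culture is dangerous: the value parses without error but is off by a factor of 1,000.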

> Best Practice:
> Always run Detect Language before extraction or validation steps that depend on culture, especially when working with multilingual or international document sets. This ensures that all downstream processing uses the correct language and locale context for accurate results.

Diagnostic Artifacts

The following diagnostic artifacts are generated and can be reviewed for troubleshooting or auditing:

  • Chat Log.jsonl: The full chat log of the conversation with the LLM for each detection operation.
  • Log Entries: Usage statistics and key decisions are written to the diagnostic log.
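
If the chat log is exported for review, it can be read with a few lines of Python. This sketch assumes only that the file follows the usual JSON Lines convention of one JSON object per line, as the .jsonl extension suggests; the structure of each entry is not documented here, so no field names are assumed. The read_jsonl helper is hypothetical.

```python
import json


def read_jsonl(path):
    """Return one parsed object per non-empty line of a JSON Lines file."""
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between entries
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_number} is not valid JSON") from exc
    return entries
```

Calling read_jsonl on an exported copy of the chat log returns a list that can be inspected entry by entry when troubleshooting a detection result.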

Notes and Considerations

  • The accuracy of detection depends on the quality and representativeness of the sampled text, as well as the capabilities of the selected LLM.
  • Using a larger 'Page Depth' may improve results for longer or mixed-language documents, but increases processing time and token usage.
  • Locale detection is recommended when downstream processes are sensitive to regional spelling, formatting, or dialect differences.
  • If an error occurs (such as no valid JSON returned), the activity will raise an exception and halt processing for the affected item.

Properties

Name    Type    Description
General
Processing Options
