Grooper Help - Version 25.0
25.0.0017 2,127

Lexical

ESP Classify Method Grooper.Core

Provides a classification method that assigns document types based on text and/or image features, supporting both training-based and rules-based approaches.

Remarks

The Lexical classification method enables Grooper to identify document types by analyzing the textual and visual content of documents. It supports both rules-based and training-based classification, allowing for flexible configuration to match a wide range of document sets.

Overview

  • Lexical classification can use hand-coded rules, machine learning from training samples, or a combination of both.
  • Rules-based classification uses 'Positive Extractor' and 'Negative Extractor' properties on Document Types to immediately include or exclude types based on specific content.
  • Training-based classification analyzes the frequency and distribution of features (such as words, n-grams, or tokens) to measure similarity between documents and trained Document Types.
  • Image-based features can also be incorporated using an IP Profile, enabling classification based on visual patterns or graphics.
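
As a rough illustration of the training-based idea described above, the following Python sketch counts word and n-gram frequencies and compares a document to per-type profiles using cosine similarity. It is not Grooper's implementation: the function names, the choice of cosine similarity, and the sample data are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def extract_features(text, n=2):
    """Count lowercase words plus word n-grams (a hypothetical feature set)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    features = Counter(words)
    for i in range(len(words) - n + 1):
        features[" ".join(words[i:i + n])] += 1
    return features


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(count * b[feat] for feat, count in a.items() if feat in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples_by_type):
    """Sum the feature counts of every sample for each Document Type name."""
    profiles = {}
    for type_name, samples in samples_by_type.items():
        profile = Counter()
        for sample in samples:
            profile.update(extract_features(sample))
        profiles[type_name] = profile
    return profiles


# Made-up training samples, for illustration only.
profiles = train({
    "Invoice": ["invoice number total amount due", "invoice date amount due"],
    "Resume": ["work experience education skills", "skills and references"],
})
doc = extract_features("amount due on this invoice")
scores = {name: cosine_similarity(doc, p) for name, p in profiles.items()}
best = max(scores, key=scores.get)
```

In this toy data set the document shares several words and one phrase with the "Invoice" samples and nothing with the "Resume" samples, so "Invoice" receives the higher score.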

Usage and Configuration

  • Assign Lexical as the 'Classification Method' on a Content Model to enable this approach.
  • Configure the 'Text Feature Extractor' to define which textual features are used for training and classification. This may include words, phrases, or custom tokens.
  • Optionally, set an 'Image Feature Extractor' (IP Profile) to include image-based features in the classification process.
  • Use the 'Train As' and 'Train From' commands to provide training samples for each Document Type.
  • Adjust weighting options such as 'Use Class Frequency', 'Use Confidence', and 'Frequency Scaling' to fine-tune the model's sensitivity to feature frequency and confidence.
  • Caching options ('Maximum Age', 'Maximum Idle Time') control how long the classification model remains in memory before being refreshed.
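
The caching options can be pictured with a small Python sketch. This page does not spell out the exact semantics, so the sketch assumes that 'Maximum Age' limits the time since the model was loaded and 'Maximum Idle Time' limits the time since it was last used. The `ModelCache` class and its parameters are hypothetical and are not part of Grooper.

```python
import time


class ModelCache:
    """Holds one loaded model and reloads it when too old or idle too long."""

    def __init__(self, loader, max_age, max_idle, clock=time.monotonic):
        self._loader = loader      # callable that builds or loads the model
        self._max_age = max_age    # assumed meaning: seconds allowed since load
        self._max_idle = max_idle  # assumed meaning: seconds allowed since last use
        self._clock = clock        # injectable so the behavior can be tested
        self._model = None
        self._loaded_at = 0.0
        self._last_used = 0.0

    def get(self):
        """Return the cached model, reloading it first if it has expired."""
        now = self._clock()
        expired = (
            self._model is None
            or now - self._loaded_at > self._max_age
            or now - self._last_used > self._max_idle
        )
        if expired:
            self._model = self._loader()
            self._loaded_at = now
        self._last_used = now
        return self._model
```

Under these assumptions, a short idle limit frees memory quickly on lightly used systems, while the age limit ensures that a busy system still picks up retrained models eventually.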

How Lexical Classification Works

  1. When classifying a Batch Folder or Batch Page, rules-based extractors are evaluated first. If a positive or negative match is found, the result is determined immediately.
  2. If no rules apply, the method extracts features from the document using the configured extractors.
  3. The extracted features are compared to the trained models for each Document Type, calculating a similarity score.
  4. The Document Type with the highest similarity (above any configured thresholds) is assigned to the document.
  5. If image features are enabled, they are combined with text features to improve accuracy for visually distinctive documents.
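
The decision flow in steps 1 through 4 can be sketched in Python as follows. This is a simplified model, not Grooper code: regular expressions stand in for the 'Positive Extractor' and 'Negative Extractor' properties, the `DocType` class, `classify` function, and `threshold` parameter are hypothetical, and `keyword_overlap` is a toy substitute for a trained model. Checking the negative rule before the positive rule is also an assumption. Image features (step 5) are left out.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocType:
    name: str
    positive: Optional[str] = None  # regex standing in for a 'Positive Extractor'
    negative: Optional[str] = None  # regex standing in for a 'Negative Extractor'


def classify(text, doc_types, similarity, threshold=0.5):
    """Return the name of the matching type, or None when nothing qualifies."""
    candidates = []
    for dt in doc_types:
        # Negative rule: exclude this type outright.
        if dt.negative and re.search(dt.negative, text, re.IGNORECASE):
            continue
        # Positive rule: assign immediately and skip the trained comparison.
        if dt.positive and re.search(dt.positive, text, re.IGNORECASE):
            return dt.name
        candidates.append(dt)

    # No rule decided the result, so fall back to similarity scoring.
    best_name, best_score = None, threshold
    for dt in candidates:
        score = similarity(text, dt.name)
        if score >= best_score:
            best_name, best_score = dt.name, score
    return best_name


def keyword_overlap(text, type_name):
    """Toy stand-in for a trained model: fraction of a type's keywords present."""
    keywords = {
        "Invoice": {"invoice", "total", "due"},
        "Resume": {"skills", "education"},
    }
    words = set(text.lower().split())
    return len(words & keywords[type_name]) / len(keywords[type_name])


doc_types = [
    DocType("Invoice", positive=r"\binvoice\s+no\b"),
    DocType("Resume", negative=r"\bamount due\b"),
]
by_rule = classify("Invoice No 1234", doc_types, keyword_overlap)
by_score = classify("education and skills", doc_types, keyword_overlap)
unmatched = classify("skills education amount due", doc_types, keyword_overlap)
```

Here `by_rule` is decided by the positive rule alone, `by_score` falls through to the similarity comparison, and `unmatched` stays unclassified because "Resume" is excluded by its negative rule and "Invoice" scores below the threshold.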

Practical Guidance

  • For small, structured document sets, rules-based classification is often sufficient and easy to configure.
  • For large or unstructured sets, training-based classification is recommended. Use high-quality, representative samples for best results.
  • Combine both approaches to maximize accuracy: use rules to handle clear cases and training to resolve ambiguous or complex documents.
  • Use stop word filtering and stemming in the 'Text Feature Extractor' to improve model quality and reduce noise.
  • Monitor and adjust weighting and caching settings as needed to balance performance and accuracy.
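
To show why stop word filtering and stemming reduce noise, the following Python sketch normalizes text before features are counted. The stop word list and the suffix-stripping stemmer are deliberately naive and purely illustrative. They do not reflect how the 'Text Feature Extractor' is implemented, and the names `STOP_WORDS`, `naive_stem`, and `normalize` are hypothetical.

```python
import re

# Small illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

# Suffixes stripped by the naive stemmer, longest first.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like "address" alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

With this sketch, "Receiving the payments" and "received payment" both reduce to the same two tokens, so variations in wording no longer look like different features to the model.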

Related Concepts

For more information, see the documentation for Content Model, Document Type, Classify activity, and training commands.

Properties

Name | Type | Description
General
Weighting Options
Caching

See Also

Used By

Notification