Grooper Help - Version 25.0
25.0.0017 2,127

Rules-Based

ESP Classify Method Grooper.Core

Classifies documents using rules defined on Document Type objects, enabling immediate and deterministic type assignment based on content extractors.

Remarks

The Rules-Based classification method assigns Document Types by evaluating explicit rules configured on each type, rather than relying on training or statistical models.

Overview

  • Rules-Based classification uses 'Positive Extractor', 'Negative Extractor', and optionally 'Secondary Page Extractor' properties on each Document Type to determine matches.
  • Classification is immediate and deterministic: if a rule matches, the corresponding Document Type is assigned without further analysis.
  • This approach is ideal for document sets with clear, consistent identifiers or patterns that can be reliably extracted.

How It Works

  1. For each Document Type in scope, the system evaluates the configured extractors against the document content.
  2. If a 'Negative Extractor' finds a match, the type is excluded from consideration.
  3. If a 'Positive Extractor' finds a match, the type becomes a candidate, and the highest confidence score among its matches is recorded.
  4. For page-level classification, a 'Secondary Page Extractor' can be used to identify document types on subsequent pages.
  5. The Document Type with the highest confidence (or the first positive match, depending on configuration) is assigned to the document or page.
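
The steps above can be sketched in a few lines of Python. This is an illustration of the evaluation order only, not Grooper's implementation or API: the `DocumentType` class, the `regex_extractor` helper, and the confidence values are hypothetical stand-ins, and the tie-breaking rule (first type listed wins) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in for an extractor: given text, return one
# confidence score (0.0 to 1.0) per match found.
Extractor = Callable[[str], List[float]]


def regex_extractor(pattern: str, confidence: float = 1.0) -> Extractor:
    """Build a toy extractor that reports a fixed confidence per regex match."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def run(text: str) -> List[float]:
        return [confidence for _ in compiled.finditer(text)]

    return run


@dataclass
class DocumentType:
    """Illustrative model of a type carrying positive and negative rules."""
    name: str
    positive: Optional[Extractor] = None
    negative: Optional[Extractor] = None


def classify(text: str, types: List[DocumentType]) -> Optional[str]:
    """Return the name of the best-scoring candidate type, or None."""
    best_name: Optional[str] = None
    best_score = 0.0
    for doc_type in types:
        # Step 2: a negative match excludes the type outright.
        if doc_type.negative and doc_type.negative(text):
            continue
        # Step 3: a positive match makes the type a candidate;
        # keep only its highest confidence score.
        scores = doc_type.positive(text) if doc_type.positive else []
        if scores and max(scores) > best_score:
            # Step 5: the highest confidence wins (ties keep the earlier type).
            best_name, best_score = doc_type.name, max(scores)
    return best_name


# Illustrative rules; the patterns and scores are made up for this sketch.
EXAMPLE_TYPES = [
    DocumentType(
        "Invoice",
        positive=regex_extractor(r"\binvoice\b", 0.90),
        negative=regex_extractor(r"\bcredit memo\b"),
    ),
    DocumentType("Credit Memo", positive=regex_extractor(r"\bcredit memo\b", 0.95)),
]
```

With these example rules, a document reading "Credit Memo for Invoice 1001" is excluded from Invoice by the negative rule and assigned Credit Memo, while text that matches no positive rule is left unclassified.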

Configuration Guidance

  • Define 'Positive Extractor' rules to identify key phrases, patterns, or features unique to each Document Type.
  • Use 'Negative Extractor' rules to explicitly exclude types when certain content is present.
  • For multi-page documents, configure 'Secondary Page Extractor' to recognize types on non-first pages.
  • Rules can be based on regular expressions, value lists, or any supported Value Extractor.
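
As a rough illustration of how the three properties divide the work, the sketch below models each rule as a regular expression and labels individual pages. The `RULES` table, its key names, and the patterns are hypothetical. The ordering (first-page rules are tried before secondary-page rules) and the choice not to apply negative rules to secondary pages are assumptions made for this example, not documented behavior.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical rule table: one entry per Document Type. The keys mirror
# the three properties discussed above; the patterns are illustrative.
RULES: Dict[str, Dict[str, Optional[str]]] = {
    "Invoice": {
        "positive": r"\bInvoice\s+No\b",
        "negative": r"\bStatement\s+of\s+Account\b",
        "secondary_page": r"\bInvoice\s+continued\b",
    },
    "Statement": {
        "positive": r"\bStatement\s+of\s+Account\b",
        "negative": None,
        "secondary_page": r"\bStatement\s+continued\b",
    },
}


def matches(pattern: Optional[str], text: str) -> bool:
    """True when a pattern is configured and found in the text."""
    return pattern is not None and re.search(pattern, text, re.IGNORECASE) is not None


def label_page(text: str) -> Optional[Tuple[str, str]]:
    """Return (type name, 'first' or 'subsequent') for a page, or None."""
    # First pass: positive rules identify the first page of a document,
    # unless the type's negative rule also matches.
    for name, rule in RULES.items():
        if matches(rule["negative"], text):
            continue
        if matches(rule["positive"], text):
            return name, "first"
    # Second pass: secondary-page rules identify continuation pages.
    for name, rule in RULES.items():
        if matches(rule["secondary_page"], text):
            return name, "subsequent"
    return None
```

Keeping first-page and continuation patterns distinct, as in this table, avoids a continuation page being mistaken for the start of a new document.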

When to Use

  • Recommended for small, structured, or highly consistent document sets where type can be determined by explicit content rules.
  • Useful for rapid prototyping, quality control, or as a fallback when training data is insufficient for statistical classification.
  • Can be combined with training-based methods in a hybrid approach, using rules to handle clear cases and training for ambiguous or complex documents.
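
One way to picture the hybrid approach is a two-stage function: rules decide the clear cases, and a trained classifier is consulted only when no rule matches. The sketch below is a minimal illustration under stated assumptions. `hybrid_classify`, the `min_confidence` threshold, and both demo classifiers are hypothetical, and `demo_trained` is a stub standing in for a real trained model.

```python
import re
from typing import Callable, Optional, Tuple

RuleClassifier = Callable[[str], Optional[str]]
TrainedClassifier = Callable[[str], Tuple[str, float]]


def hybrid_classify(
    text: str,
    by_rules: RuleClassifier,
    by_training: TrainedClassifier,
    min_confidence: float = 0.8,
) -> Tuple[Optional[str], str]:
    """Return (type name or None, which stage produced the decision)."""
    # Stage 1: an explicit rule match is accepted immediately.
    rule_hit = by_rules(text)
    if rule_hit is not None:
        return rule_hit, "rules"
    # Stage 2: fall back to the trained classifier, but accept its
    # answer only when it is confident enough (threshold is an assumption).
    label, confidence = by_training(text)
    if confidence >= min_confidence:
        return label, "training"
    return None, "unclassified"


def demo_rules(text: str) -> Optional[str]:
    """Toy rule: a form title identifies the type unambiguously."""
    return "W-2" if re.search(r"\bForm\s+W-2\b", text) else None


def demo_trained(text: str) -> Tuple[str, float]:
    """Stub for a trained model; the returned scores are made up."""
    if "Dear" in text:
        return "Correspondence", 0.85
    return "Correspondence", 0.30
```

Returning the deciding stage alongside the type keeps the transparency of the rules-based path: a reviewer can see at a glance whether a rule or the trained model made the call.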

Practical Notes

  • Rules-based classification is fast and transparent, making it easy to understand and troubleshoot classification decisions.
  • Maintenance is straightforward: update extractors as document formats or requirements change.
  • For best results, ensure that rules are mutually exclusive, or define a clear precedence among them, to avoid ambiguous matches.
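
A simple way to check for ambiguous matches is to run every positive rule against a set of sample documents and report any sample that more than one type claims. The sketch below assumes the rules can be approximated as regular expressions; `find_overlaps` and the sample data are hypothetical and are not part of Grooper.

```python
import re
from typing import Dict, List


def find_overlaps(rules: Dict[str, str], samples: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each sample name to the types whose rule matches it,
    keeping only samples matched by more than one type."""
    overlaps: Dict[str, List[str]] = {}
    for sample_name, text in samples.items():
        hits = [
            type_name
            for type_name, pattern in rules.items()
            if re.search(pattern, text, re.IGNORECASE)
        ]
        if len(hits) > 1:
            overlaps[sample_name] = hits
    return overlaps


# Illustrative positive rules and sample texts.
POSITIVE_RULES = {
    "Invoice": r"\binvoice\b",
    "Credit Memo": r"\bcredit memo\b",
}
SAMPLES = {
    "a.txt": "Invoice 1001",
    "b.txt": "Credit memo against invoice 1001",
}
```

Here `find_overlaps(POSITIVE_RULES, SAMPLES)` flags only `b.txt`, which both rules match. That is the kind of case a negative rule on Invoice, or an explicit precedence, would resolve.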

See Also

Used By

Notification