Lexical

Inherits From: ESP Classify Method
Namespace: Grooper.Core

Provides a classification method that assigns document types based on text and/or image features, supporting both training-based and rules-based approaches.

Remarks

The Lexical classification method enables Grooper to identify document types by analyzing the textual and visual content of documents. It supports both rules-based and training-based classification, allowing for flexible configuration to match a wide range of document sets.

Overview

Lexical classification can use hand-coded rules, machine learning from training samples, or a combination of both.
Rules-based classification uses 'Positive Extractor' and 'Negative Extractor' properties on Document Types to immediately include or exclude types based on specific content.
Training-based classification analyzes the frequency and distribution of features (such as words, n-grams, or tokens) to measure similarity between documents and trained Document Types.
Image-based features can also be incorporated using an IP Profile, enabling classification based on visual patterns or graphics.

Usage and Configuration

Assign Lexical as the 'Classification Method' on a Content Model to enable this approach.
Configure the 'Text Feature Extractor' to define which textual features are used for training and classification. This may include words, phrases, or custom tokens.
Optionally, set an 'Image Feature Extractor' (IP Profile) to include image-based features in the classification process.
Use the Train As and Train From commands to provide training samples for each Document Type.
Adjust weighting options such as 'Use Class Frequency', 'Use Confidence', and 'Frequency Scaling' to fine-tune the model's sensitivity to feature frequency and confidence.
Caching options ('Maximum Age', 'Maximum Idle Time') control how long the classification model remains in memory before being refreshed.

How Lexical Classification Works

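When classifying a Batch Folder or Batch Page, rules-based extractors are evaluated first. A 'Positive Extractor' match immediately includes its Document Type and a 'Negative Extractor' match immediately excludes it, so the result is determined without comparing features.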
If no rules apply, the method extracts features from the document using the configured extractors.
The extracted features are compared to the trained models for each Document Type, calculating a similarity score.
The Document Type with the highest similarity (above any configured thresholds) is assigned to the document.
If image features are enabled, they are combined with text features to improve accuracy for visually distinctive documents.
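
The flow described above can be sketched in a few lines of Python. This is an illustration only, not Grooper's implementation: the `classify` function, the dictionary keys (`positive`, `negative`, `profile`), the phrase-based rules, the use of cosine similarity, and the `threshold` default are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, doc_types, threshold=0.5):
    """Return the best-matching type name, or None if nothing qualifies.

    doc_types maps a type name to a dict with hypothetical keys:
    'positive' / 'negative' (phrases standing in for rule extractors)
    and 'profile' (a Counter of trained feature counts).
    """
    lowered = text.lower()

    # Step 1: rules first. A negative match excludes a type.
    excluded = {
        name for name, spec in doc_types.items()
        if any(p in lowered for p in spec.get("negative", []))
    }
    # A positive match on a non-excluded type decides the result at once.
    for name, spec in doc_types.items():
        if name not in excluded and any(p in lowered for p in spec.get("positive", [])):
            return name

    # Step 2: no rule decided, so compare features with each trained profile.
    features = Counter(lowered.split())
    best_name, best_score = None, threshold
    for name, spec in doc_types.items():
        if name in excluded:
            continue
        score = cosine_similarity(features, spec.get("profile", Counter()))
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

A document that matches no rule and scores below the threshold for every type is left unclassified (`None`) in this sketch.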

Practical Guidance

For small, structured document sets, rules-based classification is often sufficient and easy to configure.
For large or unstructured sets, training-based classification is recommended. Use high-quality, representative samples for best results.
Combine both approaches to maximize accuracy: use rules to handle clear cases and training to resolve ambiguous or complex documents.
Use stop word filtering and stemming in the 'Text Feature Extractor' to improve model quality and reduce noise.
Monitor and adjust weighting and caching settings as needed to balance performance and accuracy.

Related Concepts

Content Model: Organizes document types and classification logic.
Document Type: The target type assigned by classification.
Batch Folder, Batch Page: The items being classified.
IP Profile: Used for extracting image-based features.
Train As, Train From: Commands for providing training data.
Classify: Activity or command that invokes classification.

For more information, see the documentation for Content Model, Document Type, Classify activity, and training commands.

Properties

Name Type Description

General

Text Feature Extractor

Value Extractor

Specifies the Value Extractor used to identify features for training-based classification.

Can be one of the following types:

Value	Description
Reference	Delegates extraction to another configured extractor, enabling reuse and centralization of extraction logic.
AI Column Extractor	Extracts structured content from documents with two-column layouts.
AI Schema Extractor	Extracts structured data from documents using a large language model (LLM) guided by a user-defined JSON schema.
Ask AI	Executes a completion using a large language model (LLM) and returns one hit for each choice in the response.
Detect Signature	Detects a signature within a specified region of a document page by measuring the percentage of the area that is filled.
Entity Recognition	Identifies and categorizes entities such as people, organizations, locations, and quantities in unstructured text.
Field Match	Matches the value stored in a previously-extracted field or table column.
Find Barcode	Searches for barcode values in document Layout Data previously detected during image processing.
Highlight Zone	Defines a region of a document to be visually highlighted, without extracting any data values.
Key Phrase Extraction	Identifies key concepts and topics in text using Azure AI Language key phrase extraction.
Label Match	Matches a list of one or more label values, using matching options defined by a Labeling Behavior.
Labeled OMR	Reads a group of one or more checkboxes located nearby text labels.
Labeled Value	Extracts a field presented as a label-value pair within a document, associating labels and values based on their spatial relationship.
List Match	Extracts values from document text that match any entry in a list of search terms.
Ordered OMR	Reads one or more checkboxes with a consistent order of appearance inside a rectangular region.
Pattern Match	Extracts values from document text that match a specified regular expression pattern.
Pii Entity Recognition	Identifies, categorizes, and redacts sensitive information (PII) in unstructured text using Azure AI Language Services.
Query HTML	Extracts values from an HTML document using a CSS or XPath selector.
Query XML	Extracts values from XML documents using XPath queries, enabling structured data extraction from XML content in Grooper.
Read Barcode	Extracts barcode values from document images using configurable barcode recognition.
Read Metadata	Reads a metadata value from a document by accessing a property on an attachment or content link.
Read Zone	Extracts text content from a specified rectangular region (zone) of a document.
Select Page	Selects and outputs the full content of one or more pages from a document, based on page number and/or content criteria.
Word Match	Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.
Zonal OMR	Reads one or more checkboxes using manually-configured zones.

The 'Text Feature Extractor' defines how Grooper analyzes the textual content of documents to extract features for classification. Each value matched by this extractor is considered a feature, and the set of features is used to compare documents to trained Document Types.

How It Works

During training, the extractor is applied to sample documents to build a feature profile for each Document Type.
During classification, the extractor is applied to the document being classified, and the resulting features are compared to the trained profiles.
Features can be individual words, n-grams (sequences of words), or custom tokens representing patterns, data types, or semantic elements.

Feature Types

Unigrams (Single Words):
The most common feature type. Each word is treated as a feature.
- Tip: Use stop word filtering to remove common words (e.g., "and", "the") that do not help distinguish document types.
- Tip: Enable stemming to reduce words to their root form (e.g., "insured", "insuring" → "insur").
n-Grams (Bigrams, Trigrams, etc.):
Sequences of adjacent words. Useful for capturing phrases or context (e.g., "oil well").
- Note: n-gram extraction increases processing time and should be used only when needed.
Custom Tokens:
Extractors can match patterns such as VIN numbers, dates, or other entities and return a token (e.g., "VIN_Number") as a feature.
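
The three feature types can be illustrated with a small Python sketch. It is not how a Grooper extractor is configured; the stop word list, the simplified VIN pattern, and the `extract_features` function are assumptions chosen for the example, and stemming is left out for brevity.

```python
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}  # tiny illustrative list
# Simplified VIN shape: 17 characters, excluding the letters I, O and Q.
VIN_PATTERN = re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b")


def extract_features(text: str, max_n: int = 2) -> list[str]:
    """Return custom tokens, unigrams, and n-grams up to max_n words long."""
    # Custom tokens: every VIN-like match becomes the same token.
    features = ["VIN_Number" for _ in VIN_PATTERN.finditer(text)]

    # Remove the matched values so they are not also counted as words.
    remainder = VIN_PATTERN.sub(" ", text)

    # Unigrams: lowercase words with stop words filtered out.
    words = [w for w in re.findall(r"[a-z]+", remainder.lower()) if w not in STOP_WORDS]

    # n-grams are built from the filtered word list, so a phrase may span
    # a removed stop word in this sketch.
    for n in range(1, max_n + 1):
        features += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return features
```

Returning the token "VIN_Number" instead of the matched value means every document containing a VIN shares one feature, regardless of the specific number.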

Configuration Guidance

Assign a Value Extractor that matches the most distinctive and relevant features for your document set.
Use built-in extractors for common patterns, or create custom extractors for specialized needs.
Combine multiple extraction strategies (e.g., words + tokens) for best results.
For best accuracy, filter out noise (stop words) and normalize features (stemming, case normalization).

Practical Examples

Simple Word Extraction:
Use a pattern match extractor to find all words, excluding stop words.
Phrase Extraction:
Configure the extractor to return bigrams or trigrams for documents where phrases are more distinctive than single words.
Entity Extraction:
Use a regular expression to match VIN numbers, invoice numbers, or other identifiers, and return a token for each match.

Best Practices

For small, structured document sets, focus on unique words or phrases.
For large or unstructured sets, use a combination of word, phrase, and entity extraction.
Regularly review and refine the extractor as your document set evolves.

For more information, see the documentation for Value Extractor, Document Type, and Content Model.

Image Feature Extractor

IP Profile

An optional IP Profile used to extract image-based features for classification.

EPI Extractor

Value Extractor

Defines an extractor which is used by ESP Auto Separation to find page numbers embedded in the document content.

Can be one of the following types:

Value	Description
Reference	Delegates extraction to another configured extractor, enabling reuse and centralization of extraction logic.
AI Column Extractor	Extracts structured content from documents with two-column layouts.
AI Schema Extractor	Extracts structured data from documents using a large language model (LLM) guided by a user-defined JSON schema.
Ask AI	Executes a completion using a large language model (LLM) and returns one hit for each choice in the response.
Detect Signature	Detects a signature within a specified region of a document page by measuring the percentage of the area that is filled.
Entity Recognition	Identifies and categorizes entities such as people, organizations, locations, and quantities in unstructured text.
Field Match	Matches the value stored in a previously-extracted field or table column.
Find Barcode	Searches for barcode values in document Layout Data previously detected during image processing.
Highlight Zone	Defines a region of a document to be visually highlighted, without extracting any data values.
Key Phrase Extraction	Identifies key concepts and topics in text using Azure AI Language key phrase extraction.
Label Match	Matches a list of one or more label values, using matching options defined by a Labeling Behavior.
Labeled OMR	Reads a group of one or more checkboxes located nearby text labels.
Labeled Value	Extracts a field presented as a label-value pair within a document, associating labels and values based on their spatial relationship.
List Match	Extracts values from document text that match any entry in a list of search terms.
Ordered OMR	Reads one or more checkboxes with a consistent order of appearance inside a rectangular region.
Pattern Match	Extracts values from document text that match a specified regular expression pattern.
Pii Entity Recognition	Identifies, categorizes, and redacts sensitive information (PII) in unstructured text using Azure AI Language Services.
Query HTML	Extracts values from an HTML document using a CSS or XPath selector.
Query XML	Extracts values from XML documents using XPath queries, enabling structured data extraction from XML content in Grooper.
Read Barcode	Extracts barcode values from document images using configurable barcode recognition.
Read Metadata	Reads a metadata value from a document by accessing a property on an attachment or content link.
Read Zone	Extracts text content from a specified rectangular region (zone) of a document.
Select Page	Selects and outputs the full content of one or more pages from a document, based on page number and/or content criteria.
Word Match	Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.
Zonal OMR	Reads one or more checkboxes using manually-configured zones.

The provided Extractor must define and output a group named 'PageNo', and optionally may define and output a group named 'PageCount'. For example, if the document set contains page numbering like 'Page 1 of 4', the following pattern would generate the required group names: `Page (?<PageNo>\d+) of (?<PageCount>\d+)`
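
To see what the two named groups capture, the pattern can be tried outside Grooper. The Python sketch below is only a way to experiment with the expression; Python's `re` module spells named groups `(?P<name>...)` rather than the `(?<name>...)` form shown above, and `find_page_numbers` is a hypothetical helper.

```python
import re

# Same pattern as in the text, rewritten with Python's (?P<name>...) syntax.
PAGE_PATTERN = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")


def find_page_numbers(text: str):
    """Return (page_no, page_count) from the first match, or None."""
    match = PAGE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group("PageNo")), int(match.group("PageCount"))
```

For a footer reading "Page 1 of 4", the 'PageNo' group holds 1 and the 'PageCount' group holds 4.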

Bullet Extractor

Value Extractor

Defines an extractor which is used to capture bullet numbers (or letters) embedded in the document.

Can be one of the following types:

Value	Description
Field Match	Matches the value stored in a previously-extracted field or table column.
Detect Signature	Detects a signature within a specified region of a document page by measuring the percentage of the area that is filled.
Labeled Value	Extracts a field presented as a label-value pair within a document, associating labels and values based on their spatial relationship.
Label Match	Matches a list of one or more label values, using matching options defined by a Labeling Behavior.
Read Metadata	Reads a metadata value from a document by accessing a property on an attachment or content link.
Select Page	Selects and outputs the full content of one or more pages from a document, based on page number and/or content criteria.
Find Barcode	Searches for barcode values in document Layout Data previously detected during image processing.
Highlight Zone	Defines a region of a document to be visually highlighted, without extracting any data values.
Labeled OMR	Reads a group of one or more checkboxes located nearby text labels.
Ordered OMR	Reads one or more checkboxes with a consistent order of appearance inside a rectangular region.
Read Barcode	Extracts barcode values from document images using configurable barcode recognition.
Read Zone	Extracts text content from a specified rectangular region (zone) of a document.
Reference	Delegates extraction to another configured extractor, enabling reuse and centralization of extraction logic.
Zonal OMR	Reads one or more checkboxes using manually-configured zones.
List Match	Extracts values from document text that match any entry in a list of search terms.
Pattern Match	Extracts values from document text that match a specified regular expression pattern.
Word Match	Extracts individual words and multi-word phrases (N-grams) from document text for use in classification, data extraction, and normalization.
Query XML	Extracts values from XML documents using XPath queries, enabling structured data extraction from XML content in Grooper.
Query HTML	Extracts values from an HTML document using a CSS or XPath selector.
AI Schema Extractor	Extracts structured data from documents using a large language model (LLM) guided by a user-defined JSON schema.
Ask AI	Executes a completion using a large language model (LLM) and returns one hit for each choice in the response.
AI Column Extractor	Extracts structured content from documents with two-column layouts.
Entity Recognition	Identifies and categorizes entities such as people, organizations, locations, and quantities in unstructured text.
Key Phrase Extraction	Identifies key concepts and topics in text using Azure AI Language key phrase extraction.
Pii Entity Recognition	Identifies, categorizes, and redacts sensitive information (PII) in unstructured text using Azure AI Language Services.

The bullet extractor is used during ESP Auto Separation when analyzing Unstructured Document Types. In some cases, it is possible to determine that two pages are part of the same document based on the numbering in bulleted lists. For example, consider a page containing the bullet numbers shown below:

1. Definitions. Paragraph text...
2. Parties. Paragraph text...
3. Effective Date. Paragraph text...

If the following page contains these bullet numbers, this is evidence that both pages are part of the same document.

4. Severability. Paragraph text...
5. Term. Paragraph text...
6. Confidentiality. Paragraph text...

The bullet extractor should return only individual letters or numeric digits. Surrounding punctuation should be excluded from the output value.
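
The idea can be sketched in Python as follows. This is a simplified illustration, not the logic ESP Auto Separation uses: the bullet pattern, the function names, and the rule that the next page must start exactly one number after the previous page ends are all assumptions, and only numeric bullets are compared.

```python
import re

# Captures the number or letter of a bullet at the start of a line and leaves
# out surrounding punctuation such as "1.", "2)" or "(a)".
BULLET_PATTERN = re.compile(r"^\s*\(?([0-9]+|[A-Za-z])[.)]\s", re.MULTILINE)


def extract_bullets(page_text: str) -> list[str]:
    """Return bullet numbers or letters without punctuation."""
    return BULLET_PATTERN.findall(page_text)


def continues_numbering(previous_page: str, next_page: str) -> bool:
    """True if the next page's first numeric bullet follows the previous page's last."""
    prev = [int(b) for b in extract_bullets(previous_page) if b.isdigit()]
    nxt = [int(b) for b in extract_bullets(next_page) if b.isdigit()]
    return bool(prev and nxt) and nxt[0] == prev[-1] + 1
```

Applied to the two example pages above, the first page yields 1, 2, 3 and the second yields 4, 5, 6, so the check reports a continuation.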

Weighting Options

Use Class Frequency

Boolean

When enabled, class frequency (CF) is considered in feature weighting for classification.

Use Confidence

Boolean

Specifies whether the confidence of each feature instance is used in term frequency calculations.
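
One way to picture this setting is shown below. The sketch assumes that, when the option is on, each feature instance contributes its confidence (a value between 0 and 1) instead of a full count of 1; that interpretation and the `feature_counts` helper are assumptions for illustration, not a description of Grooper's internals.

```python
from collections import defaultdict


def feature_counts(hits, use_confidence: bool) -> dict[str, float]:
    """Accumulate per-feature counts from (feature, confidence) pairs."""
    counts = defaultdict(float)
    for feature, confidence in hits:
        # With confidence enabled, a weak hit counts for less than a strong one.
        counts[feature] += confidence if use_confidence else 1.0
    return dict(counts)
```

Under this assumption, two hits for "invoice" with confidences 1.0 and 0.5 count as 2.0 with the option off and 1.5 with it on.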

Term Frequency Mode

TfModes

Specifies how the term frequency (TF) of features should be calculated for classification models.

Can be one of the following values:

Name	Value	Description
Normal	0	Normalizes term frequency to the size of the document. Each feature's count is divided by the total number of features in the document. This mode provides matching that is independent of document size, so short and long documents are compared on equal footing. Note: In some cases, this can produce false positives, as a very short document may match a long document with high similarity if the same features are present.
Logarithmic	1	Scales term frequency logarithmically. Applies a logarithmic function to feature counts, reducing the impact of very frequent features. Useful for legacy models or when you want to minimize the influence of repeated features. Provided primarily for backward compatibility.
Augmented	2	Normalizes term frequency to the most-frequent feature in the document, with optional damping. Each feature's count is divided by the count of the most frequent feature, then scaled by the 'Frequency Scaling' property. This mode allows document size and feature repetition to play a greater role in classification. The damping factor ('Frequency Scaling') controls how much frequency matters: higher values increase the effect, lower values reduce it. Useful when you want to tune the sensitivity to repeated features or document length.

The 'TfModes' enumeration controls the method used to compute term frequency (TF) for features during document classification. The choice of mode affects how feature occurrences are weighted, which in turn influences similarity calculations between documents and trained Document Types.

Overview

Term Frequency (TF): Measures how often a feature (such as a word or token) appears in a document, relative to the document's size or other features.
The selected mode determines how TF is normalized or scaled, impacting sensitivity to document length, feature repetition, and feature prominence.

Available Modes

Normal:
Normalizes feature counts by the total number of features in the document, making TF independent of document size.
Logarithmic:
Applies a logarithmic scale to feature counts, reducing the impact of very frequent features. Provided for backward compatibility.
Augmented:
Normalizes feature counts by the most frequent feature in the document, allowing document size to play a greater role in classification. Includes a damping factor controlled by the 'Frequency Scaling' property.

Practical Guidance

Use 'Normal' for most scenarios where document length should not affect classification.
Use 'Logarithmic' for legacy models or when you want to reduce the influence of highly repetitive features.
Use 'Augmented' when you want longer documents or repeated features to have more influence, or when tuning with the 'Frequency Scaling' property.
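
The three modes can be compared side by side with a short Python sketch. Only the 'Normal' formula is stated fully in the descriptions above. The `1 + ln(count)` form used for 'Logarithmic', the `(1 - s) + s * count / max_count` damping form used for 'Augmented', and the `frequency_scaling` default of 0.5 are assumptions chosen to match the described behavior, not confirmed formulas.

```python
from collections import Counter
from math import log


def term_frequencies(counts: Counter, mode: str = "Normal",
                     frequency_scaling: float = 0.5) -> dict[str, float]:
    """Compute a term frequency per feature for one document."""
    if not counts:
        return {}
    if mode == "Normal":
        # Count divided by the total number of features in the document.
        total = sum(counts.values())
        return {f: c / total for f, c in counts.items()}
    if mode == "Logarithmic":
        # Assumed form: repeated features grow slowly instead of linearly.
        return {f: 1.0 + log(c) for f, c in counts.items()}
    if mode == "Augmented":
        # Count divided by the most frequent feature's count, then damped.
        # Assumed form: s = 0 ignores frequency, s = 1 uses the raw ratio.
        top = max(counts.values())
        s = frequency_scaling
        return {f: (1.0 - s) + s * c / top for f, c in counts.items()}
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch, a larger `frequency_scaling` widens the gap between frequent and infrequent features, which mirrors the statement that higher values increase the effect of frequency.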

Frequency Scaling

Double

When 'Term Frequency Mode' is set to 'Augmented', controls how much feature frequency affects classification.

Document Frequency Mode

IdfModes

Specifies how the inverse document frequency (IDF) of features should be calculated for classification models.

Can be one of the following values:

Name	Value	Description
Normal	0	Standard IDF calculation. Calculates IDF as `log(TotalClasses / ClassCount)`, where 'TotalClasses' is the number of Document Types and 'ClassCount' is the number of types containing the feature. Features that appear in only one Document Type receive the highest weight. Use this mode when you want rare features to have maximum influence on classification.
Smooth	1	Smoothed IDF calculation. Calculates IDF as `log(1 + TotalClasses / ClassCount)`, adding a smoothing factor to prevent division by zero and reduce extreme weights. Useful when the number of Document Types is small or when you want to avoid overemphasizing rare features. Provides more stable and balanced feature weighting in edge cases.

The 'IdfModes' enumeration controls the method used to compute inverse document frequency (IDF) for features during document classification. IDF measures how unique or common a feature is across all Document Types, and is a key component in weighting features for similarity calculations.

Overview

Inverse Document Frequency (IDF):
Reduces the weight of features that are common across many Document Types, and increases the weight of features that are rare or distinctive.
The selected mode determines whether standard or smoothed IDF is used, which can affect the handling of rare or ubiquitous features.

Available Modes

Normal:
Uses the standard IDF calculation, which may assign very high weights to features that appear in only one Document Type.
Smooth:
Adds smoothing to the IDF calculation, preventing extreme weights for rare features and improving stability when the number of Document Types is small.

Practical Guidance

Use 'Normal' for most scenarios where the number of Document Types is moderate to large and rare features should be highly weighted.
Use 'Smooth' when you want to avoid extreme weighting for features that appear in only one or very few Document Types, or when working with small sets of types.
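
The two formulas from the table can be evaluated directly to compare the modes. The sketch below assumes the natural logarithm, since the base is not stated, and the `idf` function name is hypothetical.

```python
from math import log


def idf(total_classes: int, class_count: int, mode: str = "Normal") -> float:
    """Inverse document frequency of a feature seen in class_count of total_classes types."""
    if not 0 < class_count <= total_classes:
        raise ValueError("class_count must be between 1 and total_classes")
    ratio = total_classes / class_count
    if mode == "Normal":
        return log(ratio)        # log(TotalClasses / ClassCount)
    if mode == "Smooth":
        return log(1 + ratio)    # log(1 + TotalClasses / ClassCount)
    raise ValueError(f"unknown mode: {mode}")
```

With 10 Document Types, a feature found in all 10 gets a weight of 0 under 'Normal' and about 0.693 under 'Smooth', while a feature found in only one type gets about 2.303 and 2.398 respectively. The smoothed weights therefore span a narrower relative range.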

For more information, see the documentation for Lexical, Document Type, and 'Document Frequency Mode'.

Caching

Maximum Age

String

Specifies the maximum time the classification model remains cached before being reloaded.

Maximum Idle Time

String

Specifies the maximum period of inactivity before the classification model is reloaded.
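
The interaction between the two limits can be sketched as a small cache wrapper. This is a conceptual illustration only: the `CachedModel` class, its parameters, and the use of seconds are assumptions, and the sketch says nothing about the string format Grooper expects for these properties.

```python
import time


class CachedModel:
    """Holds a loaded model and reloads it when it is too old or too idle."""

    def __init__(self, loader, max_age, max_idle, clock=time.monotonic):
        self._loader = loader        # callable that builds the model
        self._max_age = max_age      # seconds since the last load
        self._max_idle = max_idle    # seconds since the last use
        self._clock = clock
        self._model = None
        self._loaded = False
        self._loaded_at = 0.0
        self._last_used = 0.0

    def get(self):
        now = self._clock()
        stale = (
            not self._loaded
            or now - self._loaded_at > self._max_age
            or now - self._last_used > self._max_idle
        )
        if stale:
            self._model = self._loader()
            self._loaded = True
            self._loaded_at = now
        self._last_used = now
        return self._model
```

In this sketch, a model in constant use is still reloaded once it exceeds the maximum age, and a model that sits unused is reloaded after the idle limit even if it is young.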

Used By

Content Model
