Grooper Help - Version 25.0
25.0.0017 2,127

HTML Document - Condition HTML

HTML Document Command Grooper.Messaging

Performs cleanup and normalization of HTML documents.

Remarks

The Condition HTML command is used to clean, normalize, and optimize HTML documents before they are indexed or processed by downstream activities such as vector indexing and Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs).

This command is especially useful when ingesting web content, as it allows you to:

  • Remove unwanted or irrelevant HTML elements (such as navigation menus, sidebars, advertisements, or footers) using CSS-style selectors.
  • Replace the entire <body> of the HTML document with a specific element (e.g., the main content area), which is helpful for focusing on the most relevant information and discarding headers/footers.
  • Convert all relative URLs (e.g., in links or images) to absolute URLs using a specified site root, ensuring that links remain functional when content is viewed or processed outside its original context.
  • Apply custom attribute rules to HTML tags, such as adding classes or data attributes for downstream styling or processing.
  • Wrap specific text patterns in custom HTML tags, which can be used to highlight, annotate, or segment content for improved search and retrieval.
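
The URL conversion described above can be illustrated with a minimal Python sketch. It is not Grooper's implementation: the helper name `absolutize_urls` is hypothetical, and it assumes that only double-quoted `href` and `src` attributes need rewriting. It resolves each value against the site root with the standard library's `urljoin`, which leaves already-absolute URLs unchanged.

```python
import re
from urllib.parse import urljoin

# Simplifying assumption: only double-quoted href/src attribute values are rewritten.
_URL_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def absolutize_urls(html: str, site_url: str) -> str:
    """Resolve relative href/src values against site_url (hypothetical helper)."""

    def repl(match: re.Match) -> str:
        attr, value = match.group(1), match.group(2)
        # urljoin returns the value unchanged when it is already absolute.
        return f'{attr}="{urljoin(site_url, value)}"'

    return _URL_ATTR.sub(repl, html)


sample = '<a href="/docs/intro.html">Intro</a> <img src="img/logo.png">'
print(absolutize_urls(sample, "https://example.com/"))
```

A production version would parse the HTML rather than rely on a regular expression, which can also match attribute names that merely end in `src` (such as `data-src`).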

### Typical Use Cases

  • Web Content Ingestion: Clean up crawled or scraped web pages to remove navigation, ads, or other boilerplate before indexing.
  • RAG/LLM Optimization: Present only the most relevant, context-rich content to the LLM, improving retrieval accuracy and reducing noise.
  • Link Normalization: Ensure that all hyperlinks and media references are absolute, so that content remains portable and functional in Grooper or exported environments.
  • Custom Tagging: Add semantic tags or attributes to support downstream classification, extraction, or UI rendering.

### Best Practices

  • Use the BodySelector to focus on the main content area (e.g., "main", ".article-content").
  • Use the RemovalSelector to strip out repetitive or irrelevant elements (e.g., "nav, .sidebar, .ad-banner").
  • Set SiteUrl to the root of the source website so that all relative links are converted to absolute URLs.
  • Define AttributeRules and WrapRules to further enrich or segment the HTML as needed for your use case.
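
To make the effect of a removal selector concrete, here is a rough Python sketch of selector-based removal. It is an illustration only, not how Grooper processes RemovalSelector: the `ElementRemover` class and `remove_elements` function are hypothetical, only bare tag names (`nav`) and single class names (`.ad`) are supported, the input is assumed to be well-formed, and comments and doctype declarations are dropped for brevity.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementRemover(HTMLParser):
    """Drops elements matching simple selectors: a tag name or a .class name."""

    def __init__(self, selector: str):
        super().__init__(convert_charrefs=False)
        parts = [p.strip() for p in selector.split(",") if p.strip()]
        self.tags = {p for p in parts if not p.startswith(".")}
        self.classes = {p[1:] for p in parts if p.startswith(".")}
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a removed element

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag in self.tags or any(c in self.classes for c in classes)

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._matches(tag, attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if not self.skip_depth and not self._matches(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def remove_elements(html: str, selector: str) -> str:
    parser = ElementRemover(selector)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = ('<body><nav><ul><li>Home</li></ul></nav>'
        '<main><p class="lead">Hi &amp; bye</p><div class="ad big">Buy</div></main></body>')
print(remove_elements(page, "nav, .ad"))
```

Full CSS selector support (descendant combinators, attribute selectors, and so on) is considerably more involved; the sketch only shows why a comma-separated selector list removes each matching element together with everything nested inside it.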

### Example Conditioning Strategies

  • Remove navigation and ads: RemovalSelector = "nav, .sidebar, .ad"
  • Focus on main article: BodySelector = "main"
  • Normalize links: SiteUrl = "https://example.com"
  • Highlight keywords: Add a WrapRule to wrap "Important" in <mark> tags.
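
The keyword-highlighting strategy can be sketched in Python as follows. This is an approximation of what a wrap rule does, not the WrapRule implementation itself: `wrap_pattern` is a hypothetical helper, and it assumes the pattern should match only in text between tags, never inside tag names or attribute values.

```python
import re


def wrap_pattern(html: str, pattern: str, tag: str) -> str:
    """Wrap each regex match found in text (outside tags) in <tag>...</tag>."""
    matcher = re.compile(pattern)
    # Splitting on a capturing group keeps the tags: even indices hold text runs,
    # odd indices hold the tags themselves.
    pieces = re.split(r"(<[^>]+>)", html)
    for i in range(0, len(pieces), 2):
        pieces[i] = matcher.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", pieces[i])
    return "".join(pieces)


sample = '<p title="Important">Important: read this.</p>'
print(wrap_pattern(sample, r"\bImportant\b", "mark"))
```

Restricting the substitution to text runs is what keeps the `title` attribute in the sample untouched while the visible word is wrapped.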

Proper HTML conditioning improves retrieval performance and presents cleaner, more contextually rich data to the LLM, which in turn leads to more accurate and relevant responses during RAG operations. It also helps ensure that the ingested content is consistent, portable, and ready for further processing in Grooper's search and analytics pipeline.

Properties

Name | Type | Description

See Also

Notification