Grooper Help - Version 25.0
25.0.0017 2,127

Query HTML

Value Extractor Grooper.Messaging

Extracts values from an HTML document using a CSS or XPath selector.

Remarks

Overview

The Query HTML extractor enables you to extract text or attribute values from HTML documents using flexible selector-based queries. It is designed for scenarios where you need to retrieve structured or semi-structured data from HTML emails, web pages, or embedded HTML content in documents. By configuring a selector and optional filtering or parsing patterns, you can target specific elements and extract the exact values needed for your data model.

How It Works

Extraction is configured through the following properties, which correspond to the steps the extractor performs:

  • Selector Type: Choose between CSS or XPath to define how elements are matched in the HTML.
  • Selector: Enter the selector string to target elements.
    • CSS example: 'a[href^="mailto:"]'
    • XPath example: '//a[starts-with(@href, "mailto:")]'
  • Attribute Name: Optionally specify an attribute (e.g., 'href', 'src') to extract its value. If left blank, the inner text of each matching element is returned.
  • Text Nodes Only: When enabled, only the direct text children (immediate '#text' nodes) of the selected element are extracted, ignoring nested child elements.
  • Filter Pattern: Provide a regular expression to filter which matched elements are included.
    Example: '\d+' to include only values containing digits.
  • Parsing Pattern: Use a regular expression to extract specific substrings from the matched content.
    Example: 'mailto:(?<email>[^?]+)' to extract an email address from a 'mailto:' link.

The extraction process applies filtering before parsing. If both patterns are set, only elements passing the filter are parsed for output values.
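
The filter-then-parse order can be sketched in Python. This is only an illustration of the ordering described above, not Grooper's implementation: the function name `filter_then_parse` is hypothetical, and two behaviors are assumptions made for the sketch (the filter matches anywhere in the value, and every parsing match becomes its own output value).

```python
import re

def filter_then_parse(values, filter_pattern=None, parsing_pattern=None):
    """Apply an optional filter regex, then an optional parsing regex."""
    results = []
    for value in values:
        # Step 1: filtering. Values with no filter match are dropped
        # (assumption: the pattern may match anywhere in the value).
        if filter_pattern and not re.search(filter_pattern, value):
            continue
        # Step 2: parsing. Without a parsing pattern the value passes through.
        if not parsing_pattern:
            results.append(value)
            continue
        # Assumption: each match yields one output value; the first
        # capture group is used when the pattern defines one.
        for match in re.finditer(parsing_pattern, value):
            results.append(match.group(1) if match.groups() else match.group(0))
    return results

# Hypothetical matched element texts.
matched = ["Order 1042", "N/A", "Invoices 77 and 78"]
numbers = filter_then_parse(matched, filter_pattern=r"\d+", parsing_pattern=r"\d+")
print(numbers)  # ['1042', '77', '78']
```

Here "N/A" never reaches the parsing step because it fails the '\d+' filter first.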

Configuration Guidance

  • Select the appropriate selector type for your scenario. CSS selectors are often simpler for class or attribute-based queries, while XPath is more expressive for complex document structures, for example when you need to select by text content or navigate from an element to its parent.
  • Write selectors that precisely target the elements you want to extract. Test your selector using sample HTML to ensure it matches the intended nodes.
  • Use the 'Attribute Name' property to extract attribute values (such as URLs or IDs) instead of text.
  • Enable 'Text Nodes Only' when you need to exclude nested formatting or markup from the extracted value.
  • Apply a 'Filter Pattern' to exclude unwanted matches, such as elements without a required value format.
  • Use a 'Parsing Pattern' to extract a specific portion of the value, such as an email address or ID from a link.
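
The difference between inner text and 'Text Nodes Only' can be illustrated with Python's standard `xml.etree.ElementTree` module. This is a sketch of the concept only: the helper names `inner_text` and `direct_text` are hypothetical, the fragment is assumed to be well-formed, and how Grooper joins or trims the resulting text is not covered here.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment with one nested formatting element.
fragment = ET.fromstring("<p>Total: <b>42</b> items</p>")

def inner_text(element):
    """All text inside the element, including text of nested elements."""
    return "".join(element.itertext())

def direct_text(element):
    """Only text that is an immediate child of the element."""
    # ElementTree stores text before the first child in .text and text
    # that follows each child in that child's .tail.
    parts = [element.text or ""]
    for child in element:
        parts.append(child.tail or "")
    return "".join(parts)

print(repr(inner_text(fragment)))
print(repr(direct_text(fragment)))
```

The first call includes the "42" that sits inside the nested '<b>' element; the second omits it and keeps only the text directly under '<p>'.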

Example Scenarios

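Extracting email addresses from anchor tags:
With Selector Type set to CSS, Selector set to 'a[href^="mailto:"]', Attribute Name set to 'href', and Parsing Pattern set to 'mailto:(?<email>[^?]+)', the extractor returns the email address from every '<a href="mailto:...">' link in the HTML.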
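
For comparison, the same scenario can be approximated outside Grooper with Python's standard library. This is a rough sketch, not the extractor's code: the `MailtoCollector` class and the sample HTML are hypothetical, the selector 'a[href^="mailto:"]' is emulated by hand because the standard library has no CSS engine, and the named group is written in Python's '(?P<email>...)' syntax.

```python
import re
from html.parser import HTMLParser

class MailtoCollector(HTMLParser):
    """Collects href values of <a> tags whose href starts with 'mailto:'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Hand-written stand-in for the CSS selector a[href^="mailto:"].
        href = dict(attrs).get("href")
        if tag == "a" and href and href.startswith("mailto:"):
            self.hrefs.append(href)

# Hypothetical sample input.
html_doc = """
<p>Contact <a href="mailto:sales@example.com?subject=Quote">Sales</a> or
<a href="mailto:support@example.com">Support</a>.
See also <a href="https://example.com">our site</a>.</p>
"""

collector = MailtoCollector()
collector.feed(html_doc)

# Parsing step: keep the address, dropping any '?subject=...' suffix.
pattern = re.compile(r"mailto:(?P<email>[^?]+)")
emails = []
for href in collector.hrefs:
    match = pattern.match(href)
    if match:
        emails.append(match.group("email"))

print(emails)  # ['sales@example.com', 'support@example.com']
```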

Extracting table cell values with a specific class:
With Selector Type set to CSS, Selector set to 'td.value', and Attribute Name left blank, the extractor returns the inner text of every '<td class="value">' element.
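
A comparable standalone sketch in Python collects the inner text of each cell whose class list contains 'value'. The `ValueCellCollector` class and sample table are hypothetical, and trimming surrounding whitespace is an assumption of this sketch rather than documented extractor behavior.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}

class ValueCellCollector(HTMLParser):
    """Collects the inner text of <td> elements that carry class 'value'."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._depth = 0     # greater than 0 while inside a matching <td>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested element inside the matched cell
        elif tag == "td" and "value" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Closing the matched cell: emit its accumulated inner text.
            self.values.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

# Hypothetical sample input.
html_doc = """
<table>
<tr><td class="label">Name</td><td class="value">Ada <b>Lovelace</b></td></tr>
<tr><td class="label">Year</td><td class="value highlight">1843</td></tr>
</table>
"""

collector = ValueCellCollector()
collector.feed(html_doc)
print(collector.values)  # ['Ada Lovelace', '1843']
```

Text inside the nested '<b>' element is included because this mirrors the default inner-text behavior, not the 'Text Nodes Only' option.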

Notes

  • If 'Attribute Name' is blank, the extractor returns the inner text of each matching element.
  • Enabling 'Text Nodes Only' restricts extraction to direct text children, excluding nested markup.
  • Filtering occurs before parsing; only elements passing the filter are parsed for output.
  • The extractor is ideal for scraping structured data from HTML emails, web pages, or embedded HTML content.

Diagnostics

The Query HTML extractor can log diagnostic information to assist with troubleshooting and validation. When diagnostics are enabled, the following artifacts may be generated:

  • Selector Match Log: Records the selector used and the number of elements matched in the HTML document.
  • Filter and Parsing Log: Captures the results of filtering and parsing operations, including which elements were included or excluded.
  • Output Values Log: Lists the final extracted values for review and validation.

Diagnostic files can be reviewed in Grooper's diagnostic tools or exported for further analysis.

> Tip: Use diagnostics to refine your selector, filter, and parsing patterns, and to validate the structure and content of extracted values.

Properties

Name   Type   Description

See Also

Used By

Notification