How to Select Elements by Text in XPath

Jan 9, 2024 · 2 min read

XPath is a query language for navigating and selecting nodes in an XML document. In web scraping, it is often applied to HTML documents to select elements by their text content. There are two common ways to do this: a substring match with the contains function, or an exact match with the = operator.

1. Using the contains Function

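The contains function selects elements whose text includes a given substring. The match is case-sensitive, and in XPath 1.0 (the version lxml implements) contains(text(), ...) tests only the element's first text node.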

Example Code:

import requests
from lxml import etree

# Fetching the HTML content
html_content = requests.get('https://example.com').content

# Parsing the HTML content with lxml
dom = etree.HTML(html_content)

# XPath query using contains
elements_with_text = dom.xpath('//*[contains(text(), "example")]')

for element in elements_with_text:
    print(element.text)

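In this example, replace "example" with the substring you are looking for. The code fetches the HTML of https://example.com, parses it, and runs an XPath query that selects every element whose first text node contains the specified text. Because element.text holds only the text that appears before an element's first child, the loop prints that leading text rather than the element's full content.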
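The sketch below illustrates how the choice of argument to contains changes what gets matched. It parses a small inline HTML snippet instead of fetching a page; the snippet, the id values, and the ids helper are made up for illustration and are not part of the original example. It restricts the query to p elements, since contains(., ...) on //* would also match ancestors such as html and body.

```python
from lxml import etree

# Hypothetical sample markup, used only to make the differences visible.
html = """
<html><body>
  <p id="a">An example paragraph</p>
  <p id="b">Intro <b>bold</b> then an example</p>
  <p id="c">Nothing relevant here</p>
  <p id="d">See <a>this example</a></p>
</body></html>
"""
dom = etree.HTML(html)


def ids(nodes):
    """Return the id attribute of each matched element."""
    return [node.get("id") for node in nodes]


# contains(text(), ...) converts the node-set to a string,
# which in XPath 1.0 means only the FIRST text node is tested.
first_text_node = ids(dom.xpath('//p[contains(text(), "example")]'))

# Test every direct text node of the element individually.
any_text_node = ids(dom.xpath('//p[text()[contains(., "example")]]'))

# Test the element's full string value, including descendant text.
string_value = ids(dom.xpath('//p[contains(., "example")]'))

print(first_text_node)
print(any_text_node)
print(string_value)
```

Which form fits depends on the page: the first is the strictest, while the last also matches text that sits inside nested tags such as links or bold spans.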

2. Using Exact Match

An exact match selects elements that have a text node equal to the specified string, with nothing before or after it. The comparison is case-sensitive and includes any surrounding whitespace.

Example Code:

import requests
from lxml import etree

# Fetching the HTML content
html_content = requests.get('https://example.com').content

# Parsing the HTML content with lxml
dom = etree.HTML(html_content)

# XPath query using exact match
exact_elements = dom.xpath('//*[text() = "Example Domain"]')

for element in exact_elements:
    print(element.text)

In this example, replace "Example Domain" with the exact string you want to match. Like the previous example, it fetches and parses the HTML content, but the XPath query selects only elements that have a text node exactly equal to the specified text.
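Real pages often wrap text in newlines and indentation, which makes a strict text() = "..." comparison fail. One possible workaround, sketched here on made-up inline HTML (the markup and id values are assumptions for illustration), is to compare against normalize-space(.), a standard XPath 1.0 function that trims leading and trailing whitespace and collapses internal runs of whitespace.

```python
from lxml import etree

# Hypothetical sample markup, used only to show the effect of whitespace.
html = """
<html><body>
  <h1 id="plain">Example Domain</h1>
  <h1 id="padded">
    Example Domain
  </h1>
  <h1 id="longer">Example Domain Names</h1>
</body></html>
"""
dom = etree.HTML(html)

# Strict comparison: the text node must equal the string exactly.
strict = [
    node.get("id")
    for node in dom.xpath('//h1[text() = "Example Domain"]')
]

# Whitespace-tolerant comparison on the element's full string value.
normalized = [
    node.get("id")
    for node in dom.xpath('//h1[normalize-space(.) = "Example Domain"]')
]

print(strict)
print(normalized)
```

Neither query matches the longer heading, so both remain exact matches; the normalized form only ignores padding around the text.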
