Find the text of the given tag using BeautifulSoup

Oct 6, 2023 · 2 min read

The get_text() method in the Python BeautifulSoup library is very useful for extracting text from HTML and XML documents. However, there are some nuances to how it works that are good to understand when using it for web scraping or text extraction.

What get_text() Does

The get_text() method returns all the text from a document or tag, stripping out any HTML tags or markup. For example:

from bs4 import BeautifulSoup

html = '<p>This is a <b>paragraph</b> with <a href="#">a link</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())

# Outputs: This is a paragraph with a link.

So it extracts just the raw text content.

Stripping Whitespace

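By default, get_text() returns text exactly as it appears in the document, including leading and trailing spaces, newlines, and tabs. It does not strip or collapse whitespace unless you ask it to.

You can use the strip argument to control whitespace handling. The default is strip=False. Setting strip=True strips leading and trailing whitespace from each piece of text and skips any piece that is empty afterwards. It does not collapse runs of whitespace inside a string; if you need that, normalize the result yourself, for example with ' '.join(text.split()).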
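To make the difference concrete, here is a minimal sketch comparing the default output, strip=True, and a manual whitespace collapse. The sample HTML and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative input with leading, trailing, and internal whitespace.
html = "<p>  Hello,\n   big   world  </p>"
soup = BeautifulSoup(html, "html.parser")

# Default: whitespace is returned exactly as it appears in the markup.
raw = soup.get_text()

# strip=True trims each string's ends but leaves internal whitespace alone.
stripped = soup.get_text(strip=True)

# Collapsing internal runs of whitespace is a separate, manual step.
collapsed = " ".join(soup.get_text().split())

print(repr(raw))        # '  Hello,\n   big   world  '
print(repr(stripped))   # 'Hello,\n   big   world'
print(repr(collapsed))  # 'Hello, big world'
```

The split-and-join step is plain Python string handling rather than a BeautifulSoup feature, so it works the same on any extracted text.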

Handling Nested Tags

get_text() recursively extracts text from child tags by default. For example:

html = '<div><p>Paragraph 1</p><p>Paragraph 2</p></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())

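# Outputs: Paragraph 1Paragraph 2

The text of both <p> tags is extracted even though they are nested within the <div>. The strings are joined with no separator, so the two paragraphs run together. Pass a separator, as in soup.get_text(separator='\n'), to put each one on its own line.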
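As a small sketch of the options for nested tags, the snippet below compares the default concatenation, a newline separator, and extracting each paragraph on its own. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<div><p>Paragraph 1</p><p>Paragraph 2</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default: the strings of all descendants are concatenated directly.
joined = soup.get_text()

# A separator is inserted between adjacent strings.
lines = soup.get_text(separator="\n")

# Alternatively, extract each <p> separately to keep them as distinct items.
per_tag = [p.get_text() for p in soup.find_all("p")]

print(repr(joined))  # 'Paragraph 1Paragraph 2'
print(repr(lines))   # 'Paragraph 1\nParagraph 2'
print(per_tag)       # ['Paragraph 1', 'Paragraph 2']
```

Extracting per tag tends to be the safer choice when the paragraph boundaries matter downstream, since a separator is also inserted between inline fragments such as bold text or links.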

Invisible Text

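BeautifulSoup does not know what a browser would actually display. Text hidden with CSS (for example display: none) is returned like any other text. Whether the contents of <script> and <style> tags show up in the output depends on the BeautifulSoup version: recent releases leave them out when you call get_text() on a parent tag, while older ones include them. The strip argument has no effect on this; it only controls whitespace. For consistent results, remove the tags you don't want from the tree before extracting text.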
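One common way to get predictable output regardless of version is to delete script and style elements before calling get_text(). This is a minimal sketch with a made-up sample page; it does not detect text hidden by CSS rules.

```python
from bs4 import BeautifulSoup

# Illustrative page containing a stylesheet and an inline script.
html = """
<html><head><style>p { color: red; }</style></head>
<body><p>Visible text</p><script>console.log("hi");</script></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Remove <script> and <style> elements, together with their contents.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# strip=True drops the whitespace-only strings left between tags.
text = soup.get_text(strip=True)

print(text)  # Visible text
```

Note that decompose() modifies the parsed tree in place, so parse the document again if you still need the removed elements later.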

Multiple vs First Text Nodes

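Calling get_text() on a tag gets all the text within it, including text inside nested tags. The .text attribute is not a shortcut for the first text node; it is a property that returns the same result as get_text(). To get a single text node, use .string, which returns the text only when the tag contains a single string and returns None otherwise, or take the first item from the .strings generator.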
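The following sketch reuses the sample paragraph from earlier to show how .text, .string, and .strings differ. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<p>This is a <b>paragraph</b> with <a href="#">a link</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .text is a property that returns the same value as get_text().
same = p.text == p.get_text()

# .string works when a tag contains exactly one string...
bold = p.b.string

# ...and is None when the tag has several children, as <p> does here.
whole = p.string

# .strings yields each text node in document order; take the first one.
first = next(p.strings)

print(same)         # True
print(bold)         # paragraph
print(whole)        # None
print(repr(first))  # 'This is a '
```

Because .string can be None, check for that before calling string methods on the result.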

Conclusion

While get_text() is generally straightforward, properly handling whitespace, nesting, and invisible text takes some care. Read the documentation closely when extracting text from complex documents.
