Find the text of the given tag using BeautifulSoup

The get_text() method in the Python BeautifulSoup library is very useful for extracting text from HTML and XML documents. However, there are some nuances to how it works that are good to understand when using it for web scraping or text extraction.

What get_text() Does

The get_text() method returns all the text from a document or tag, stripping out any HTML tags or markup. For example:

from bs4 import BeautifulSoup

html = '<p>This is a <b>paragraph</b> with <a href="#">a link</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.get_text())

# Outputs: This is a paragraph with a link.

So it extracts just the raw text content.

Stripping Whitespace

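By default, get_text() does not touch whitespace: every space, newline, and tab inside the text is returned exactly as it appears in the markup, and the pieces of text are joined with no separator between them.

Use the strip argument to control this. Passing strip=True trims leading and trailing whitespace from each piece of text and drops pieces that contain only whitespace; strip=False, the default, keeps everything. Whitespace inside a piece of text is not collapsed either way. The separator argument sets the string placed between pieces.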
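As a rough illustration of the difference, the sketch below runs the same markup through the default call, through strip=True with a space separator, and through a plain-Python split/join that collapses runs of whitespace. The sample HTML and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup with extra spaces and a newline (invented for illustration)
html = "<p>  Hello   there <b>big</b>\n  world  </p>"
soup = BeautifulSoup(html, "html.parser")

# Default: whitespace is returned exactly as written in the markup
raw = soup.get_text()

# strip=True trims each piece of text; separator is placed between pieces.
# The three spaces inside "Hello   there" are kept.
stripped = soup.get_text(separator=" ", strip=True)

# Collapsing every run of whitespace is a separate step done in plain Python
collapsed = " ".join(soup.get_text().split())

print(repr(raw))        # '  Hello   there big\n  world  '
print(repr(stripped))   # 'Hello   there big world'
print(repr(collapsed))  # 'Hello there big world'
```

The split/join step is not part of Beautiful Soup; it is just one common way to normalize whitespace after extraction.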

Handling Nested Tags

get_text() recursively extracts text from child tags by default. For example:

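from bs4 import BeautifulSoup

html = '<div><p>Paragraph 1</p><p>Paragraph 2</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# With no separator, the text of adjacent tags runs together
print(soup.get_text())

# Outputs: Paragraph 1Paragraph 2

# Pass a separator to put each paragraph on its own line
print(soup.get_text(separator='\n'))

# Outputs:
# Paragraph 1
# Paragraph 2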

The text of both <p> tags is extracted even though they are nested within the <div>.
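When only the text of one particular tag is wanted, get_text() can be called on that tag instead of on the whole document. The sketch below reuses the markup from the nested example; the variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><p>Paragraph 1</p><p>Paragraph 2</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Text of the first <p> only
first_p = soup.find("p").get_text()

# Text of each <p>, kept as separate list items
paragraphs = [p.get_text() for p in soup.find_all("p")]

# stripped_strings yields each piece of text under the <div>, whitespace-trimmed
pieces = list(soup.div.stripped_strings)

print(first_p)      # Paragraph 1
print(paragraphs)   # ['Paragraph 1', 'Paragraph 2']
print(pieces)       # ['Paragraph 1', 'Paragraph 2']
```

Keeping the pieces in a list avoids having to pick a separator and split the result again later.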

Invisible Text

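Beautiful Soup does not render the page, so it cannot tell visible text from hidden text; elements hidden with CSS are returned like any other. Recent versions of Beautiful Soup do skip comments and the contents of <script> and <style> tags when get_text() is called on a parent tag or the whole document, but older versions may include script and style contents. The strip argument does not affect any of this; it only controls whitespace.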
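One way to keep script and style contents out of the result regardless of the installed version is to remove those tags from the tree before extracting text. This is a minimal sketch with invented sample markup; note that the paragraph hidden with CSS is still returned, because Beautiful Soup does not evaluate styles.

```python
from bs4 import BeautifulSoup

# Sample page (invented for illustration)
html = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Visible text</p>"
    "<script>console.log('not page text');</script>"
    '<p style="display:none">Hidden by CSS</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Remove every <script> and <style> element, contents included
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)  # Visible text Hidden by CSS
```

Filtering out CSS-hidden elements would need extra logic, such as checking style attributes or class names, and is specific to each page.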

Multiple vs First Text Nodes

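Calling get_text() on a tag returns all the text within it, including text from nested tags. The .text attribute is a shortcut for get_text() called with no arguments, so it returns the same result. To get only the first text node, use find(string=True), or read .string when the tag contains a single string.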
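The following sketch compares these accessors on one paragraph that mixes plain text with a nested tag. The markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = "<p>First part <b>bold part</b> last part</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# get_text() and .text both return all text under the tag
all_text = p.get_text()
same_text = p.text

# find(string=True) returns the first text node only
first_node = p.find(string=True)

# .string is None here because <p> has more than one child
only_string = p.string

# <b> contains a single string, so .string returns it
bold_string = p.b.string

print(repr(all_text))     # 'First part bold part last part'
print(repr(first_node))   # 'First part '
print(only_string)        # None
print(repr(bold_string))  # 'bold part'
```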

Conclusion

While get_text() is generally straightforward, properly handling whitespace, nesting, and invisible text takes some care. Read the documentation closely when extracting text from complex documents.
