Parsing HTML Tables with BeautifulSoup

Oct 6, 2023 · 2 min read

BeautifulSoup is a useful library for extracting data from HTML tables in Python. With a few simple lines of code, you can parse an HTML table and convert it into a pandas DataFrame for further analysis.

Parsing the Table

To parse an HTML table with BeautifulSoup, first load the HTML document and find the <table> tag.

You can then loop through each <tr> row and <td> cell, appending the data to lists:

from bs4 import BeautifulSoup
import requests

url = 'https://example.com/table'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find('table')

rows = []
for row in table.find_all('tr'):
    cells = [val.get_text(strip=True) for val in row.find_all('td')]
    if cells:  # skip header rows, which contain only <th> cells
        rows.append(cells)

This gives you a list of lists containing each cell's text.
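
As a minimal sketch of the loop above, the following parses a small inline table instead of a live page, so it runs offline. The sample HTML and the table_to_rows helper are illustrative assumptions, not part of the original example. It reads header cells (<th>) separately from data cells (<td>), so a header row does not end up in the data as an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here so the example runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th><th>Job</th></tr>
  <tr><td>Alice</td><td>30</td><td>Engineer</td></tr>
  <tr><td>Bob</td><td>25</td><td>Designer</td></tr>
</table>
"""


def table_to_rows(html):
    """Return (headers, rows) for the first table found in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # No table in the document: return empty results instead of failing.
        return [], []

    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that have no <td> cells, such as the header row
            rows.append(cells)
    return headers, rows


headers, rows = table_to_rows(SAMPLE_HTML)
print(headers)
print(rows)
```

Using get_text(strip=True) rather than .text trims the whitespace that often surrounds cell contents in real pages.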

Converting to DataFrame

To convert to a pandas DataFrame, pass the list of rows along with column names:

import pandas as pd

df = pd.DataFrame(rows, columns=['Name', 'Age', 'Job'])
print(df)

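The resulting DataFrame has one row per entry in rows and one column per name in the columns list, so the number of names has to match the number of cells in the table's rows.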
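
One detail worth sketching: text scraped from cells arrives as strings, so numeric columns usually need an explicit conversion before any arithmetic. The rows below are made-up sample data shaped like the output of the parsing loop, and the column names are the ones assumed in the example above.

```python
import pandas as pd

# Hypothetical rows, shaped like the output of the parsing loop.
rows = [
    ["Alice", "30", "Engineer"],
    ["Bob", "25", "Designer"],
]

# One column name per cell in each row.
df = pd.DataFrame(rows, columns=["Name", "Age", "Job"])

# Scraped values are strings; convert the numeric column explicitly.
# errors="coerce" turns unparseable cells into NaN instead of raising.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

print(df.dtypes)
print(df["Age"].mean())
```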

Extracting Attributes

You can also extract attributes, such as the href of a link, from table cells:

rows = []
for row in table.find_all('tr'):
    cells = []
    for cell in row.find_all('td'):
        link = cell.find('a')  # None if the cell has no <a> tag
        cells.append(link.get('href') if link else None)
    rows.append(cells)
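
Links in scraped tables are often relative, and some cells contain no link at all. The sketch below handles both cases. The base URL and sample markup are assumptions made for illustration, and urljoin from the standard library resolves each href against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, for illustration only.
BASE_URL = "https://example.com/table"
SAMPLE_HTML = """
<table>
  <tr><td><a href="/people/alice">Alice</a></td><td>30</td></tr>
  <tr><td><a href="https://example.org/bob">Bob</a></td><td>25</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table")

links = []
for tr in table.find_all("tr"):
    row_links = []
    for td in tr.find_all("td"):
        # href=True matches only anchors that carry an href attribute;
        # find() returns None when the cell has no such anchor.
        anchor = td.find("a", href=True)
        row_links.append(urljoin(BASE_URL, anchor["href"]) if anchor else None)
    links.append(row_links)

print(links)
```

Cells without a link are recorded as None, which keeps every row the same length as the table row it came from.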

Parsing HTML Strings

To extract a table from an HTML string rather than a fetched page, parse the string first:

html = "<table>...</table>"
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

Then continue parsing as normal.

In summary, BeautifulSoup reduces table extraction to a find() call and a loop over rows and cells. Loading the result into pandas then lets you filter, aggregate, and analyze the scraped data.
