Retrieving and Parsing Text from URLs with Python's urllib

Feb 8, 2024 · 2 min read

The urllib module in Python provides useful tools for retrieving and parsing content from URLs. Because it ships with the standard library, no third-party installation is required.

Fetching Text Content

To fetch text content from a URL, you can use urllib.request.urlopen():

import urllib.request

with urllib.request.urlopen('http://example.com') as response:
    html = response.read()

This opens the URL, downloads the response content as bytes, and stores it in the html variable.
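Because read() returns bytes, you will usually want to decode them to a string before processing. A minimal sketch, using a bytes literal to stand in for downloaded content (with a real response object you could ask for the server's declared encoding via response.headers.get_content_charset()):

```python
# A bytes literal standing in for the result of response.read()
raw = '<p>café</p>'.encode('utf-8')

# With a live response, prefer the declared charset:
#   charset = response.headers.get_content_charset() or 'utf-8'
charset = 'utf-8'

text = raw.decode(charset)  # bytes -> str
print(text)  # <p>café</p>
```

Falling back to UTF-8 is a common default when the server does not declare a charset.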

You can also read line by line by treating the response as a file object:

with urllib.request.urlopen('http://example.com') as response:
    for line in response:
        print(line)

Parsing Text

Once you have retrieved the text content, you may want to parse it to extract relevant information.

For example, to parse HTML you can use a parser like Beautiful Soup. To parse JSON, you can use the built-in json module.

Here's an example parsing JSON from a URL:

import json
import urllib.request

with urllib.request.urlopen("http://api.example.com") as response:
    data = json.loads(response.read().decode())
    print(data["key"])

This fetches the JSON data, decodes the bytes to text, parses it to a Python dict with json.loads(), and accesses a key's value.

Handling Errors

Make sure to wrap calls to urlopen() in try/except blocks to handle errors gracefully:

import urllib.error
import urllib.request

try:
    with urllib.request.urlopen('http://example.com') as response:
        html = response.read()
except urllib.error.URLError as e:
    print(f"URL Error: {e.reason}")

This lets you catch common failures such as connection problems, HTTP error statuses, and redirect loops.
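Since HTTPError is a subclass of URLError, you can catch it first to handle HTTP status errors (404, 500, and so on) separately from connection-level failures. A sketch wrapping this in a hypothetical fetch() helper:

```python
import urllib.error
import urllib.request

def fetch(url):
    """Return the page text, or None on failure (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as e:   # server responded with an error status
        print(f"HTTP Error: {e.code} {e.reason}")
    except urllib.error.URLError as e:    # DNS failure, refused connection, etc.
        print(f"URL Error: {e.reason}")
    return None

# The reserved .invalid TLD never resolves, so this takes the URLError path
result = fetch('http://nonexistent.invalid/')
```

Catching HTTPError before URLError matters: the reverse order would swallow HTTP status errors in the generic handler.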

Overall, urllib offers a straightforward way to programmatically access text content from the web in Python without needing third-party libraries.
