Getting Data out of URLs in 5 Easy Steps in Python

Feb 20, 2024 · 3 min read

URLs may seem like simple strings of text, but they actually contain a wealth of structured data. Being able to efficiently extract parts of a URL is an invaluable skill for any developer working with web technologies. In this guide, I'll walk you through 5 simple steps to extract hostnames, paths, query parameters, and more from URLs in your code.

1. Parse the URL into components

Most programming languages provide built-in libraries for parsing URLs. For example, in Python:

from urllib.parse import urlparse

url = 'https://www.example.com/path/to/page?foo=bar&baz=1'
parsed = urlparse(url)

This breaks the URL down into distinct components that we can access:

  • parsed.scheme - the protocol (https)
  • parsed.hostname - the domain name (www.example.com)
  • parsed.path - the path (/path/to/page)
  • parsed.query - the query string (foo=bar&baz=1, without the leading ?)

So already, with just the standard library, we can easily extract the key parts of a URL.
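
One nuance worth knowing: parsed.hostname is normalized to lowercase and drops any port number, while parsed.netloc keeps both. A quick sketch using a hypothetical URL that includes a port:

from urllib.parse import urlparse

# Hypothetical URL with mixed case and an explicit port
parsed = urlparse('https://WWW.Example.com:8080/path/to/page?foo=bar')

print(parsed.netloc)    # 'WWW.Example.com:8080' - host and port as written
print(parsed.hostname)  # 'www.example.com'      - lowercased, port stripped
print(parsed.port)      # 8080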

2. Get the query parameters

To get at the data in the query string, we use the parse_qs function:

from urllib.parse import parse_qs

query = parse_qs(parsed.query)
print(query['foo'][0]) # 'bar'

parse_qs gives us a dictionary where each value is a list, since duplicate keys are allowed in query strings.
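
To make the list-valued dictionary concrete, here is a small illustration with a hypothetical query string that repeats a key (parse_qsl is the flat, list-of-tuples variant):

from urllib.parse import parse_qs, parse_qsl

# Hypothetical query string with a repeated 'tag' key
qs = 'tag=python&tag=web&page=2'

print(parse_qs(qs))   # {'tag': ['python', 'web'], 'page': ['2']}
print(parse_qsl(qs))  # [('tag', 'python'), ('tag', 'web'), ('page', '2')]

# .get() with a default list avoids a KeyError for missing parameters
page = parse_qs(qs).get('page', ['1'])[0]
print(page)           # '2'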

3. Validate the hostname

Often you'll want to validate that a URL is intended for your site or API. To get the hostname:

print(parsed.hostname) # 'www.example.com'

And compare against a list of allowed hosts:

ALLOWED_HOSTS = ['www.example.com', 'example.com']

if parsed.hostname not in ALLOWED_HOSTS:
    raise ValueError('Invalid hostname')
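
The membership check above requires an exact match. If you also want to accept subdomains, something along these lines could work; treat it as a sketch rather than a complete allow-list implementation:

def is_allowed_host(hostname, allowed_hosts):
    # hostname is None for relative URLs such as '/path/only'
    if hostname is None:
        return False
    # Accept an exact match or any subdomain of an allowed host
    return any(
        hostname == allowed or hostname.endswith('.' + allowed)
        for allowed in allowed_hosts
    )

print(is_allowed_host('api.example.com', ALLOWED_HOSTS))   # True
print(is_allowed_host('evil-example.com', ALLOWED_HOSTS))  # False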

4. Extract parts of the path

Paths can contain useful slugs and IDs. To extract a section:

path = parsed.path # '/path/to/page'

# split('/') returns ['', 'path', 'to', 'page'] - the leading '/' produces an empty first element
parts = path.split('/')
print(parts[-1]) # 'page'

You can also use posixpath.basename and posixpath.dirname (the same interface as os.path, but always '/'-separated, which suits URL paths) to get the last segment of a path or everything before it.
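
A quick illustration using the path from the example above:

import posixpath

path = '/path/to/page'

print(posixpath.basename(path))  # 'page'     - the last segment
print(posixpath.dirname(path))   # '/path/to' - everything before it

# Stripping the leading '/' before splitting avoids the empty first element
print(path.strip('/').split('/'))  # ['path', 'to', 'page']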

5. Reconstruct the URL

Once you've extracted the data you need, reconstruct the URL programmatically:

from urllib.parse import urlunparse

# new_path is whatever modified path you want to write back (hypothetical example value)
new_path = '/path/to/new-page'

url = urlunparse((
    parsed.scheme,
    parsed.netloc,  # use netloc rather than hostname so any port survives the round trip
    new_path,
    parsed.params,
    parsed.query,
    parsed.fragment
))

Putting the pieces back together makes it easy to modify URLs programmatically.
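
As a related trick (not covered in the steps above), the object returned by urlparse is a namedtuple, so its _replace method plus geturl() gives a compact way to swap out a single component. A sketch that rewrites just the query string with urlencode:

from urllib.parse import urlparse, urlencode

parsed = urlparse('https://www.example.com/path/to/page?foo=bar&baz=1')

# Build a fresh query string and swap it into the parsed result
new_query = urlencode({'foo': 'bar', 'page': 2})
updated = parsed._replace(query=new_query).geturl()

print(updated)  # 'https://www.example.com/path/to/page?foo=bar&page=2'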

Key Takeaways

  • Use the standard library URL parsing methods to break down and rebuild URLs
  • Extract query parameters into a dictionary
  • Validate hostnames against allowed domains
  • Get path components by splitting on '/'
  • Reconstruct URLs after modifying their parts

With these basic tools, you can efficiently extract all kinds of data from URLs in your code to power your web scraping, APIs, redirects, and more!
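
Putting the five steps together, here is a minimal end-to-end sketch; the URL and ALLOWED_HOSTS values are the same placeholders used throughout the article:

from urllib.parse import urlparse, parse_qs, urlunparse

ALLOWED_HOSTS = ['www.example.com', 'example.com']

def summarize(url):
    parsed = urlparse(url)                         # 1. parse into components
    if parsed.hostname not in ALLOWED_HOSTS:       # 3. validate the hostname
        raise ValueError('Invalid hostname')
    params = parse_qs(parsed.query)                # 2. query parameters as a dict
    slug = parsed.path.rstrip('/').split('/')[-1]  # 4. last path segment
    clean = urlunparse((parsed.scheme, parsed.netloc, parsed.path, '', '', ''))  # 5. rebuild without the query
    return slug, params, clean

print(summarize('https://www.example.com/path/to/page?foo=bar&baz=1'))
# ('page', {'foo': ['bar'], 'baz': ['1']}, 'https://www.example.com/path/to/page')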
