Web Scraping Property Listings from Booking.com with Python in 2023

Oct 15, 2023 · 9 min read

In this article, we will learn how to scrape property listings from Booking.com using Python. We will use the requests and Beautiful Soup libraries to fetch the HTML content and then extract key information like property name, location, ratings, etc.

Prerequisites

To follow along, you will need:

  • Python 3 installed on your system
  • Basic knowledge of Python programming
  • pip installed to install Python packages

    Installing Dependencies

    We will use two Python packages for scraping: requests and Beautiful Soup (installed as beautifulsoup4).

    To install them:

    pip install requests beautifulsoup4
    

    This will download and install the latest versions of these libraries.

    Importing Modules

    At the top of your script, import the necessary modules:

    import requests
    from bs4 import BeautifulSoup
    

    requests will be used to send HTTP requests to fetch webpages.

    BeautifulSoup will help parse and extract information from the HTML content.

    Defining the Target URL

    We will scrape property listings from this URL on Booking.com:

    url = "https://www.booking.com/searchresults.en-gb.html?ss=Manhattan%2C+New+York%2C+New+York+State%2C+United+States&map=1&efdco=1&label=gen173nr-1BCAEoggI46AdIM1gEaGyIAQGYAQm4AQfIAQzYAQHoAQGIAgGoAgO4AoH47qgGwAIB0gIkNTU1NTYzMjUtMzFhMS00NzZhLTg5YTQtZGFhOTNiZDRlMDI32AIF4AIB&sid=10b122fa3c8cf51d65c8a9e1ceb2fbcf&aid=304142&lang=en-gb&sb=1&src_elem=sb&src=index&dest_id=929&dest_type=district&ac_position=1&ac_click_type=b&ac_langcode=en&ac_suggestion_list_length=5&search_selected=true&search_pageview_id=7ae531418d840184&ac_meta=GhA3YWU1MzE0MThkODQwMTg0IAEoATICZW46Bm5ldyB5b0AASgBQAA%3D%3D&checkin=2023-11-08&checkout=2023-11-11&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure#map_closed"
    

    This URL fetches property listings for Manhattan, New York for given check-in and check-out dates.

    You can modify parameters such as the location, dates, and number of guests to fetch the listings you need.
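Rather than hand-editing that long string, you can build the search URL from a dictionary of query parameters. Here is a minimal sketch using the standard library's urllib.parse.urlencode; the parameter names (ss, checkin, checkout, group_adults, no_rooms, group_children) are taken from the URL above, and the many tracking parameters (label, sid, aid, and so on) are simply dropped:

```python
from urllib.parse import urlencode

# Build a Booking.com search URL from query parameters.
# Parameter names are taken from the search URL above; tracking
# parameters (label, sid, aid, ...) are omitted here.
base = "https://www.booking.com/searchresults.en-gb.html"
params = {
    "ss": "Manhattan, New York",  # search string (location)
    "checkin": "2023-11-08",
    "checkout": "2023-11-11",
    "group_adults": 2,
    "no_rooms": 1,
    "group_children": 0,
}
url = f"{base}?{urlencode(params)}"
print(url)
```

urlencode takes care of percent-encoding the values, so spaces and commas in the location string are handled for you.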

    Setting User-Agent Header

    Many websites block requests that lack a valid User-Agent string, so we set one to mimic a real browser:

    headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    }
    

    Replace this with any modern browser's user agent string.

    Fetching the HTML Page

    With the URL and headers defined, we can use the requests.get() method to fetch the HTML content:

    response = requests.get(url, headers=headers)
    

    This sends a GET request to the URL and stores the response in the response variable.

    We should check that the request was successful before proceeding:

    if response.status_code == 200:
      # Success! Parse the HTML here
      pass
    else:
      print("Failed to fetch webpage:", response.status_code)
    

    A status code of 200 means the request was successful. Any other code means an error occurred.
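An alternative to checking status_code by hand is requests' built-in raise_for_status(), which raises a requests.HTTPError for any 4xx or 5xx response. A small fetch helper (my own wrapper, not part of requests) might look like:

```python
import requests

def fetch_html(url, headers=None):
    """Fetch a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

The timeout argument is worth adding in any case, so a stalled connection doesn't hang the script forever.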

    Parsing the HTML

    With the HTML content fetched, we can use Beautiful Soup to parse and extract information from it:

    soup = BeautifulSoup(response.text, "html.parser")
    

    This parses the response.text using the HTML parser.

    We now have a BeautifulSoup object that represents the document structure.

    Extracting Property Cards

    The property listings on Booking.com are contained in div elements with a data-testid attribute value of property-card.

    We can use Beautiful Soup's find_all() method to find all such elements:

    property_cards = soup.find_all("div", {"data-testid": "property-card"})
    

    This gives us a list of all property cards on the page.

    Looping Through Property Cards

    With the list of cards, we can loop through each one to extract information:

    for index, card in enumerate(property_cards):
      # Extract information from each card here
      pass
    

    Inside this loop, we will locate specific elements and extract their text. The enumerate() call also gives each card a running index, which we will use below to name the saved HTML files.

    Saving the Raw HTML

    It's useful to save the raw HTML of each card before we start extracting text. This serves as a backup if we ever need to re-extract information.

    filename = f"property_card_{index}.html"
    card_html = str(card)
    
    with open(filename, "w", encoding="utf-8") as file:
      file.write(card_html)
    

    This converts the BeautifulSoup card object to a string and writes it to a file.

    The card's index (from for index, card in enumerate(property_cards)) gives each card its own filename, so later cards don't overwrite earlier ones.

    Extracting Property Name

    To extract the property name, we find the div with data-testid set to title:

    title_element = card.find("div", {"data-testid": "title"})
    
    if title_element is not None:
      property_name = title_element.text.strip()
    else:
      property_name = "Title not found"
    

    We first try to find the element. If found, we take its .text property and call .strip() to remove extra whitespace.

    If the title element doesn't exist, we set a default value.

    Extracting Location

    The location string is stored in a span with data-testid set to address:

    location_element = card.find("span", {"data-testid": "address"})
    property_location = location_element.text.strip() if location_element else "Location not found"
    

    As with the title, we extract the text if the element is present and fall back to a default if it isn't.

    Extracting Star Rating

    The star rating element has a class name we can search for. Note that class names like e4755bbd60 are auto-generated and change when Booking.com updates its frontend, so verify them in your browser's developer tools first:

    star_rating_div = card.find("div", {"class": "e4755bbd60"})
    
    if star_rating_div:
      star_rating = star_rating_div["aria-label"]
    else:
      star_rating = "Star rating not found"
    

    If the div is found, we extract the aria-label attribute which contains the rating value.
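The aria-label is a human-readable string (something like "4 out of 5"; the exact wording is an assumption, so check a live page). If you want the rating as a number, you can pull out the first numeric token:

```python
import re

def parse_rating(label):
    """Extract the first number from a label such as '4 out of 5'.
    The label format is assumed; inspect the live page to confirm it."""
    match = re.search(r"\d+(?:\.\d+)?", label)
    return float(match.group()) if match else None

print(parse_rating("4 out of 5"))  # 4.0
```

Returning None when no number is found lets the caller decide how to handle missing ratings.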

    Extracting Review Count

    The review count is stored in a div with class abf093bdfe:

    review_count_div = card.find("div", {"class": "abf093bdfe"})
    
    if review_count_div:
      review_count = review_count_div.text.strip()
    else:
      review_count = "Review count not available"
    

    Again, we try to extract the text if the element is found, else set a default value.

    Extracting Description

    The property description is contained in a div with class d7449d770c:

    description_div = card.find("div", {"class": "d7449d770c"})
    description = description_div.text.strip() if description_div else "Description not available"
    

    The same pattern as before: take the text if the element exists, otherwise fall back to a default.

    Printing the Extracted Data

    Finally, we can print the extracted information from each card:

    print("Name:", property_name)
    print("Location:", property_location)
    print("Rating:", star_rating)
    print("Review Count:", review_count)
    print("Description:", description)
    print("-" * 40) # separator
    

    This will nicely print out the key details for each property listing.

    You can also store this data in a dictionary or other data structure instead of printing.
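For example, you could append each card's fields to a list of dictionaries and dump the whole thing to JSON once the loop finishes. A sketch (the variable names in the comment match those extracted above; the record actually appended here is placeholder data):

```python
import json

listings = []

# Inside the scraping loop, replace the print() calls with:
# listings.append({
#     "name": property_name,
#     "location": property_location,
#     "rating": star_rating,
#     "review_count": review_count,
#     "description": description,
# })

# Placeholder record so this sketch runs on its own:
listings.append({"name": "Example Hotel", "location": "Manhattan, New York"})

# After the loop, write everything out in one go:
with open("listings.json", "w", encoding="utf-8") as f:
    json.dump(listings, f, indent=2, ensure_ascii=False)
```

From there it's a short step to loading the data into pandas or a database.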

    Full Code

    For your reference, here is the full web scraping script:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.booking.com/searchresults.en-gb.html?ss=Manhattan%2C+New+York%2C+New+York+State%2C+United+States&map=1&efdco=1&label=gen173nr-1BCAEoggI46AdIM1gEaGyIAQGYAQm4AQfIAQzYAQHoAQGIAgGoAgO4AoH47qgGwAIB0gIkNTU1NTYzMjUtMzFhMS00NzZhLTg5YTQtZGFhOTNiZDRlMDI32AIF4AIB&sid=10b122fa3c8cf51d65c8a9e1ceb2fbcf&aid=304142&lang=en-gb&sb=1&src_elem=sb&src=index&dest_id=929&dest_type=district&ac_position=1&ac_click_type=b&ac_langcode=en&ac_suggestion_list_length=5&search_selected=true&search_pageview_id=7ae531418d840184&ac_meta=GhA3YWU1MzE0MThkODQwMTg0IAEoATICZW46Bm5ldyB5b0AASgBQAA%3D%3D&checkin=2023-11-08&checkout=2023-11-11&group_adults=2&no_rooms=1&group_children=0&sb_travel_purpose=leisure#map_closed"
    
    headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    }
    
    response = requests.get(url, headers=headers)
    
    if response.status_code == 200:
    
      soup = BeautifulSoup(response.text, "html.parser")
    
      property_cards = soup.find_all("div", {"data-testid": "property-card"})
    
      for index, card in enumerate(property_cards):
    
        filename = f"property_card_{index}.html"
        card_html = str(card)
    
        with open(filename, "w", encoding="utf-8") as file:
          file.write(card_html)
    
        title_element = card.find("div", {"data-testid": "title"})
        if title_element is not None:
          property_name = title_element.text.strip()
        else:
          property_name = "Title not found"
    
        location_element = card.find("span", {"data-testid": "address"})
        property_location = location_element.text.strip() if location_element else "Location not found"
    
        star_rating_div = card.find("div", {"class": "e4755bbd60"})
        if star_rating_div:
          star_rating = star_rating_div["aria-label"]
        else:
          star_rating = "Star rating not found"
    
        review_count_div = card.find("div", {"class": "abf093bdfe"})
        if review_count_div:
          review_count = review_count_div.text.strip()
        else:
          review_count = "Review count not available"
    
        description_div = card.find("div", {"class": "d7449d770c"})
        description = description_div.text.strip() if description_div else "Description not available"
    
        print("Name:", property_name)
        print("Location:", property_location)
        print("Rating:", star_rating)
        print("Review Count:", review_count)
        print("Description:", description)
        print("-" * 40)
    
    else:
      print("Failed to fetch webpage:", response.status_code)
    

    And that's it! We've written a complete web scraper that extracts hotel listings from Booking.com using Python. The same approach works on many other sites, although the selectors will differ and auto-generated class names may change over time.

    Hope this gives you a solid understanding of how to use requests and BeautifulSoup for web scraping.

    While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

    Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

    This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

    With the power of Proxies API combined with Python libraries like Beautiful Soup, you can scrape data at scale without getting blocked.
