Scraping Booking.com Property Listings with Rust in 2023

Oct 15, 2023 · 5 min read

In this article, we will learn how to scrape property listings from Booking.com using Rust. We will use the reqwest and select crates to fetch the HTML content and then extract key information like property name, location, ratings, etc.

Prerequisites

To follow along, you will need:

  • Rust installed on your system
  • Cargo package manager
  • Basic knowledge of Rust programming
Creating a New Project

Let's start by creating a new Rust project:

    cargo new booking-scraper
    cd booking-scraper
    

This will generate a new Rust project for us to work with.

Adding Dependencies

Next we add the reqwest and select crates, plus tokio for the async runtime, as dependencies in Cargo.toml:

    [dependencies]
    reqwest = "0.11"
    select = "0.5"
    tokio = { version = "1", features = ["full"] }
    

Importing Crates

At the top of main.rs, let's import the items we will use:

    use select::document::Document;
    use select::predicate::{Attr, Class, Name, Predicate};
    

reqwest will be used to make HTTP requests; since we call it through its full path, it needs no use statement.

select will help parse and query the HTML. Attr lets us match elements by attribute, and the Predicate trait provides the .and() combinator for combining predicates.

Defining the Target URL

Let's define the URL we want to scrape:

    let url = "https://www.booking.com/searchresults.html?ss=New+York&...";

We won't paste the full query string here for brevity; the full code at the end uses a complete example URL.
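If you would rather not hand-encode the query string, reqwest can build it for you. Here is a minimal, self-contained sketch; the parameter names (ss, checkin, checkout, group_adults) are taken from the search URL used later in this article, and the values are just placeholders:

    use reqwest::Client;

    fn main() {
        // Build the search URL from query parameters instead of hand-encoding them.
        let request = Client::new()
            .get("https://www.booking.com/searchresults.html")
            .query(&[
                ("ss", "New York"),
                ("checkin", "2023-03-01"),
                ("checkout", "2023-03-05"),
                ("group_adults", "2"),
            ])
            .build()
            .expect("valid request");

        // Prints the fully encoded URL that would be requested.
        println!("{}", request.url());
    }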

Setting a User Agent

We need to set a realistic User-Agent header:

    let client = reqwest::Client::new();
    let mut headers = reqwest::header::HeaderMap::new();
    headers.insert("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36".parse().unwrap());
    

This will make the request appear to come from a real browser.
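If you plan to make more than one request, you can instead attach the header once when constructing the client, so every request sends it automatically. A minimal sketch using reqwest's ClientBuilder:

    use reqwest::header::{HeaderMap, HeaderValue, USER_AGENT};

    fn build_client() -> Result<reqwest::Client, reqwest::Error> {
        // Every request made with this client will carry the User-Agent header.
        let mut headers = HeaderMap::new();
        headers.insert(USER_AGENT, HeaderValue::from_static("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"));

        reqwest::Client::builder()
            .default_headers(headers)
            .build()
    }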

Fetching the HTML Page

Now we can make the GET request:

    let res = client.get(url).headers(headers).send().await?;
    
    if res.status().is_success() {
      // Parse HTML
    } else {
      println!("Request failed");
    }
    

We check that the request succeeded before parsing the HTML.
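An alternative, if you prefer to bail out early rather than branch, is reqwest's error_for_status, which turns a 4xx/5xx response into an Err that the ? operator propagates. This fragment would replace the if/else inside our async main:

    // Convert HTTP error statuses (4xx/5xx) into an Err so `?` propagates them.
    let res = client
        .get(url)
        .headers(headers)
        .send()
        .await?
        .error_for_status()?;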

Parsing the HTML

To parse the HTML, we convert the response body to a string:

    let html = res.text().await?;
    let document = Document::from(html.as_str());
    

This creates a select::document::Document from the HTML.
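If you want to try the Document API in isolation before wiring it to a live response, here is a small, self-contained sketch that parses an inline snippet; the markup is invented for illustration and only mimics the data-testid pattern we rely on:

    use select::document::Document;
    use select::predicate::{Attr, Name, Predicate};

    fn main() {
        // Toy markup standing in for a single property card.
        let html = r#"<div data-testid="property-card"><div data-testid="title">Example Hotel</div></div>"#;
        let document = Document::from(html);

        for title in document.find(Name("div").and(Attr("data-testid", "title"))) {
            println!("{}", title.text()); // prints "Example Hotel"
        }
    }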

Extracting Property Cards

The property cards have a data-testid attribute we can search for:

    let property_cards = document.find(Name("div").and(Attr("data-testid", "property-card")));

This combines the Name and Attr predicates to find every div whose data-testid attribute is "property-card".

Looping Through Cards

We can iterate through the cards:

    for card in property_cards {
        // Extract data from card
    }

Inside this loop we will extract information from each card node.
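If you also want to number the listings as they are printed, the standard enumerate adapter works on the iterator returned by find; a small optional variation:

    // Number each listing in the output, starting from 1.
    for (i, card) in property_cards.enumerate() {
        println!("--- Property #{} ---", i + 1);
        // Extract data from card
    }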

Extracting Title

To get the title, we search for the data-testid="title" element:

    let title = card.find(Name("div").and(Attr("data-testid", "title")))
                    .next()
                    .map(|n| n.text());

We take the first matching node and get its text content.
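Scraped text often carries stray whitespace around it, so it can be worth trimming at extraction time; a small optional tweak:

    // Same extraction, but with surrounding whitespace trimmed away.
    let title = card.find(Name("div").and(Attr("data-testid", "title")))
                    .next()
                    .map(|n| n.text().trim().to_string());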

Extracting Location

Similarly, the address is under a data-testid="address" element:

    let location = card.find(Name("span").and(Attr("data-testid", "address")))
                       .next()
                       .map(|n| n.text());

The pattern is the same for other fields, so it is worth factoring into a small helper (see the sketch below).
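Here is one way to factor out that repetition. The text_by_testid helper below is not part of the article's original code, just a sketch of how the lookup could be shared:

    use select::node::Node;
    use select::predicate::{Attr, Name, Predicate};

    // Text of the first descendant matching `tag` with the given data-testid value.
    fn text_by_testid<'a>(card: Node<'a>, tag: &str, testid: &str) -> Option<String> {
        card.find(Name(tag).and(Attr("data-testid", testid)))
            .next()
            .map(|n| n.text())
    }

    // Inside the loop this becomes:
    // let title = text_by_testid(card, "div", "title");
    // let location = text_by_testid(card, "span", "address");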

Extracting Rating

The star rating's aria-label contains the score:

    let rating = card.find(Name("div").and(Class("e4755bbd60")))
                     .next()
                     .and_then(|n| n.attr("aria-label"));
    

Here we read the aria-label attribute from the matching div. Class names like e4755bbd60 are auto-generated, so expect them to change over time.
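The label itself is a human-readable string (something like "Scored 8.4" at the time of writing, though the exact wording is an assumption about Booking.com's markup). If you want the numeric score, you can pull out the last token and parse it; this fragment continues inside the loop:

    // Extract a numeric score from the aria-label, e.g. "Scored 8.4" -> 8.4.
    let score: Option<f32> = rating
        .and_then(|label| label.split_whitespace().last())
        .and_then(|s| s.parse::<f32>().ok());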

Extracting Review Count

The review count text is inside a class="abf093bdfe" element:

    let review_count = card.find(Class("abf093bdfe"))
                           .next()
                           .map(|n| n.text());
    

Extracting Description

The description is in a class="d7449d770c" element:

    let description = card.find(Class("d7449d770c"))
                          .next()
                          .map(|n| n.text());
    

Printing the Data

Finally, we can print out the extracted data. Each value is an Option, so we fall back to an empty string when a field is missing:

    println!("Name: {}", title.unwrap_or_default());
    println!("Location: {}", location.unwrap_or_default());
    println!("Rating: {}", rating.unwrap_or_default());
    // etc...
    

Here is the full code:

    use select::document::Document;
    use select::predicate::{Attr, Class, Name, Predicate};

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {

        let url = "https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2";

        let client = reqwest::Client::new();

        let mut headers = reqwest::header::HeaderMap::new();
        headers.insert("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36".parse().unwrap());

        let res = client.get(url).headers(headers).send().await?;

        if res.status().is_success() {

            let html = res.text().await?;
            let document = Document::from(html.as_str());

            // Each search result is a div with data-testid="property-card".
            let property_cards = document.find(Name("div").and(Attr("data-testid", "property-card")));

            for card in property_cards {

                let title = card.find(Name("div").and(Attr("data-testid", "title")))
                                .next()
                                .map(|n| n.text());

                let location = card.find(Name("span").and(Attr("data-testid", "address")))
                                   .next()
                                   .map(|n| n.text());

                let rating = card.find(Name("div").and(Class("e4755bbd60")))
                                 .next()
                                 .and_then(|n| n.attr("aria-label"));

                let review_count = card.find(Class("abf093bdfe"))
                                       .next()
                                       .map(|n| n.text());

                let description = card.find(Class("d7449d770c"))
                                      .next()
                                      .map(|n| n.text());

                println!("Name: {}", title.unwrap_or_default());
                println!("Location: {}", location.unwrap_or_default());
                println!("Rating: {}", rating.unwrap_or_default());
                println!("Review Count: {}", review_count.unwrap_or_default());
                println!("Description: {}", description.unwrap_or_default());
            }

        } else {
            println!("Request failed: {}", res.status());
        }

        Ok(())
    }

And that covers the basics of scraping data from Booking.com in Rust! The same principles can be applied to many other sites.
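As a next step, you might collect the fields into a struct instead of printing them straight away, which makes it easier to sort, dedupe, or serialize the results later. A minimal sketch; the Listing type is purely illustrative, not part of the code above:

    // An illustrative container for one scraped search result.
    #[derive(Debug)]
    struct Listing {
        title: Option<String>,
        location: Option<String>,
        rating: Option<String>,
    }

    // Inside the loop, instead of println!, push into a Vec:
    // listings.push(Listing {
    //     title,
    //     location,
    //     rating: rating.map(|s| s.to_string()),
    // });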

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.
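If you go that route, the call pattern from Rust is the same as any other GET request. A minimal sketch, assuming the simple key/url endpoint shown in the Proxies API examples (replace API_KEY with your own key):

    use reqwest::Client;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Fetch a page through the Proxies API endpoint instead of hitting the site directly.
        let html = Client::new()
            .get("http://api.proxiesapi.com/")
            .query(&[("key", "API_KEY"), ("url", "https://example.com")])
            .send()
            .await?
            .text()
            .await?;

        println!("fetched {} bytes of HTML", html.len());
        Ok(())
    }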

With the power of Proxies API combined with Rust crates like reqwest and select, you can scrape data at scale without getting blocked.
