Scraping Booking.com Property Listings with PHP in 2023

Oct 15, 2023 · 5 min read

In this article, we will learn how to scrape property listings from Booking.com using PHP. We will use common PHP libraries to fetch the HTML content and then parse and extract key information like property name, location, ratings, etc.

Prerequisites

To follow along, you will need:

  • PHP 7.0 or higher
  • Composer installed to add PHP packages
  • Basic knowledge of PHP and HTML

    Installing Dependencies

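    We will use Guzzle for sending HTTP requests and Symfony DomCrawler for parsing HTML. DomCrawler's filter() method takes CSS selectors and relies on the Symfony CssSelector component, so we install that as well.

    Install them using Composer:

    composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector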
    

    This will download the packages into the vendor folder.

    Including Dependencies

    At the top of your PHP script, include the Composer autoloader and import the two classes we need:

    require __DIR__ . '/vendor/autoload.php';
    
    use GuzzleHttp\Client;
    use Symfony\Component\DomCrawler\Crawler;
    

    The autoloader will load the classes when needed.

    Defining the Target URL

    We will scrape listings from this URL on Booking.com:

    $url = 'https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2';
    

    You can modify the parameters as needed.
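
If you change the parameters often, it can be easier to build the URL from an array than to edit the string by hand. The following is a minimal sketch: buildSearchUrl is a hypothetical helper name, and it only covers the four query parameters that appear in the example URL above.

```php
<?php
// Hypothetical helper (not part of the original script): builds a search URL
// from the same query parameters that appear in the article's example URL.
function buildSearchUrl(string $destination, string $checkin, string $checkout, int $adults): string
{
    // http_build_query() URL-encodes each value (a space becomes "+").
    $query = http_build_query([
        'ss'           => $destination, // destination search string
        'checkin'      => $checkin,     // date as YYYY-MM-DD
        'checkout'     => $checkout,    // date as YYYY-MM-DD
        'group_adults' => $adults,
    ], '', '&');

    return 'https://www.booking.com/searchresults.en-gb.html?' . $query;
}

$url = buildSearchUrl('New York', '2023-03-01', '2023-03-05', 2);
echo $url, "\n";
```

Called with the values shown, this reproduces the example URL used in this article.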

    Setting User Agent

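    Guzzle's default User-Agent identifies the client as a script, which sites like Booking.com may block, so we send a browser-like User-Agent string instead: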

    $userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36';
    

    Fetching the HTML Page

    Use Guzzle to send a GET request and get the response:

    $client = new Client(['headers' => ['User-Agent' => $userAgent]]);
    
    $response = $client->request('GET', $url);
    
    $html = (string) $response->getBody();
    

    We configure Guzzle with the User Agent header and fetch the page. getBody() returns a stream object, so we cast it to a string to get the HTML.

    Parsing the HTML

    Use DomCrawler to parse the HTML:

    $crawler = new Crawler($html);
    

    This creates a Crawler instance with the document structure.

    Extracting Property Cards

    The property cards have a data-testid attribute of property-card:

    $cards = $crawler->filter('div[data-testid="property-card"]');
    

    This extracts all divs with that attribute into a Crawler collection.
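
To check selectors offline, the same attribute match can be written as XPath using PHP's bundled DOM extension. This is a sketch with a different tool than DomCrawler, run against a made-up HTML fragment rather than a real Booking.com page. The firstText helper and the sample hotel names are hypothetical. A missing field comes back as null instead of raising an error.

```php
<?php
// Tiny stand-in for a search results page; the real markup is larger and may differ.
$sampleHtml = <<<'HTML'
<html><body>
  <div data-testid="property-card">
    <h3>Hotel Alpha</h3>
    <span data-testid="address">Manhattan, New York</span>
  </div>
  <div data-testid="property-card">
    <h3>Hotel Beta</h3>
  </div>
</body></html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: text of the first match under $context, or $default if
// nothing matches. Nullable type hints need PHP 7.1+.
function firstText(DOMXPath $xpath, string $query, DOMNode $context, ?string $default = null): ?string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return $default;
    }
    return trim($nodes->item(0)->textContent);
}

$results = [];
// XPath equivalent of the CSS selector div[data-testid="property-card"].
foreach ($xpath->query('//div[@data-testid="property-card"]') as $card) {
    $results[] = [
        'name'     => firstText($xpath, './/h3', $card),
        'location' => firstText($xpath, './/span[@data-testid="address"]', $card),
    ];
}

print_r($results);
```

The second sample card has no address element, so its location is null. Real pages can have incomplete cards too, which is worth planning for.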

    Looping Through Cards

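    Iterating a Crawler with foreach yields raw DOM nodes, which have no filter() method. Use the Crawler's each() method instead, which passes every card to the callback as its own Crawler:

    $cards->each(function (Crawler $card) {
    
      // Extract data from $card
    
    });
    

    Inside the callback we can extract information from each $card.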

    Extracting Property Name

    The title is in an h3 element:

    $title = $card->filter('h3')->text();
    

    This finds the h3 element inside the card and returns its text.

    Extracting Location

    The location is in a span:

    $location = $card->filter('span[data-testid="address"]')->text();
    

    Filter by the data-testid attribute to find the span.

    Extracting Rating

    Get the aria-label attribute of the star rating div:

    $rating = $card->filter('div.e4755bbd60')->attr('aria-label');
    

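    Filter by the CSS class name. Class names like e4755bbd60 are auto-generated and can change when Booking.com updates its front end, so if the selector matches nothing, check the current markup in your browser's developer tools.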
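
The aria-label value is a text label, not a number. If you need a numeric rating, one option is to pull the first number out of the label. This sketch assumes the label contains the rating as a plain number (for example "4 out of 5"). That format is an assumption, and parseRating is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns the first number found in the label as a float,
// or null when the label is missing or contains no number.
// Nullable type hints need PHP 7.1+.
function parseRating(?string $ariaLabel): ?float
{
    if ($ariaLabel !== null && preg_match('/\d+(?:\.\d+)?/', $ariaLabel, $match)) {
        return (float) $match[0];
    }
    return null;
}

var_dump(parseRating('4 out of 5'));
var_dump(parseRating(null));
```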

    Extracting Review Count

    Get text of the review count div:

    $reviewCount = $card->filter('div.abf093bdfe')->text();
    

    Again filter by class name.
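
The review count also arrives as text. A small sketch for turning it into an integer follows. It assumes the text looks like "1,234 reviews", with the count as the first number and optional thousands separators. That format is an assumption, and parseReviewCount is a hypothetical helper.

```php
<?php
// Hypothetical helper: extracts the first number (allowing "," or "." as
// thousands separators) and returns it as an int, or null if none is found.
// Nullable type hints need PHP 7.1+.
function parseReviewCount(string $text): ?int
{
    if (!preg_match('/\d+(?:[,.]\d{3})*/', $text, $match)) {
        return null;
    }
    // Drop the separators, keeping digits only.
    return (int) preg_replace('/\D/', '', $match[0]);
}

var_dump(parseReviewCount('1,234 reviews'));
var_dump(parseReviewCount('No reviews yet'));
```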

    Extracting Description

    Get the description div text:

    $description = $card->filter('div.d7449d770c')->text();
    

    Printing the Data

    Print out the extracted information:

    echo "Name: $title\n";
    echo "Location: $location\n";
    echo "Rating: $rating\n";
    echo "Review Count: $reviewCount\n";
    echo "Description: $description\n\n";
    

    This prints the key details for each property listing card.

    You can also store the data in an array instead of printing.
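
As a rough sketch of the array approach, each card becomes one associative array appended to a list, and the list is encoded as JSON at the end. The two rows below are placeholder values standing in for what the scraper would extract. They are not real Booking.com data.

```php
<?php
// One associative array per property; in the scraper you would append a row
// like this for every card instead of echoing the fields.
$properties = [];

$properties[] = [
    'name'        => 'Hotel Alpha',          // placeholder value
    'location'    => 'Manhattan, New York',  // placeholder value
    'rating'      => '4 out of 5',           // placeholder value
    'reviewCount' => '1,234 reviews',        // placeholder value
    'description' => 'Double room',          // placeholder value
];

$properties[] = [
    'name'        => 'Hotel Beta',
    'location'    => 'Brooklyn, New York',
    'rating'      => null,                   // a card may lack a field
    'reviewCount' => null,
    'description' => 'Studio apartment',
];

// Encode the whole result set once, after all cards are processed.
$json = json_encode($properties, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
echo $json, "\n";
```

The JSON string can then be written to a file with file_put_contents() or passed to another program.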

    Full Script

    Here is the full scraping script:

    <?php
    
    require __DIR__ . '/vendor/autoload.php';
    
    use GuzzleHttp\Client;
    use Symfony\Component\DomCrawler\Crawler;
    
    $url = 'https://www.booking.com/searchresults.en-gb.html?ss=New+York&checkin=2023-03-01&checkout=2023-03-05&group_adults=2';
    
    $userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36';
    
    // Send a browser-like User-Agent with every request
    $client = new Client(['headers' => ['User-Agent' => $userAgent]]);
    
    $response = $client->request('GET', $url);
    
    // getBody() returns a stream; cast it to a string for the Crawler
    $html = (string) $response->getBody();
    
    $crawler = new Crawler($html);
    
    $cards = $crawler->filter('div[data-testid="property-card"]');
    
    // each() passes every card to the callback as its own Crawler
    $cards->each(function (Crawler $card) {
    
      $title = $card->filter('h3')->text();
    
      $location = $card->filter('span[data-testid="address"]')->text();
    
      $rating = $card->filter('div.e4755bbd60')->attr('aria-label');
    
      $reviewCount = $card->filter('div.abf093bdfe')->text();
    
      $description = $card->filter('div.d7449d770c')->text();
    
      echo "Name: $title\n";
      echo "Location: $location\n";
      echo "Rating: $rating\n";
      echo "Review Count: $reviewCount\n";
      echo "Description: $description\n\n";
    
    });
    
    

    This script scrapes and prints key details from Booking.com property listings using PHP with Guzzle and DomCrawler. The same approach works for other sites that return their listing data in the initial HTML response; only the URL and the selectors change.

    While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

    Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

    This allows scraping at scale without headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

    With Proxies API handling the requests and PHP libraries like Guzzle and DomCrawler handling the parsing, you can scrape data at scale with a lower risk of getting blocked.
