Scraping Real Estate Listings From Realtor in Scala

Jan 9, 2024 · 6 min read

While debates around the ethics of web scraping continue, the practice remains a useful way for developers to extract data from websites. In this beginner-focused tutorial, we'll walk through a full Scala code example for scraping key details from real estate listings on Realtor.com using Jsoup, a Java HTML-parsing library that can be called directly from Scala.

This is the listings page we are talking about…

Getting Set Up

Before we dive into the code, you'll need to install Jsoup if you don't already have it. You can add this dependency in your project's build tool, such as Gradle or Maven. Here's an example for Gradle:

implementation 'org.jsoup:jsoup:1.14.3'

And for Maven:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.14.3</version>
</dependency>
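Since the code in this tutorial is Scala, you may be building with sbt rather than Gradle or Maven. A minimal sketch of the equivalent line for build.sbt, assuming the same Jsoup version as above (a single % is used because Jsoup is a plain Java artifact with no Scala version suffix):

```scala
// build.sbt
libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"
```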

Now we're ready to scrape!

Connecting to the Page

Let's explore what's happening section-by-section:

// Define the URL of the Realtor.com search page
val url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"

We specify the exact Realtor URL that we want to scrape. This will contain the listings for San Francisco when visited in a browser.

// Set the User-Agent header
val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

Next we set the User-Agent header to mimic a Chrome browser visit. Many sites check this header to determine if the visitor is a real browser or an automated program.

// Fetch the HTML content of the page
val doc: Document = Jsoup.connect(url).userAgent(userAgent).get()

We use Jsoup to connect to the Realtor URL, passing in that User-Agent string we set. Jsoup downloads (or "fetches") the full HTML content from that page and stores it for us to work with in the doc variable.

Tip: I find it helpful to think of this like browsing to the page and doing "View Source" to see all the underlying HTML. Jsoup handles that part for us programmatically.
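A fetch can fail for many reasons (timeouts, HTTP errors, blocked requests), so it can help to make failure explicit. The sketch below is one possible approach, not the only one: fetch is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It wraps the same Jsoup call in scala.util.Try so the caller gets either a Document or the error.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import scala.util.{Failure, Success, Try}

object FetchSketch {
  // Hypothetical helper (not part of Jsoup): wraps the fetch in Try so that
  // network errors, HTTP errors, and malformed URLs come back as a Failure.
  def fetch(url: String, userAgent: String, timeoutMs: Int = 10000): Try[Document] =
    Try {
      Jsoup.connect(url)
        .userAgent(userAgent)
        .timeout(timeoutMs) // timeout in milliseconds (assumed value)
        .get()
    }

  def main(args: Array[String]): Unit = {
    val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    val url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"

    fetch(url, userAgent) match {
      case Success(doc) => println(s"Fetched page titled: ${doc.title()}")
      case Failure(e)   => println(s"Fetch failed: ${e.getMessage}")
    }
  }
}
```

Pattern matching on Success and Failure keeps the error handling next to the call, instead of in a catch block far below it.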

Now let's move on to the most critical part - actually extracting information from that HTML using CSS selectors!

Extracting Data with Selectors

Inspecting the element

When we inspect an element in Chrome's dev tools, we can see that each listing block is wrapped in a div with the class name used in the selector below.

// Find all the listing blocks using the provided class name.
// doc.select returns Jsoup's Elements (a Java list), so we convert it to a Scala Seq.
// Requires: import scala.jdk.CollectionConverters._ (Scala 2.13+)
val listingBlocks: Seq[Element] = doc.select("div.BasePropertyCard_propertyCardWrap__J0xUj").asScala.toSeq

The main listings on Realtor.com are contained in div tags with the BasePropertyCard_propertyCardWrap__J0xUj class name.

We use Jsoup's .select() method to find all divs matching that class name on the page. It returns Jsoup's Elements collection (a Java list of Element objects, one per listing block), which .asScala.toSeq converts into a Scala Seq so we can loop over it idiomatically. Think of it like matching LEGO bricks: we traverse the page and select all bricks of a certain shape.

Tip: You can discover class names to target by inspecting elements in your browser's dev tools. The styles and classes applied to each element are visible there.
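To experiment with selectors without hitting the live site, you can parse a small HTML string instead. The sketch below uses made-up markup (the card, title, and ad class names are illustrative assumptions, not Realtor.com's real classes) to show select picking out only the matching blocks.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element
import scala.jdk.CollectionConverters._

object SelectSketch {
  // Made-up markup that imitates the shape of a listings page.
  val sampleHtml: String =
    """<div class="card"><span class="title">First listing</span></div>
      |<div class="card"><span class="title">Second listing</span></div>
      |<div class="ad">Not a listing</div>""".stripMargin

  // Parse the HTML string and return every div with class "card" as a Scala Seq.
  def cards(html: String): Seq[Element] =
    Jsoup.parse(html).select("div.card").asScala.toSeq

  def main(args: Array[String]): Unit = {
    val found = cards(sampleHtml)
    println(s"Matched ${found.size} blocks")
    found.foreach(card => println(card.selectFirst("span.title").text()))
  }
}
```

The div with class "ad" is skipped because it does not match the selector, which is the same filtering that happens on the real page.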

Now we can iterate through each listing:

for (listingBlock <- listingBlocks) {

  // Extract data from each listingBlock

}

And inside that loop, we use additional selectors to pull text from specific tags:

val brokerName: String = listingBlock.selectFirst("span.BrokerTitle_titleText__20u1P").text()

This selector finds the span tag with class BrokerTitle_titleText__20u1P inside the current listing block. By calling .text() on the returned element, we extract just its text, for example "John Smith" or "ABC Realty".

Let's break down that selector:

  • listingBlock - The "scope" - search only within the current listing block
  • span.BrokerTitle_titleText__20u1P - Find a span with this class name
  • .text() - Extract the text contents of the matching element

We use very similar selectors to extract other data points like status, price, beds, and baths:

val status: String = listingBlock.selectFirst("div.message").text()

val price: String = listingBlock.selectFirst("div.card-price").text()

// And so on...

Notice that for some fields we look for a div with a certain class, while for others (such as square footage in the full code) we query by data attribute, as in li[data-testid=property-meta-sqft]. HTML structure and CSS styling vary from site to site, so part of the skill is adapting the selectors to each one.
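One caveat with chaining .text() directly: Jsoup's selectFirst returns null when nothing matches, so a listing that lacks a field (say, no lot size) would throw a NullPointerException. A minimal sketch of one way to guard against that, assuming a hypothetical helper named textOf and made-up sample markup:

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Element

object SafeTextSketch {
  // Hypothetical helper (not part of Jsoup): returns None when the selector matches nothing.
  def textOf(scope: Element, selector: String): Option[String] =
    Option(scope.selectFirst(selector)).map(_.text())

  // Made-up markup: this card has a price but no status message.
  val card: Element =
    Jsoup.parse("""<div class="card"><div class="card-price">$1,250,000</div></div>""")
      .selectFirst("div.card")

  def main(args: Array[String]): Unit = {
    println(textOf(card, "div.card-price").getOrElse("N/A"))
    println(textOf(card, "div.message").getOrElse("N/A"))
  }
}
```

Wrapping the result in Option pushes the "what if it is missing?" decision to the caller, who can supply a default or skip the listing.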

Here are a few key advantages of using selectors:

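  • We avoid brittle techniques like finding elements by text content, since visible text can change easily. Classes, IDs, and data attributes are usually more reliable identifiers, although build-generated class suffixes such as __J0xUj may change when the site is redeployed, so expect to update selectors from time to time.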
  • Selectors mirror how developers inspect and identify elements in the browser dev tools during manual testing.
  • They provide a flexible way to specify elements - we can combine class names, attributes, element types and more to uniquely identify each data field we need.
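To illustrate combining element types, class names, and attributes, here is a small sketch against made-up markup (the __ABC12 suffix and the sample values are invented for illustration). It uses Jsoup's attribute-prefix operator ^= so the match does not depend on the generated suffix. Note that [class^=...] compares against the start of the whole class attribute, so it assumes the target class is listed first.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object SelectorComboSketch {
  // Made-up markup loosely modelled on the class and attribute names in this article.
  val sampleHtml: String =
    """<div class="BasePropertyCard_propertyCardWrap__ABC12">
      |  <ul>
      |    <li data-testid="property-meta-beds">3 bed</li>
      |    <li data-testid="property-meta-sqft">1,500 sqft</li>
      |  </ul>
      |</div>""".stripMargin

  // Collect the text of every property-meta-* item inside each matching card.
  def metaTexts(html: String): Seq[String] = {
    val doc = Jsoup.parse(html)
    // Element type + class prefix: matches the card regardless of the suffix.
    val cards = doc.select("div[class^=BasePropertyCard_propertyCardWrap]").asScala.toSeq
    // Element type + attribute prefix: matches beds, sqft, and any other meta item.
    cards.flatMap(_.select("li[data-testid^=property-meta-]").asScala.map(_.text()))
  }

  def main(args: Array[String]): Unit =
    println(metaTexts(sampleHtml).mkString(", "))
}
```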

Now you have a high-level look at how Jsoup connects to pages and uses CSS selector queries to extract key listing details! Let's look at the full code for reference...

Full Code

import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Element}
import scala.jdk.CollectionConverters._ // Scala 2.13+; converts Jsoup's Java list to a Scala Seq

object RealtorScraper {
  // Return the text of the first match inside `scope`, or "N/A" when nothing matches.
  // selectFirst returns null for a missing element, so we wrap it in Option.
  private def textOf(scope: Element, selector: String): String =
    Option(scope.selectFirst(selector)).map(_.text()).getOrElse("N/A")

  def main(args: Array[String]): Unit = {
    // Define the URL of the Realtor.com search page
    val url = "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA"

    // Set the User-Agent header
    val userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"

    try {
      // Fetch the HTML content of the page
      val doc: Document = Jsoup.connect(url).userAgent(userAgent).get()

      // Find all the listing blocks using the provided class name
      val listingBlocks: Seq[Element] =
        doc.select("div.BasePropertyCard_propertyCardWrap__J0xUj").asScala.toSeq

      // Loop through each listing block and extract information
      for (listingBlock <- listingBlocks) {
        // Extract the broker name (a span nested inside the broker title div)
        val brokerName: String =
          textOf(listingBlock, "div.BrokerTitle_brokerTitle__ZkbBW span.BrokerTitle_titleText__20u1P")

        // Extract the status (e.g., For Sale)
        val status: String = textOf(listingBlock, "div.message")

        // Extract the price
        val price: String = textOf(listingBlock, "div.card-price")

        // Extract other details like beds, baths, sqft, and lot size
        val beds: String = textOf(listingBlock, "li[data-testid=property-meta-beds]")
        val baths: String = textOf(listingBlock, "li[data-testid=property-meta-baths]")
        val sqft: String = textOf(listingBlock, "li[data-testid=property-meta-sqft]")
        val lotSize: String = textOf(listingBlock, "li[data-testid=property-meta-lot-size]")

        // Extract the address
        val address: String = textOf(listingBlock, "div.card-address")

        // Print the extracted information
        println(s"Broker: $brokerName")
        println(s"Status: $status")
        println(s"Price: $price")
        println(s"Beds: $beds")
        println(s"Baths: $baths")
        println(s"Sqft: $sqft")
        println(s"Lot Size: $lotSize")
        println(s"Address: $address")
        println("-" * 50)  // Separating listings
      }
    } catch {
      case e: Exception =>
        println(s"Failed to retrieve the page. Error: ${e.getMessage}")
    }
  }
}

As you work on more scrapers, the process will become second-nature - connect, select elements, extract data. But it takes examples like this walkthrough to fully demystify what's happening behind the scenes.

Hopefully as a beginner you now feel equipped to start writing scrapers using Jsoup and CSS selectors!
