Scraping Yelp Business Listings in Go

Dec 6, 2023 · 7 min read

Web scraping refers to the automated extraction of data from websites. In this guide, we'll walk through an example of scraping business listing data from Yelp to perform further analysis.


Use Case

Why would someone want to scrape Yelp? Here are some examples of what you can do with the scraped data:

  • Analyze ratings, reviews, and price ranges for competitive analysis
  • Build a dataset of business info like names, locations, and categories
  • Combine with other data sources for deeper insights into consumer behavior

    The code we will go through scrapes key details like business name, rating, price range, number of reviews, and location for each listing on a Yelp search results page. Let's dive in!

    The Code

    We will break down this Go program section-by-section to understand how it works under the hood:

    // Import the packages we need
    import (
      "fmt"
      "io/ioutil"
      "net/http"
      "net/url"
      "strconv"
      "strings"

      "github.com/PuerkitoBio/goquery"
    )
    

    First we import all the necessary packages that will be used:

  • net/http and io/ioutil: Making HTTP requests and reading responses
  • net/url: Encoding the Yelp URL
  • strconv and strings: Parsing and manipulating strings
  • github.com/PuerkitoBio/goquery: Querying and parsing HTML

    Constructing the URLs

    // Yelp URL to scrape
    yelpURL := "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    
    // URL-encode the string
    encodedURL := url.QueryEscape(yelpURL)
    
    // ProxiesAPI URL
    apiURL := "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encodedURL
    

    We begin by defining the Yelp search URL we want to scrape, with the search term and location baked into its query parameters.

    The URL is then percent-encoded so it can be passed safely as a single query parameter to the proxy service API. Proxies are needed to bypass Yelp's bot detection and scraping restrictions.

    NOTE: You would need to sign up for a proxy service like ProxiesAPI to obtain an auth key. Proxy rotation is necessary for stable scraping of sites like Yelp.
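
    If you are curious what url.QueryEscape actually produces here, this minimal standalone snippet prints the encoded form - every reserved character in the Yelp URL is percent-encoded so it survives being embedded as a query parameter:

    package main
    
    import (
    	"fmt"
    	"net/url"
    )
    
    func main() {
    	yelpURL := "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    	fmt.Println(url.QueryEscape(yelpURL))
    	// Prints:
    	// https%3A%2F%2Fwww.yelp.com%2Fsearch%3Ffind_desc%3Dchinese%26find_loc%3DSan%2BFrancisco%252C%2BCA
    }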

    Setting Request Headers

    // Browser-like request headers. Accept-Encoding is deliberately left
    // unset: when you set it yourself, Go's HTTP client stops transparently
    // decompressing gzip responses, and the parser would see compressed bytes.
    headers := map[string]string{
      "User-Agent":      "Mozilla/5.0...",
      "Accept-Language": "en-US,en;q=0.5",
      "Referer":         "https://www.google.com/",
    }
    
    // Create HTTP client
    client := &http.Client{}
    
    // Build GET request
    req, err := http.NewRequest("GET", apiURL, nil)
    if err != nil {
      panic(err)
    }
    
    // Add headers
    for key, value := range headers {
      req.Header.Set(key, value)
    }
    

    We set headers like User-Agent and Referer to mimic a normal browser visit, reducing the chances of being blocked. Note that Accept-Encoding is left unset on purpose: Go's HTTP client only decompresses gzip transparently when it manages that header itself.

    The HTTP client and GET request are created so that the request goes through the ProxiesAPI service rather than hitting Yelp directly.
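
    One small hardening step worth considering, not part of the original walkthrough: give the client a timeout so a slow or stalled proxy request cannot hang the scraper indefinitely. A minimal sketch (30 seconds is an arbitrary choice):

    // A client that gives up after 30 seconds; requires importing "time"
    client := &http.Client{
      Timeout: 30 * time.Second,
    }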

    Making the Request

    // Send request
    resp, err := client.Do(req)
    
    // Handle errors
    if err != nil {
      panic(err)
    }
    
    // Close response body
    defer resp.Body.Close()
    

    We send the request and make sure to close the response body when done.

    Processing the Response

    Inspecting the page

    When we inspect the page in the browser's developer tools, we can see that each listing is wrapped in a div carrying the classes arrange-unit__09f24__rqHTg, arrange-unit-fill__09f24__CUubG, and css-1qn0b6x. In a goquery selector, multiple classes on the same element are chained with dots.

    // Read response body
    body, err := ioutil.ReadAll(resp.Body)
    
    // Write HTML to file
    err = ioutil.WriteFile("yelp_html.html", body, 0644)
    
    // 200 OK status?
    if resp.StatusCode == 200 {
    
      // Parse HTML
      doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
    
      // Find listings - chain all three classes seen in the inspector
      listings := doc.Find("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
    

    The HTML response is read into memory and also saved to a file, so you can inspect the raw markup later when debugging selectors.

    We check that the status code returned is 200 OK before proceeding to extract data.

    This is where the key action happens - selecting elements from the HTML document using goquery!

    // Loop through listings
    listings.Each(func(index int, item *goquery.Selection) {
    
      // Extract name
      nameSel := item.Find("a.css-19v1rkv")
      businessName := nameSel.Text()
    
      // Extract rating
      ratingSel := item.Find("span.css-gutk1c")
      rating := ratingSel.Text()
    
      // Extract price range (note the element name and the leading dot
      // for the class in the selector)
      priceRangeSel := item.Find("span.priceRange__09f24__mmOuH")
      priceRange := priceRangeSel.Text()
    
      // Extract number of reviews and location - both live in spans with
      // the same class, so we disambiguate by position, or by whether the
      // text is numeric (this is why strconv is imported)
      numReviews := "N/A"
      location := "N/A"
      spanElements := item.Find("span.css-chan6m")
      if spanElements.Length() >= 2 {
        numReviews = spanElements.Eq(0).Text()
        location = spanElements.Eq(1).Text()
      } else if spanElements.Length() == 1 {
        text := spanElements.Eq(0).Text()
        if _, err := strconv.Atoi(text); err == nil {
          numReviews = text
        } else {
          location = text
        }
      }
    
      // Print data
      fmt.Println("Name:", businessName)
      fmt.Println("Rating:", rating)
      fmt.Println("Number of Reviews:", numReviews)
      fmt.Println("Price Range:", priceRange)
      fmt.Println("Location:", location)
    })
    

    For each business listing, we use CSS selectors to find and extract specific pieces of data:

  • a.css-19v1rkv to select the business name anchor tag (it also carries the listing URL - see the sketch below)
  • span.css-gutk1c to select the rating
  • span.priceRange__09f24__mmOuH for the price range
  • span.css-chan6m for the review count and location
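
    The name anchor also carries the listing's URL, so if you later want to visit each business's detail page, goquery's Attr will pull it out. A small sketch of our own (this field is not extracted in the walkthrough code):

    // Inside the listings.Each callback: Attr returns the attribute
    // value and a bool indicating whether the attribute was present.
    if href, ok := nameSel.Attr("href"); ok {
    	// hrefs on search result pages are typically relative paths
    	fmt.Println("URL:", "https://www.yelp.com"+href)
    }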
    The key things that can trip beginners up are:

  • Understanding that CSS selectors target HTML elements to extract data from
  • Figuring out the right selector combinations to zero in on the data needed
  • Handling variability in the DOM structure - hence the Length() checks before reading values, since Go has no try/catch (see the helper sketch after this list)

    With practice, you build knowledge and intuition for writing robust scrapers!
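
    One way to keep those existence checks from ballooning into repeated if blocks is a small fallback helper. This is a sketch of our own, not part of the walkthrough code, and textOrDefault is a hypothetical name:

    // textOrDefault returns the trimmed text of a selection, or a
    // default value when the selector matched nothing.
    func textOrDefault(sel *goquery.Selection, def string) string {
    	if sel.Length() == 0 {
    		return def
    	}
    	return strings.TrimSpace(sel.Text())
    }
    
    // Usage inside the loop:
    // businessName := textOrDefault(item.Find("a.css-19v1rkv"), "N/A")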

    Finally, we print out the extracted data from each listing.

    And that's it! Between the walkthrough above and the full listing below, you should have a solid grasp of the fundamentals of scraping Yelp listings with Go.

    Some challenges you may face:

  • Dealing with captchas and blocks from sending too many requests (a simple retry-with-backoff sketch follows this list)
  • Yelp's generated class names changing and breaking the selectors
  • Expanding the scraper to cover additional data fields as needed

    But the core concepts remain the same. Feel free to build on this starter scraper for your own projects!
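
    To soften the rate-limiting problem, a common pattern is retrying failed requests with an increasing delay. A minimal sketch, assuming the client and request are built as above (fetchWithRetry is a hypothetical helper, and the attempt count and delays are arbitrary):

    // fetchWithRetry retries a GET request a few times, doubling the
    // wait between attempts; requires importing "time" and "errors".
    func fetchWithRetry(client *http.Client, req *http.Request, attempts int) (*http.Response, error) {
    	delay := 2 * time.Second
    	for i := 0; i < attempts; i++ {
    		resp, err := client.Do(req)
    		if err == nil && resp.StatusCode == 200 {
    			return resp, nil
    		}
    		if err == nil {
    			resp.Body.Close() // discard non-200 responses before retrying
    		}
    		time.Sleep(delay)
    		delay *= 2
    	}
    	return nil, errors.New("all retry attempts failed")
    }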

    Here is the full code:

    package main
    
    import (
    	"fmt"
    	"io/ioutil"
    	"net/http"
    	"net/url"
    	"strconv"
    	"strings"
    
    	"github.com/PuerkitoBio/goquery"
    )
    
    func main() {
    	// Yelp URL
    	yelpURL := "https://www.yelp.com/search?find_desc=chinese&find_loc=San+Francisco%2C+CA"
    
    	// URL-encode the Yelp URL
    	encodedURL := url.QueryEscape(yelpURL)
    
    	// API URL
    	apiURL := "http://api.proxiesapi.com/?premium=true&auth_key=YOUR_AUTH_KEY&url=" + encodedURL
    
    	// Browser-like request headers. Accept-Encoding is deliberately left
    	// unset so Go's HTTP client transparently handles gzip for us.
    	headers := map[string]string{
    		"User-Agent":      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    		"Accept-Language": "en-US,en;q=0.5",
    		"Referer":         "https://www.google.com/",
    	}
    
    	// Create HTTP client and request
    	client := &http.Client{}
    	req, err := http.NewRequest("GET", apiURL, nil)
    	if err != nil {
    		panic(err)
    	}
    
    	// Add headers to the request
    	for key, value := range headers {
    		req.Header.Set(key, value)
    	}
    
    	// Perform the HTTP GET request
    	resp, err := client.Do(req)
    	if err != nil {
    		panic(err)
    	}
    	defer resp.Body.Close()
    
    	// Read the response body
    	body, err := ioutil.ReadAll(resp.Body)
    	if err != nil {
    		panic(err)
    	}
    
    	// Write response to file
    	err = ioutil.WriteFile("yelp_html.html", body, 0644)
    	if err != nil {
    		panic(err)
    	}
    
    	// Check if the request was successful
    	if resp.StatusCode == 200 {
    		// Load the HTML document
    		doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(body)))
    		if err != nil {
    			panic(err)
    		}
    
    		// Find all listings
    		listings := doc.Find("div.arrange-unit__09f24__rqHTg.arrange-unit-fill__09f24__CUubG.css-1qn0b6x")
    		fmt.Println("Listings found:", listings.Length())
    
    		// Loop through each listing
    		listings.Each(func(index int, item *goquery.Selection) {
    			// Extracting business name
    			businessName := "N/A"
    			if nameSel := item.Find("a.css-19v1rkv"); nameSel.Length() > 0 {
    				businessName = nameSel.Text()
    			}
    
    			// Extracting rating
    			rating := "N/A"
    			if ratingSel := item.Find("span.css-gutk1c"); ratingSel.Length() > 0 {
    				rating = ratingSel.Text()
    			}
    
    			// Extracting price range
    			priceRange := "N/A"
    			if priceRangeSel := item.Find("span.priceRange__09f24__mmOuH"); priceRangeSel.Length() > 0 {
    				priceRange = priceRangeSel.Text()
    			}
    
    			// Extracting number of reviews and location
    			numReviews := "N/A"
    			location := "N/A"
    			spanElements := item.Find("span.css-chan6m")
    			if spanElements.Length() >= 2 {
    				numReviews = spanElements.Eq(0).Text()
    				location = spanElements.Eq(1).Text()
    			} else if spanElements.Length() == 1 {
    				text := spanElements.Eq(0).Text()
    				if _, err := strconv.Atoi(text); err == nil {
    					numReviews = text
    				} else {
    					location = text
    				}
    			}
    
    			// Print the extracted information
    			fmt.Println("Business Name:", businessName)
    			fmt.Println("Rating:", rating)
    			fmt.Println("Number of Reviews:", numReviews)
    			fmt.Println("Price Range:", priceRange)
    			fmt.Println("Location:", location)
    			fmt.Println(strings.Repeat("=", 30))
    		})
    	} else {
    		fmt.Printf("Failed to retrieve data. Status Code: %d\n", resp.StatusCode)
    	}
    }
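
    To run the program you need the goquery dependency. Assuming Go modules (the module name here is arbitrary):

    go mod init yelp-scraper
    go get github.com/PuerkitoBio/goquery
    go run main.go

    One more note: io/ioutil has been deprecated since Go 1.16; io.ReadAll and os.WriteFile are the modern replacements, though the ioutil calls above still compile and work.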
