Web Scraping Google Scholar in Scala

Jan 21, 2024 · 7 min read

In this article, we'll walk through a full code example of scraping search results data from Google Scholar using Scala and the Jsoup library.

Even as a beginner, by the end of this article you'll understand:

  • How to connect to a webpage and get its raw HTML content
  • How to use CSS selectors to extract elements of interest
  • How to loop through multiple search results to extract information

This will provide a foundation for building your own web scrapers to gather data for any purpose.

    This is the Google Scholar result page we are talking about…

    Setting up the Environment

    Because we'll be using external libraries, there is some setup required before running the code:

    Install Scala

    If you don't already have Scala installed on your machine, you'll need to:

    1. Download Scala from https://www.scala-lang.org/download/
    2. Follow the installation instructions for your operating system

    Get the Jsoup Dependency

    We use the Jsoup Java library to connect to webpages and parse the HTML content.

    You'll need to add this dependency to your Scala project. If using SBT, add this line to build.sbt:

    libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"
    

    If using another build tool, check its documentation for adding external libraries.
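
    For reference, a minimal complete build.sbt for an sbt project using this dependency might look like the sketch below (the project name and Scala version here are placeholders, not taken from this article):

    // build.sbt (sketch; adjust the name and Scala version to your setup)
    name := "scholar-scraper"

    scalaVersion := "2.13.12"

    libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"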

    Okay, we're ready to dive into the code!

    Connecting to Google Scholar

    We first need to connect to Google Scholar to get the raw HTML content of the search page:

    // Define the URL of the Google Scholar search page
    val url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>"
    
    // Send a GET request to the URL
    val doc: Document = Jsoup.connect(url)
      .userAgent("Mozilla/5.0...")
      .get()
    

    Here's what's happening in detail:

  • We define the complete URL of the Google Scholar search results page for the query "transformers". This is our target page to scrape.
  • Using Jsoup.connect(), we open a connection to that URL.
  • We set a valid userAgent string to mimic a real web browser, which greatly reduces the chance of being blocked.
  • The .get() makes the actual GET request, downloads the page content, and parses the HTML into a Document object that we can then query.

    So after those lines run, the doc variable contains the entire HTML content of the Google Scholar search page, ready for extraction!

    Note: In web scraping, using a timeout and retry logic is also important in case of errors. We omitted that here for simplicity.
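
    As a rough sketch of what that could look like (this helper is our own illustration, not part of the article's final code), Jsoup's timeout(millis) setting plus a simple retry loop might be written as:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import scala.util.{Failure, Success, Try}

    // Illustrative helper: try the request a few times before giving up
    def fetchWithRetry(url: String, maxRetries: Int = 3): Option[Document] = {
      var attempt = 0
      while (attempt < maxRetries) {
        attempt += 1
        Try {
          Jsoup.connect(url)
            .userAgent("Mozilla/5.0")
            .timeout(10000) // abort a single request after 10 seconds
            .get()
        } match {
          case Success(doc) => return Some(doc)
          case Failure(err) =>
            println(s"Attempt $attempt failed: ${err.getMessage}")
            Thread.sleep(2000) // short pause before retrying
        }
      }
      None
    }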

    Extracting Elements from the Page

    Now that we have the page content, we can use CSS selectors to extract specific elements from the HTML.

    Understanding CSS Selectors

    CSS selectors allow locating elements in the DOM tree based on class names, IDs, hierarchy, attributes and more.

    Some examples:

  • div.results - Finds all <div> tags with class results
  • span#title - Finds the <span> tag with id title
  • a.link[href="/page"] - Finds anchor (<a>) tags of class link whose href attribute equals /page

    We can use these to extract specific pieces of data, as in the short sketch below.
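
    As a tiny, self-contained illustration (the HTML string below is made up purely for this example and is not from Google Scholar), Jsoup applies these selectors like so:

    import org.jsoup.Jsoup

    // A small, made-up HTML snippet just to try the selectors on
    val html =
      """<div class="results">
        |  <span id="title">Hello</span>
        |  <a class="link" href="/page">Read more</a>
        |</div>""".stripMargin

    val sampleDoc = Jsoup.parse(html)

    println(sampleDoc.select("div.results").size())               // 1 matching <div>
    println(sampleDoc.selectFirst("span#title").text())           // prints "Hello"
    println(sampleDoc.select("""a.link[href="/page"]""").size())  // 1 matching <a>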

    Selecting Google Scholar Results

    Inspecting the page source with your browser's developer tools, you can see that each result item is enclosed in a <div> element with the class gs_ri.

    In our code, we therefore first select all of these result blocks:

    val searchResults: Elements = doc.select("div.gs_ri")
    

    This gives us a collection of Element objects representing each individual search result.

    We can now loop through them to extract info from each result:

    for (result: Element <- searchResults.toArray) {
    
      // Extract data from this search result
    
      ...
    
    }
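
    As a side note, since Elements is a java.util.List of Element objects, an arguably more idiomatic alternative to .toArray is the standard Java-to-Scala collection converter (this assumes Scala 2.13; older versions use scala.collection.JavaConverters):

    import scala.jdk.CollectionConverters._

    // Iterate over the results as a regular Scala collection
    for (result <- searchResults.asScala) {
      // Extract data from this search result
    }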
    

    Let's look at how each piece of data is selected:

    Extracting the Title

    // Select the <h3> tag under this result
    val titleElem: Element = result.selectFirst("h3.gs_rt")
    
    // Get the text contents of that h3 element
    val title: String = if(titleElem != null) titleElem.text() else "N/A"
    

    We first use the selector h3.gs_rt to find the <h3> element that contains the title text.

    From that element, we get the .text() which extracts just the text content.

    Extracting the URL

    // Select the <a> tag under the h3 (it may be missing for citation-only results)
    val linkElem: Element = if (titleElem != null) titleElem.selectFirst("a") else null
    val url: String = if (linkElem != null) linkElem.attr("href") else "N/A"
    

    Here we get the anchor (<a>) tag under the <h3> element and extract its href attribute, which contains the article's URL. Since some results (for example, citation-only entries) have no link at all, we check for null before reading the attribute.

    And so on for the authors, publication details, abstract, and other fields: each one has a specific selector to extract it, as in the sketch below.
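
    For example, the author line and the result snippet use the div.gs_a and div.gs_rs selectors, respectively (the same ones used in the full program further down):

    // The author line and publication details are inside div.gs_a
    val authorsElem: Element = result.selectFirst("div.gs_a")
    val authors: String = if (authorsElem != null) authorsElem.text() else "N/A"

    // The snippet/abstract text is inside div.gs_rs
    // (note: "abstract" is a reserved word in Scala, so we call the value abstractText)
    val abstractElem: Element = result.selectFirst("div.gs_rs")
    val abstractText: String = if (abstractElem != null) abstractElem.text() else "N/A"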

    Key Insight: Understanding which CSS selector corresponds to which data field is crucial for successful scraping. We spent time here understanding them since that's where beginners tend to struggle.

    Putting it Together

    We loop through the results, applying the above element selection logic to print out key fields:

    for (result: Element <- searchResults.toArray) {
    
      // Select h3 tag
      val titleElem = ...
    
      // Extract title
      val title = ...
    
      // Extract URL
      val url = ...
    
      // Print output
      println("Title: " + title)
      println("URL: " + url)
    
    }
    

    And at the end we have extracted the information we wanted from every search result!

    The full code can be seen below:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.nodes.Element
    import org.jsoup.select.Elements
    
    object ScholarScraper {
      def main(args: Array[String]): Unit = {
        // Define the URL of the Google Scholar search page
        val url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
        // Send a GET request to the URL
        val doc: Document = Jsoup.connect(url)
          .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
          .get()
    
        // Find all the search result blocks with class "gs_ri"
        val searchResults: Elements = doc.select("div.gs_ri")
    
        // Loop through each search result block and extract information
        for (result: Element <- searchResults.toArray) {
          // Extract the title and URL
          val titleElem: Element = result.selectFirst("h3.gs_rt")
          val title: String = if (titleElem != null) titleElem.text() else "N/A"
          val linkElem: Element = if (titleElem != null) titleElem.selectFirst("a") else null
          val url: String = if (linkElem != null) linkElem.attr("href") else "N/A"
    
          // Extract the authors and publication details
          val authorsElem: Element = result.selectFirst("div.gs_a")
          val authors: String = if (authorsElem != null) authorsElem.text() else "N/A"
    
          // Extract the abstract or description
          val abstractElem: Element = result.selectFirst("div.gs_rs")
          val abstractText: String = if (abstractElem != null) abstractElem.text() else "N/A"
    
          // Print the extracted information
          println("Title: " + title)
          println("URL: " + url)
          println("Authors: " + authors)
          println("Abstract: " + abstract)
          println("-" * 50)  // Separating search results
        }
      }
    }
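
    If you would rather collect the results into an in-memory dataset than print them, a minimal sketch is shown below. The ScholarResult case class and the toResults helper are our own additions (they are not part of the original program or of Jsoup); the selectors are the same ones used above, and scala.jdk.CollectionConverters assumes Scala 2.13.

    import org.jsoup.nodes.Element
    import org.jsoup.select.Elements
    import scala.jdk.CollectionConverters._

    // Our own simple container for one search result
    case class ScholarResult(title: String, url: String, authors: String, abstractText: String)

    def toResults(searchResults: Elements): Seq[ScholarResult] =
      searchResults.asScala.toSeq.map { result =>
        val titleElem: Element = result.selectFirst("h3.gs_rt")
        val linkElem: Element = if (titleElem != null) titleElem.selectFirst("a") else null
        ScholarResult(
          title = if (titleElem != null) titleElem.text() else "N/A",
          url = if (linkElem != null) linkElem.attr("href") else "N/A",
          authors = Option(result.selectFirst("div.gs_a")).map(_.text()).getOrElse("N/A"),
          abstractText = Option(result.selectFirst("div.gs_rs")).map(_.text()).getOrElse("N/A")
        )
      }

    Each ScholarResult can then be written out to CSV, JSON, or a database with whatever library you prefer.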

    This is great as a learning exercise, but it is easy to see that any scraper running from a single IP is prone to getting blocked. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, like the call below, from any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just get the data and parse it in any language like Node or PHP, or with any framework like Puppeteer, Scrapy, or Nutch. In all these cases you can just call the URL with render support like so:

    curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
