Web Scraping Google Scholar in Scala

Jan 21, 2024 · 7 min read

In this article, we'll walk through a full code example of scraping search results data from Google Scholar using Scala and the Jsoup library.

Even as a beginner, by the end of this article you'll understand:

  • How to connect to a webpage and get its raw HTML content
  • How to use CSS selectors to extract elements of interest
  • How to loop through multiple search results to extract information

This will provide a foundation for building your own web scrapers to gather data for any purpose.

    This is the Google Scholar result page we are talking about…

    Setting up the Environment

    Because we'll be using external libraries, there is some setup required before running the code:

    Install Scala

    If you don't already have Scala installed on your machine, you'll need to:

    1. Download Scala from https://www.scala-lang.org/download/
    2. Follow the installation instructions for your operating system

    Get the Jsoup Dependency

    We use the Jsoup Java library to connect to webpages and parse the HTML content.

    You'll need to add this dependency to your Scala project. If using SBT, add this line to build.sbt:

    libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"
    

    If using another build tool, check its documentation for adding external libraries.
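
    For reference, a minimal complete build.sbt for an sbt project using this dependency might look like the sketch below (the project name and Scala version here are placeholders, not taken from this article):

    // build.sbt (sketch; adjust the name and Scala version to your setup)
    name := "scholar-scraper"

    scalaVersion := "2.13.12"

    libraryDependencies += "org.jsoup" % "jsoup" % "1.14.3"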

    Okay, we're ready to dive into the code!

    Connecting to Google Scholar

    We first need to connect to Google Scholar to get the raw HTML content of the search page:

    // Define the URL of the Google Scholar search page
    val url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>"
    
    // Send a GET request to the URL
    val doc: Document = Jsoup.connect(url)
      .userAgent("Mozilla/5.0...")
      .get()
    

    Here's what's happening in detail:

  • We define the complete URL of the Google Scholar search results page for the query "transformers". This is our target page to scrape.
  • Using Jsoup.connect(), we open a connection to that URL.
  • We set a valid userAgent string to mimic a real web browser, which greatly reduces the chance of being blocked.
  • The .get() makes the actual GET request, downloads the page content, and parses the HTML into a Document object that we can then query.

    So after those lines run, the doc variable contains the entire HTML content of the Google Scholar search page, ready for extraction!

    Note: In web scraping, using a timeout and retry logic is also important in case of errors. We omitted that here for simplicity.
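
    As a rough sketch of what that could look like (this helper is our own illustration, not part of the article's final code), Jsoup's timeout(millis) setting plus a simple retry loop might be written as:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import scala.util.{Failure, Success, Try}

    // Illustrative helper: try the request a few times before giving up
    def fetchWithRetry(url: String, maxRetries: Int = 3): Option[Document] = {
      var attempt = 0
      while (attempt < maxRetries) {
        attempt += 1
        Try {
          Jsoup.connect(url)
            .userAgent("Mozilla/5.0")
            .timeout(10000) // abort a single request after 10 seconds
            .get()
        } match {
          case Success(doc) => return Some(doc)
          case Failure(err) =>
            println(s"Attempt $attempt failed: ${err.getMessage}")
            Thread.sleep(2000) // short pause before retrying
        }
      }
      None
    }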

    Extracting Elements from the Page

    Now that we have the page content, we can use CSS selectors to extract specific elements from the HTML.

    Understanding CSS Selectors

    CSS selectors allow locating elements in the DOM tree based on class names, IDs, hierarchy, attributes and more.

    Some examples:

  • div.results - Finds all <div> tags with class results
  • span#title - Finds the <span> tag with id title
  • a.link[href="/page"] - Finds anchor (<a>) tags of class link whose href attribute equals /page

    We can use these to extract specific pieces of data, as in the short sketch below.
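
    As a tiny, self-contained illustration (the HTML string below is made up purely for this example and is not from Google Scholar), Jsoup applies these selectors like so:

    import org.jsoup.Jsoup

    // A small, made-up HTML snippet just to try the selectors on
    val html =
      """<div class="results">
        |  <span id="title">Hello</span>
        |  <a class="link" href="/page">Read more</a>
        |</div>""".stripMargin

    val sampleDoc = Jsoup.parse(html)

    println(sampleDoc.select("div.results").size())               // 1 matching <div>
    println(sampleDoc.selectFirst("span#title").text())           // prints "Hello"
    println(sampleDoc.select("""a.link[href="/page"]""").size())  // 1 matching <a>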

    Selecting Google Scholar Results

    Inspecting the page source with your browser's developer tools, you can see that each result item is enclosed in a <div> element with the class gs_ri.

    In our code, we therefore first select all of these result blocks:

    val searchResults: Elements = doc.select("div.gs_ri")
    

    This gives us a collection of Element objects representing each individual search result.

    We can now loop through them to extract info from each result:

    for (result: Element <- searchResults.toArray) {
    
      // Extract data from this search result
    
      ...
    
    }
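
    As a side note, since Elements is a java.util.List of Element objects, an arguably more idiomatic alternative to .toArray is the standard Java-to-Scala collection converter (this assumes Scala 2.13; older versions use scala.collection.JavaConverters):

    import scala.jdk.CollectionConverters._

    // Iterate over the results as a regular Scala collection
    for (result <- searchResults.asScala) {
      // Extract data from this search result
    }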
    

    Let's look at how each piece of data is selected:

    Extracting the Title

    // Select the <h3> tag under this result
    val titleElem: Element = result.selectFirst("h3.gs_rt")
    
    // Get the text contents of that h3 element
    val title: String = if(titleElem != null) titleElem.text() else "N/A"
    

    We first use the selector h3.gs_rt to find the <h3> element that contains the title text.

    From that element, we get the .text() which extracts just the text content.

    Extracting the URL

    // Select the <a> tag under the h3 (it may be missing for citation-only results)
    val linkElem: Element = if (titleElem != null) titleElem.selectFirst("a") else null
    val url: String = if (linkElem != null) linkElem.attr("href") else "N/A"
    

    Here we get the anchor (<a>) tag under the <h3> element and extract its href attribute, which contains the article's URL. Since some results (for example, citation-only entries) have no link at all, we check for null before reading the attribute.

    And so on for the authors, publication details, abstract, and other fields: each one has a specific selector to extract it, as in the sketch below.
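
    For example, the author line and the result snippet use the div.gs_a and div.gs_rs selectors, respectively (the same ones used in the full program further down):

    // The author line and publication details are inside div.gs_a
    val authorsElem: Element = result.selectFirst("div.gs_a")
    val authors: String = if (authorsElem != null) authorsElem.text() else "N/A"

    // The snippet/abstract text is inside div.gs_rs
    // (note: "abstract" is a reserved word in Scala, so we call the value abstractText)
    val abstractElem: Element = result.selectFirst("div.gs_rs")
    val abstractText: String = if (abstractElem != null) abstractElem.text() else "N/A"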

    Key Insight: Understanding which CSS selector corresponds to which data field is crucial for successful scraping. We spent time here understanding them since that's where beginners tend to struggle.

    Putting it Together

    We loop through the results, applying the above element selection logic to print out key fields:

    for (result: Element <- searchResults.toArray) {
    
      // Select h3 tag
      val titleElem = ...
    
      // Extract title
      val title = ...
    
      // Extract URL
      val url = ...
    
      // Print output
      println("Title: " + title)
      println("URL: " + url)
    
    }
    

    And at the end we have extracted the information we wanted from every search result!

    The full code can be seen below:

    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import org.jsoup.nodes.Element
    import org.jsoup.select.Elements
    
    object ScholarScraper {
      def main(args: Array[String]): Unit = {
        // Define the URL of the Google Scholar search page
        val url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
        // Send a GET request to the URL
        val doc: Document = Jsoup.connect(url)
          .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36")
          .get()
    
        // Find all the search result blocks with class "gs_ri"
        val searchResults: Elements = doc.select("div.gs_ri")
    
        // Loop through each search result block and extract information
        for (result: Element <- searchResults.toArray) {
          // Extract the title and URL
          val titleElem: Element = result.selectFirst("h3.gs_rt")
          val title: String = if (titleElem != null) titleElem.text() else "N/A"
          val linkElem: Element = if (titleElem != null) titleElem.selectFirst("a") else null
          val url: String = if (linkElem != null) linkElem.attr("href") else "N/A"
    
          // Extract the authors and publication details
          val authorsElem: Element = result.selectFirst("div.gs_a")
          val authors: String = if (authorsElem != null) authorsElem.text() else "N/A"
    
          // Extract the abstract or description
          val abstractElem: Element = result.selectFirst("div.gs_rs")
          val abstractText: String = if (abstractElem != null) abstractElem.text() else "N/A"
    
          // Print the extracted information
          println("Title: " + title)
          println("URL: " + url)
          println("Authors: " + authors)
          println("Abstract: " + abstract)
          println("-" * 50)  // Separating search results
        }
      }
    }
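
    If you would rather collect the results into an in-memory dataset than print them, a minimal sketch is shown below. The ScholarResult case class and the toResults helper are our own additions (they are not part of the original program or of Jsoup); the selectors are the same ones used above, and scala.jdk.CollectionConverters assumes Scala 2.13.

    import org.jsoup.nodes.Element
    import org.jsoup.select.Elements
    import scala.jdk.CollectionConverters._

    // Our own simple container for one search result
    case class ScholarResult(title: String, url: String, authors: String, abstractText: String)

    def toResults(searchResults: Elements): Seq[ScholarResult] =
      searchResults.asScala.toSeq.map { result =>
        val titleElem: Element = result.selectFirst("h3.gs_rt")
        val linkElem: Element = if (titleElem != null) titleElem.selectFirst("a") else null
        ScholarResult(
          title = if (titleElem != null) titleElem.text() else "N/A",
          url = if (linkElem != null) linkElem.attr("href") else "N/A",
          authors = Option(result.selectFirst("div.gs_a")).map(_.text()).getOrElse("N/A"),
          abstractText = Option(result.selectFirst("div.gs_rs")).map(_.text()).getOrElse("N/A")
        )
      }

    Each ScholarResult can then be written out to CSV, JSON, or a database with whatever library you prefer.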

    This is great as a learning exercise, but it is easy to see that any scraper running from a single IP is prone to getting blocked. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, like the call below, from any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes; you can just get the data and parse it in any language like Node or PHP, or with any framework like Puppeteer, Scrapy, or Nutch. In all these cases you can just call the URL with render support like so:

    curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"
    
    

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.
