Web Scraping Google Scholar in Java

Jan 21, 2024 · 7 min read

Web scraping is a technique for automatically extracting information from websites. In this comprehensive tutorial, we'll walk through an example Java program that scrapes search results data from Google Scholar.

This is the Google Scholar result page we are talking about…

Specifically, we'll learn how to use the popular Jsoup Java library to connect to Google Scholar, send search queries, and scrape key bits of data - title, URL, authors, and abstract text - from the search results pages.

Prerequisites

To follow along with the code examples below, you'll need:

  • Java and a text editor / IDE set up for writing and running Java code
  • Jsoup library added to your Java project. Jsoup handles connecting to web pages and selecting page elements. More details on getting set up with Jsoup here: https://jsoup.org/download
  • That's it! Jsoup handles most of the heavy lifting, so we can focus on the fun data extraction parts.

    Walkthrough of the Web Scraper Code

    Let's break the scraper down section by section.

    Imports

    We import Jsoup classes that allow connecting to web pages and selecting elements:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    

    Define URL and User-Agent

    Next we define the Google Scholar URL we want to scrape along with a common User-Agent header:

    // Define the URL of the Google Scholar search page
    String url = "<https://scholar.google.com/scholar?hl=en&as\\_sdt=0%2C5&q=transformers&btnG=>";
    
    // Define a User-Agent header
    String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    

    Quick web scraping tip - impersonating a real browser's User-Agent helps avoid bot detection.
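
    Beyond the User-Agent, Jsoup lets you attach other request headers and tune the connection before calling get(). The header and timeout values below are illustrative choices, not requirements:

    // Mimic a regular browser request a little more closely (values are examples)
    Document document = Jsoup.connect(url)
            .userAgent(userAgent)
            .header("Accept-Language", "en-US,en;q=0.9") // example header
            .referrer("https://www.google.com/")         // example referrer
            .timeout(10_000)                             // 10 second timeout
            .get();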

    Connect to URL and Select Elements

    Inspecting the page in your browser's developer tools, you can see that each search result item is enclosed in a div element with the class gs_ri.

    The magic happens in this section where we:

    1. Use Jsoup to connect to the Google Scholar URL
    2. Select all search result elements on the page with select("div.gs_ri")
    // Send a GET request to the URL with the User-Agent header
    Document document = Jsoup.connect(url).userAgent(userAgent).get();
    
    // Find all the search result blocks with class "gs_ri"
    Elements searchResults = document.select("div.gs_ri");
    

    Let's break this down...

    The Jsoup connect() method reaches out to the web page and downloads the HTML content. We pass our URL and User-Agent to avoid bot checks.

    This HTML is stored in a Document variable that we query to extract data.
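
    As a quick, optional sanity check you can print the title of the fetched Document to confirm the request returned the search results page rather than a block or consent page:

    // Confirm we actually received the search results page
    System.out.println("Page title: " + document.title());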

    The document.select() line is where we select elements from the page. Here we target the search result div tags with CSS class gs_ri using this selector syntax:

    div.gs_ri
    

    All matching elements get stored in an Elements collection that we can now loop through.

    Pro tip: Install browser developer tools to inspect elements and test selectors.
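
    It also helps to check how many result blocks the selector actually matched; an empty collection usually means the markup changed or the request was blocked. A minimal check could look like this:

    // Bail out early if the selector matched nothing
    if (searchResults.isEmpty()) {
        System.err.println("No results found - selector out of date or request blocked?");
    } else {
        System.out.println("Found " + searchResults.size() + " result blocks");
    }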

    Extract Data from Search Results

    With search result elements selected, we can traverse each one and extract the inner text and attributes:

    // Loop through each search result block and extract information
    for (Element result : searchResults) {
    
      // Extract the title and URL
      Element titleElement = result.selectFirst("h3.gs_rt");
      String title = titleElement != null ? titleElement.text() : "N/A";
      String resultUrl = titleElement != null ? titleElement.selectFirst("a").attr("href") : "N/A";
    
      // Extract the authors and publication details
      Element authorsElement = result.selectFirst("div.gs_a");
      String authors = authorsElement != null ? authorsElement.text() : "N/A";
    
      // Extract the abstract or description
      Element abstractElement = result.selectFirst("div.gs_rs");
      String abstractText = abstractElement != null ? abstractElement.text() : "N/A";
    
      // Print the extracted information
      System.out.println("Title: " + title);
      System.out.println("URL: " + resultUrl);
      System.out.println("Authors: " + authors);
      System.out.println("Abstract: " + abstractText);
      System.out.println("-".repeat(50)); // Separating search results
    }
    

    We loop through each previously selected div.gs_ri element and extract data by targeting specific child tags:

  • Title - Select the h3 tag with class gs_rt, get .text()
  • URL - Get the anchor tag within the title element and read its href attribute (see the note after this list about relative links)
  • Authors - Select the div with class gs_a, get .text()
  • Abstract - Select the div with class gs_rs, get .text()

    The scraped pieces of data are printed, with each search result separated by dashes.
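
    One optional tweak: attr("href") returns the link exactly as it appears in the HTML, so a relative link would come back relative. Jsoup's absUrl() resolves the value against the page URL instead. A more defensive version of the URL extraction could look like this:

    // Resolve the link against the page URL so relative hrefs become absolute
    Element link = titleElement != null ? titleElement.selectFirst("a") : null;
    String resultUrl = link != null ? link.absUrl("href") : "N/A";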

    And that's it! The full code connects to Google Scholar, scrapes results, and extracts key pieces of data from each one.

    Let's quickly summarize the key concepts:

  • Use Jsoup to connect to web pages - handling sessions, cookies, and headers
  • Select elements with CSS-style selectors
  • Extract data - text, attributes, HTML
  • Loop through many elements for mass data scraping

    This core scraper recipe can be adapted to pull data from almost any site; one way to package the extracted fields for reuse is sketched below.
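
    The sketch below collects the fields into a small value type instead of printing them. It assumes Java 16+ (for local records), the type name is illustrative, and it needs java.util.List and java.util.ArrayList imports on top of the Jsoup ones:

    // A tiny value type to hold one scraped result (Java 16+ local record)
    record ScholarResult(String title, String url, String authors, String abstractText) {}

    List<ScholarResult> results = new ArrayList<>();
    for (Element result : searchResults) {
        Element titleElement = result.selectFirst("h3.gs_rt");
        Element authorsElement = result.selectFirst("div.gs_a");
        Element abstractElement = result.selectFirst("div.gs_rs");

        results.add(new ScholarResult(
                titleElement != null ? titleElement.text() : "N/A",
                titleElement != null ? titleElement.selectFirst("a").attr("href") : "N/A",
                authorsElement != null ? authorsElement.text() : "N/A",
                abstractElement != null ? abstractElement.text() : "N/A"));
    }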

    Full Java Code for Scraping Google Scholar

    Here is the complete code example for scraping search results data from Google Scholar:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.IOException;
    
    public class GoogleScholarScraper {
    
        public static void main(String[] args) {
            // Define the URL of the Google Scholar search page
            String url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";
    
            // Define a User-Agent header
            String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";
    
            try {
                // Send a GET request to the URL with the User-Agent header
                Document document = Jsoup.connect(url).userAgent(userAgent).get();
    
                // Find all the search result blocks with class "gs_ri"
                Elements searchResults = document.select("div.gs_ri");
    
                // Loop through each search result block and extract information
                for (Element result : searchResults) {
                    // Extract the title and URL
                    Element titleElement = result.selectFirst("h3.gs_rt");
                    String title = titleElement != null ? titleElement.text() : "N/A";
                    String resultUrl = titleElement != null ? titleElement.selectFirst("a").attr("href") : "N/A";
    
                    // Extract the authors and publication details
                    Element authorsElement = result.selectFirst("div.gs_a");
                    String authors = authorsElement != null ? authorsElement.text() : "N/A";
    
                    // Extract the abstract or description
                    Element abstractElement = result.selectFirst("div.gs_rs");
                    String abstractText = abstractElement != null ? abstractElement.text() : "N/A";
    
                    // Print the extracted information
                    System.out.println("Title: " + title);
                    System.out.println("URL: " + resultUrl);
                    System.out.println("Authors: " + authors);
                    System.out.println("Abstract: " + abstractText);
                    System.out.println("-".repeat(50)); // Separating search results
                }
            } catch (IOException e) {
                System.err.println("Failed to retrieve the page. Error: " + e.getMessage());
            }
        }
    }

    This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked because every request comes from a single IP. If you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
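
    If you only need a stopgap, Jsoup can route its requests through a single HTTP proxy directly; the host and port below are placeholders you would replace with your own proxy:

    // Route the request through a single HTTP proxy (host and port are placeholders)
    Document document = Jsoup.connect(url)
            .proxy("proxy.example.com", 8080)
            .userAgent(userAgent)
            .get();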

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation,
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions), and
  • With our automatic CAPTCHA solving technology,

    hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes. You can just get the data and parse it in any language like Node, Puppeteer, or PHP, or with any framework like Scrapy or Nutch. In all these cases you simply call the URL with render support enabled.
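
    For instance, in Java the call could look like the sketch below. It uses the same key and url query parameters as the curl sample further down; API_KEY is a placeholder, and any JavaScript-rendering flag should be taken from the provider's documentation rather than this example:

    // Fetch a target page through the Proxies API endpoint (API_KEY is a placeholder)
    // Requires java.net.URLEncoder and java.nio.charset.StandardCharsets imports
    String target = "https://scholar.google.com/scholar?hl=en&q=transformers";
    String apiUrl = "http://api.proxiesapi.com/?key=API_KEY&url="
            + URLEncoder.encode(target, StandardCharsets.UTF_8);

    Document page = Jsoup.connect(apiUrl).get();
    System.out.println(page.title());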

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.

    The easiest way to do Web Scraping

    Get HTML from any page with a simple API call. We handle proxy rotation, browser identities, automatic retries, CAPTCHAs, JavaScript rendering, and more automatically for you.


    Try ProxiesAPI for free

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    <!doctype html>
    <html>
    <head>
        <title>Example Domain</title>
        <meta charset="utf-8" />
        <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1" />
    ...
