Web Scraping Google Scholar in Go

Jan 21, 2024 · 7 min read

Google Scholar is an excellent source of academic papers and research. In this article, we'll go through code to scrape Google Scholar search results using Go. The code searches for "transformers", then extracts key data like title, URL, authors, and abstract for each search result.


We'll dive deep into how it works - explaining each step clearly for beginners.

Imports

Let's first look at the imports:

import (
    "fmt"
    "log"
    "net/http"
    "strings"
    "github.com/PuerkitoBio/goquery"
)

These provide the key functionality we need:

  • fmt: Print formatted output
  • log: Log errors
  • net/http: Make HTTP requests to web pages
  • strings: String manipulation
  • goquery: Parse and scrape HTML pages
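
goquery is the only third-party package in that list; everything else comes from Go's standard library. If it is not already in your project, you can usually fetch it with:

    go get github.com/PuerkitoBio/goquery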

Main Function

    The main() function handles the key workflow:

    func main() {

        // Define Google Scholar URL
        url := "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

        // Set User-Agent header
        headers := map[string]string{
            "User-Agent": "Mozilla/5.0...",
        }

        // Build the GET request and attach the headers
        client := &http.Client{}
        req, _ := http.NewRequest("GET", url, nil)
        for key, value := range headers {
            req.Header.Add(key, value)
        }

        // Send the request (error handling omitted here; see the full code below)
        resp, _ := client.Do(req)
        defer resp.Body.Close()

        // Check if status code is 200 OK
        if resp.StatusCode == 200 {

            // Parse HTML using goquery
            doc, _ := goquery.NewDocumentFromReader(resp.Body)

            // Find search results
            doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {

                // Extract data from each search result
                // And print
            })
        }
    }
    

    Let's break this down:

    1. Define URL: We define the Google Scholar URL to search for "transformers".
    2. Set User-Agent header: We define a browser User-Agent header and attach it to the request to mimic a real user.
    3. Make GET request: We build the request with net/http and send it using an http.Client.
    4. Check status code: We check if the status code in the response is 200 OK.
    5. Parse HTML with goquery: If status is OK, we parse the HTML content using goquery.
    6. Find search results: We use a CSS selector to find all search result blocks.
    7. Extract data & print: We loop through each search result block to extract and print data.

    Next we'll dive deep into extracting data from each search result.

    Scraping Each Search Result

    Inspecting the page

    If you inspect the results page in your browser's developer tools, you can see that each result is enclosed in a <div> element with the class gs_ri.

    The key part is using goquery to find search result elements on the page and extract data:

    doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
    
        // Extract title, URL, authors, abstract for each result
    
    })
    

    Let's understand this:

    CSS Selectors

    We use the .Find() function along with a CSS selector to locate elements.

    A CSS selector is like an address that uniquely identifies elements on a web page.

    Here we use selector .gs_ri to find all elements with class="gs_ri". This matches the search result containers.
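
    As a quick, self-contained illustration of how these selectors behave, here is a small sketch that runs goquery against a made-up HTML fragment shaped like a Scholar result (the markup is invented for illustration; only the class names match the real page):

    package main

    import (
        "fmt"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        // Hypothetical HTML that mimics the class names used on the results page
        html := `<div class="gs_ri">
                   <h3 class="gs_rt"><a href="https://example.com/paper">Example paper title</a></h3>
                   <div class="gs_a">A Author, B Author - Example Journal, 2020</div>
                 </div>`

        doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))

        // ".gs_ri" matches by class; "h3.gs_rt" combines a tag name with a class
        doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
            fmt.Println("Title:", s.Find("h3.gs_rt").Text())
            link, _ := s.Find("a").Attr("href")
            fmt.Println("URL:", link)
        })
    }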

    Looping Through Results

    The .Each() function loops through all matching elements, so we can extract data from each search result.

    We get a pointer s to the current search result element.

    Extracting Title & URL

    Let's see how the title and URL are extracted:

    titleElem := s.Find("h3.gs_rt")
    title := titleElem.Text()
    
    url, _ := titleElem.Find("a").Attr("href")
    

    We first find the <h3> element with class gs_rt inside the current search result. This contains the title and the link.

    We use the .Text() function to extract just the text within - i.e. the paper's title.

    For the URL, we dig into the <a> tag inside it and read its href attribute using the .Attr() function.

    So it selects the <h3 class="gs_rt"> heading and the <a> link nested inside it, and extracts the title text and URL separately.
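
    One small refinement: .Attr() actually returns two values - the attribute and a boolean saying whether it exists. The snippet above discards the boolean with _, but you can check it to skip results whose title has no link:

    // Reusing titleElem from the snippet above
    if url, ok := titleElem.Find("a").Attr("href"); ok {
        fmt.Println("URL:", url)
    }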

    Authors & Abstract

    Similarly, authors are extracted using the .gs_a selector:
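
    authorsElem := s.Find("div.gs_a")
    authors := authorsElem.Text()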

    And the abstract using the .gs_rs selector:
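
    abstractElem := s.Find("div.gs_rs")
    abstract := abstractElem.Text()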

    So goquery makes it easy to drill down and extract any data from these elements.

    Printing Output

    Finally, we print out all the extracted info - title, URL, authors, abstract:
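
    fmt.Println("Title:", title)
    fmt.Println("URL:", url)
    fmt.Println("Authors:", authors)
    fmt.Println("Abstract:", abstract)
    fmt.Println(strings.Repeat("-", 50)) // Separator line between results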

    This gives us nicely structured data for each search result.

    The process then repeats for every result on the page as .Each() loops through them.

    Full Code

    For easy reference, here is the complete code:

    package main
    
    import (
    	"fmt"
    	"log"
    	"net/http"
    	"strings"
    
    	"github.com/PuerkitoBio/goquery"
    )
    
    func main() {
    	// Define the URL of the Google Scholar search page
    	url := "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
    	// Define a User-Agent header
    	headers := map[string]string{
    		"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    	}
    
    	// Build a GET request to the URL and attach the User-Agent header
    	client := &http.Client{}
    	req, err := http.NewRequest("GET", url, nil)
    	if err != nil {
    		log.Fatal(err)
    	}
    	for key, value := range headers {
    		req.Header.Add(key, value)
    	}
    
    	// Send the request
    	resp, err := client.Do(req)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer resp.Body.Close()
    
    	// Check if the request was successful (status code 200)
    	if resp.StatusCode == 200 {
    		// Parse the HTML content of the page using goquery
    		doc, err := goquery.NewDocumentFromReader(resp.Body)
    		if err != nil {
    			log.Fatal(err)
    		}
    
    		// Find all the search result blocks with class "gs_ri"
    		doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
    			// Extract the title and URL
    			titleElem := s.Find("h3.gs_rt")
    			title := titleElem.Text()
    			url, _ := titleElem.Find("a").Attr("href")
    
    			// Extract the authors and publication details
    			authorsElem := s.Find("div.gs_a")
    			authors := authorsElem.Text()
    
    			// Extract the abstract or description
    			abstractElem := s.Find("div.gs_rs")
    			abstract := abstractElem.Text()
    
    			// Print the extracted information
    			fmt.Println("Title:", title)
    			fmt.Println("URL:", url)
    			fmt.Println("Authors:", authors)
    			fmt.Println("Abstract:", abstract)
    			fmt.Println(strings.Repeat("-", 50)) // Separating search results
    		})
    	} else {
    		fmt.Println("Failed to retrieve the page. Status code:", resp.StatusCode)
    	}
    }

    This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, since it uses a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
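
    If you do route requests through a proxy, Go makes this straightforward via http.Transport. Below is a minimal sketch, assuming a hypothetical rotating-proxy endpoint - the address, credentials, and the newProxyClient helper are placeholders, not part of any specific provider's API:

    package main

    import (
        "log"
        "net/http"
        "net/url"
    )

    // newProxyClient returns an http.Client that sends every request through
    // the given proxy endpoint.
    func newProxyClient(proxyAddr string) *http.Client {
        proxyURL, err := url.Parse(proxyAddr)
        if err != nil {
            log.Fatal(err)
        }
        return &http.Client{
            Transport: &http.Transport{
                Proxy: http.ProxyURL(proxyURL), // every request goes via the proxy
            },
        }
    }

    func main() {
        // Placeholder proxy address - substitute whatever your provider gives you
        client := newProxyClient("http://user:pass@proxy.example.com:8080")

        resp, err := client.Get("https://scholar.google.com/scholar?q=transformers")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        log.Println("status:", resp.Status)
    }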

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, shown below, from any programming language.

    In fact, you don't even have to take on the pain of loading Puppeteer, since we render JavaScript behind the scenes; you can just get the data and parse it in any language or tool like Node, Puppeteer, or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can simply call the API URL (JavaScript rendering included) like so:
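
    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"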

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.

