Web Scraping Google Scholar in Go

Jan 21, 2024 · 7 min read

Google Scholar is an excellent source of academic papers and research. In this article, we'll go through code to scrape Google Scholar search results using Go. The code searches for "transformers", then extracts key data like title, URL, authors, and abstract for each search result.


We'll dive deep into how it works - explaining each step clearly for beginners.

Imports

Let's first look at the imports:

import (
    "fmt"
    "log"
    "net/http"
    "strings"
    "github.com/PuerkitoBio/goquery"
)

These provide the key functionality we need:

  • fmt: Print formatted output
  • log: Log errors
  • net/http: Make HTTP requests to web pages
  • strings: String manipulation
  • goquery: Parse and scrape HTML pages
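
goquery is the only third-party package in that list; everything else comes from Go's standard library. If it is not already in your project, you can usually fetch it with:

    go get github.com/PuerkitoBio/goquery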

Main Function

    The main() function handles the key workflow:

    func main() {

        // Define Google Scholar URL
        url := "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="

        // Set User-Agent header
        headers := map[string]string{
            "User-Agent": "Mozilla/5.0...",
        }

        // Build the GET request and attach the headers
        client := &http.Client{}
        req, _ := http.NewRequest("GET", url, nil)
        for key, value := range headers {
            req.Header.Add(key, value)
        }

        // Send the request (error handling omitted here; see the full code below)
        resp, _ := client.Do(req)
        defer resp.Body.Close()

        // Check if status code is 200 OK
        if resp.StatusCode == 200 {

            // Parse HTML using goquery
            doc, _ := goquery.NewDocumentFromReader(resp.Body)

            // Find search results
            doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {

                // Extract data from each search result
                // And print
            })
        }
    }
    

    Let's break this down:

    1. Define URL: We define the Google Scholar URL to search for "transformers".
    2. Set User-Agent header: We define a browser User-Agent header and attach it to the request to mimic a real user.
    3. Make GET request: We build the request with net/http and send it using an http.Client.
    4. Check status code: We check if the status code in the response is 200 OK.
    5. Parse HTML with goquery: If status is OK, we parse the HTML content using goquery.
    6. Find search results: We use a CSS selector to find all search result blocks.
    7. Extract data & print: We loop through each search result block to extract and print data.

    Next we'll dive deep into extracting data from each search result.

    Scraping Each Search Result

    Inspecting the page

    If you inspect the results page in your browser's developer tools, you can see that each result is enclosed in a <div> element with the class gs_ri.

    The key part is using goquery to find search result elements on the page and extract data:

    doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
    
        // Extract title, URL, authors, abstract for each result
    
    })
    

    Let's understand this:

    CSS Selectors

    We use the .Find() function along with a CSS selector to locate elements.

    A CSS selector is like an address that uniquely identifies elements on a web page.

    Here we use selector .gs_ri to find all elements with class="gs_ri". This matches the search result containers.
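
    As a quick, self-contained illustration of how these selectors behave, here is a small sketch that runs goquery against a made-up HTML fragment shaped like a Scholar result (the markup is invented for illustration; only the class names match the real page):

    package main

    import (
        "fmt"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        // Hypothetical HTML that mimics the class names used on the results page
        html := `<div class="gs_ri">
                   <h3 class="gs_rt"><a href="https://example.com/paper">Example paper title</a></h3>
                   <div class="gs_a">A Author, B Author - Example Journal, 2020</div>
                 </div>`

        doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))

        // ".gs_ri" matches by class; "h3.gs_rt" combines a tag name with a class
        doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
            fmt.Println("Title:", s.Find("h3.gs_rt").Text())
            link, _ := s.Find("a").Attr("href")
            fmt.Println("URL:", link)
        })
    }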

    Looping Through Results

    The .Each() function loops through all matching elements, so we can extract data from each search result.

    We get a pointer s to the current search result element.

    Extracting Title & URL

    Let's see how the title and URL are extracted:

    titleElem := s.Find("h3.gs_rt")
    title := titleElem.Text()
    
    url, _ := titleElem.Find("a").Attr("href")
    

    We first find the <h3> element with class gs_rt inside the current search result. This contains the title and the link.

    We use the .Text() function to extract just the text within - i.e. the paper's title.

    For the URL, we dig into the <a> tag inside it and read its href attribute using the .Attr() function.

    So it selects the <h3 class="gs_rt"> heading and the <a> link nested inside it, and extracts the title text and URL separately.
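
    One small refinement: .Attr() actually returns two values - the attribute and a boolean saying whether it exists. The snippet above discards the boolean with _, but you can check it to skip results whose title has no link:

    // Reusing titleElem from the snippet above
    if url, ok := titleElem.Find("a").Attr("href"); ok {
        fmt.Println("URL:", url)
    }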

    Authors & Abstract

    Similarly, authors are extracted using the .gs_a selector:
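
    authorsElem := s.Find("div.gs_a")
    authors := authorsElem.Text()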

    And the abstract using the .gs_rs selector:
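
    abstractElem := s.Find("div.gs_rs")
    abstract := abstractElem.Text()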

    So goquery makes it easy to drill down and extract any data from these elements.

    Printing Output

    Finally, we print out all the extracted info - title, URL, authors, abstract:
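
    fmt.Println("Title:", title)
    fmt.Println("URL:", url)
    fmt.Println("Authors:", authors)
    fmt.Println("Abstract:", abstract)
    fmt.Println(strings.Repeat("-", 50)) // Separator line between results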

    This gives us nicely structured data for each search result.

    The process then repeats for every result on the page as .Each() loops through them.

    Full Code

    For easy reference, here is the complete code:

    package main
    
    import (
    	"fmt"
    	"log"
    	"net/http"
    	"strings"
    
    	"github.com/PuerkitoBio/goquery"
    )
    
    func main() {
    	// Define the URL of the Google Scholar search page
    	url := "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG="
    
    	// Define a User-Agent header
    	headers := map[string]string{
    		"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    	}
    
    	// Build a GET request to the URL and attach the User-Agent header
    	client := &http.Client{}
    	req, err := http.NewRequest("GET", url, nil)
    	if err != nil {
    		log.Fatal(err)
    	}
    	for key, value := range headers {
    		req.Header.Add(key, value)
    	}
    
    	// Send the request
    	resp, err := client.Do(req)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer resp.Body.Close()
    
    	// Check if the request was successful (status code 200)
    	if resp.StatusCode == 200 {
    		// Parse the HTML content of the page using goquery
    		doc, err := goquery.NewDocumentFromReader(resp.Body)
    		if err != nil {
    			log.Fatal(err)
    		}
    
    		// Find all the search result blocks with class "gs_ri"
    		doc.Find(".gs_ri").Each(func(i int, s *goquery.Selection) {
    			// Extract the title and URL
    			titleElem := s.Find("h3.gs_rt")
    			title := titleElem.Text()
    			url, _ := titleElem.Find("a").Attr("href")
    
    			// Extract the authors and publication details
    			authorsElem := s.Find("div.gs_a")
    			authors := authorsElem.Text()
    
    			// Extract the abstract or description
    			abstractElem := s.Find("div.gs_rs")
    			abstract := abstractElem.Text()
    
    			// Print the extracted information
    			fmt.Println("Title:", title)
    			fmt.Println("URL:", url)
    			fmt.Println("Authors:", authors)
    			fmt.Println("Abstract:", abstract)
    			fmt.Println(strings.Repeat("-", 50)) // Separating search results
    		})
    	} else {
    		fmt.Println("Failed to retrieve the page. Status code:", resp.StatusCode)
    	}
    }

    This is great as a learning exercise, but it is easy to see that a scraper like this is prone to getting blocked, since it uses a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

    Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
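
    If you do route requests through a proxy, Go makes this straightforward via http.Transport. Below is a minimal sketch, assuming a hypothetical rotating-proxy endpoint - the address, credentials, and the newProxyClient helper are placeholders, not part of any specific provider's API:

    package main

    import (
        "log"
        "net/http"
        "net/url"
    )

    // newProxyClient returns an http.Client that sends every request through
    // the given proxy endpoint.
    func newProxyClient(proxyAddr string) *http.Client {
        proxyURL, err := url.Parse(proxyAddr)
        if err != nil {
            log.Fatal(err)
        }
        return &http.Client{
            Transport: &http.Transport{
                Proxy: http.ProxyURL(proxyURL), // every request goes via the proxy
            },
        }
    }

    func main() {
        // Placeholder proxy address - substitute whatever your provider gives you
        client := newProxyClient("http://user:pass@proxy.example.com:8080")

        resp, err := client.Get("https://scholar.google.com/scholar?q=transformers")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
        log.Println("status:", resp.Status)
    }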

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, shown below, from any programming language.

    In fact, you don't even have to take on the pain of loading Puppeteer, since we render JavaScript behind the scenes; you can just get the data and parse it in any language or tool like Node, Puppeteer, or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can simply call the API URL (JavaScript rendering included) like so:
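
    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"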

    We have a running offer of 1000 API calls completely free. Register and get your free API Key.

