Web Scraping Google Scholar in Perl

Jan 21, 2024 · 6 min read

This is the Google Scholar search results page we will be scraping…

Installing Required Perl Modules

To run this script, you need to have Perl installed along with the LWP::UserAgent and HTML::TreeBuilder modules.

To install these:

cpan LWP::UserAgent
cpan HTML::TreeBuilder
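
Alternatively, if you use the App::cpanminus client, both modules can be installed with a single command (this assumes cpanm is already available on your system):

cpanm LWP::UserAgent HTML::TreeBuilder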

Understanding The Code

Below we will walk through what each section of code is doing to scrape Google Scholar.

First we enable strict and warnings (standard good practice in any Perl script) and load the required modules:

use strict;
use warnings;

use LWP::UserAgent;
use HTML::TreeBuilder;

Next we define the URL of the Google Scholar search results page we want to scrape:

my $url = "<https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=>";

Then we create an LWP::UserAgent object whose User-Agent string identifies us to Google as a Chrome browser:

my $ua = LWP::UserAgent->new(
   agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
);
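
LWP::UserAgent also accepts a timeout in seconds, which is worth setting so a stalled request doesn't hang the script indefinitely; a sketch (the 10 seconds here is an arbitrary choice):

my $ua = LWP::UserAgent->new(
   agent   => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
   timeout => 10,  # give up if the server does not respond within 10 seconds
);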

We send a GET request to fetch the Google Scholar page content:

my $response = $ua->get($url);
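
The get call returns an HTTP::Response object, so beyond the body you can also inspect metadata such as the numeric status code and response headers:

print $response->code, "\n";                    # e.g. 200
print $response->header('Content-Type'), "\n";  # e.g. text/html; charset=UTF-8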

We check that the request succeeded (is_success is true for any 2xx status code, such as 200):

if ($response->is_success) {
   # Parse page content
} else {
   # Request failed; status_line gives the code and message, e.g. "403 Forbidden"
   die "Request failed: " . $response->status_line . "\n";
}

If successful, we parse the HTML content using HTML::TreeBuilder:

my $tree = HTML::TreeBuilder->new;
$tree->parse($response->decoded_content);  # decoded_content handles the page's character encoding
$tree->eof;                                # tell the parser the input is complete
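
HTML::TreeBuilder also provides a one-step constructor that parses and finalizes the tree in a single call, which is a handy shorthand:

my $tree = HTML::TreeBuilder->new_from_content($response->decoded_content);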

Extracting Search Results

Inspecting the page

If you inspect the HTML source of the results page, you can see that each search result item is enclosed in a <div> element with the class gs_ri.

Now we get to the key part - extracting information from the search result items.

We locate all search result blocks by their "gs_ri" class:

my @search_results = $tree->look_down(_tag => 'div', class => 'gs_ri');
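
As a quick sanity check, it can help to print how many result blocks matched before looping over them; an empty list usually means Google served a CAPTCHA or block page instead of results:

printf "Found %d search results\n", scalar @search_results;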

We loop through each search result:

foreach my $result (@search_results) {
   # Extract info from $result
}

Inside this loop is where we extract the title, URL, authors, and abstract fields.

Extracting The Title

We get the title text from the "gs_rt" element:

my $title_elem = $result->look_down(_tag => 'h3', class => 'gs_rt');
my $title = $title_elem ? $title_elem->as_text : "N/A";
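
Note that the title text sometimes carries a leading marker such as [BOOK] or [CITATION]. If you want the bare title, a rough cleanup like the following works (adjust the pattern to the markers you actually see in your results):

$title =~ s/^\[[A-Z]+\]\s*//;  # strip a leading "[BOOK]"-style tag, if present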

We also extract the link URL from the <a> tag inside the title. Some results (for example, citation-only entries) have no link at all, so we guard against a missing anchor:

my $link_elem = $title_elem ? $title_elem->look_down(_tag => 'a') : undef;
my $url = $link_elem ? $link_elem->attr('href') : "N/A";

Extracting The Authors

For the authors field, we get the text from the "gs_a" element:

my $authors_elem = $result->look_down(_tag => 'div', class => 'gs_a');
my $authors = $authors_elem ? $authors_elem->as_text : "N/A";
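
The gs_a line typically packs authors, venue and year, and source domain into one string separated by dashes. If you need the pieces individually, a rough split is a starting point (the exact format varies from result to result, so treat this as a sketch):

my ($author_part, $venue_part, $source_part) = split /\s+-\s+/, $authors, 3;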

Extracting The Abstract

Finally, we take the abstract or description from the "gs_rs" element:

my $abstract_elem = $result->look_down(_tag => 'div', class => 'gs_rs');
my $abstract = $abstract_elem ? $abstract_elem->as_text : "N/A";
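
Since as_text can return non-ASCII characters (accented author names, typographic quotes), it is worth telling Perl to emit UTF-8 on standard output before printing, to avoid "Wide character" warnings:

binmode(STDOUT, ':encoding(UTF-8)');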

This shows how each piece of information is extracted from elements in the search result HTML.

Here is the full code that puts this all together:

use strict;
use warnings;

use LWP::UserAgent;
use HTML::TreeBuilder;

binmode(STDOUT, ':encoding(UTF-8)');  # avoid "Wide character" warnings when printing

# Define the URL of the Google Scholar search page
my $url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=transformers&btnG=";

# Define a User-Agent header
my $ua = LWP::UserAgent->new(
    agent => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
);

# Send a GET request to the URL with the User-Agent header
my $response = $ua->get($url);

# Check if the request was successful (status code 200)
if ($response->is_success) {
    # Parse the HTML content of the page using HTML::TreeBuilder
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($response->decoded_content);  # decoded_content handles the page's character encoding
    $tree->eof;

    # Find all the search result blocks with class "gs_ri"
    my @search_results = $tree->look_down(_tag => 'div', class => 'gs_ri');

    # Loop through each search result block and extract information
    foreach my $result (@search_results) {
        # Extract the title and URL
        my $title_elem = $result->look_down(_tag => 'h3', class => 'gs_rt');
        my $title = $title_elem ? $title_elem->as_text : "N/A";
        my $link_elem = $title_elem ? $title_elem->look_down(_tag => 'a') : undef;
        my $url = $link_elem ? $link_elem->attr('href') : "N/A";

        # Extract the authors and publication details
        my $authors_elem = $result->look_down(_tag => 'div', class => 'gs_a');
        my $authors = $authors_elem ? $authors_elem->as_text : "N/A";

        # Extract the abstract or description
        my $abstract_elem = $result->look_down(_tag => 'div', class => 'gs_rs');
        my $abstract = $abstract_elem ? $abstract_elem->as_text : "N/A";

        # Print the extracted information
        print("Title: $title\n");
        print("URL: $url\n");
        print("Authors: $authors\n");
        print("Abstract: $abstract\n");
        print("-" x 50 . "\n");  # Separating search results
    }

    # Free the parse tree once we are done with it (recommended by the HTML::TreeBuilder docs)
    $tree->delete;
} else {
    # status_line includes both the numeric code and the reason phrase
    print("Failed to retrieve the page: " . $response->status_line . "\n");
}
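
If you want to keep the results rather than just print them, here is a minimal sketch of writing one CSV row per result (this assumes the Text::CSV module is installed; the filename is arbitrary):

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
open my $fh, '>:encoding(UTF-8)', 'scholar_results.csv' or die "Cannot open file: $!";
$csv->print($fh, ['title', 'url', 'authors', 'abstract']);  # header row
# ...then, inside the foreach loop, after extracting each field:
$csv->print($fh, [$title, $url, $authors, $abstract]);
close $fh;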

This is great as a learning exercise, but it is easy to see that even a dedicated proxy server is prone to getting blocked, since it still uses a single IP. In a scenario where you need to handle thousands of fetches every day, using a professional rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP-blocked frequently by automatic location, usage, and bot-detection algorithms.
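
Before reaching for a paid service, basic politeness measures can reduce (though not eliminate) blocking. Here is a sketch that waits a random interval between requests and picks a User-Agent string from a small pool; the pool below is illustrative, not a vetted list:

# A small, illustrative pool of User-Agent strings
my @agents = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
);

my $ua = LWP::UserAgent->new(agent => $agents[ int rand @agents ]);
sleep(2 + int rand 4);  # wait 2-5 seconds before the next request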

Our rotating proxy service, Proxies API, provides a simple API that can solve all these IP-blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA-solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with this simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language.

In fact, you don't even have to take the pain of running a headless browser like Puppeteer, since we render JavaScript behind the scenes. You can just fetch the data and parse it in any language like Node, PHP, or Perl, or using any framework like Scrapy or Nutch. In all these cases, you can simply call the URL with render support like so:

    curl "<http://api.proxiesapi.com/?key=API_KEY&render=true&url=https://example.com>"
    
    
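For consistency with the rest of this post, here is the same call from Perl (API_KEY is a placeholder for your own key; uri_escape comes from the URI::Escape module):

use LWP::UserAgent;
use URI::Escape;

my $target  = uri_escape("https://example.com");  # URL-encode the page you want fetched
my $api_url = "http://api.proxiesapi.com/?key=API_KEY&render=true&url=$target";

my $res = LWP::UserAgent->new->get($api_url);
print $res->decoded_content if $res->is_success;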

We have a running offer of 1000 API calls completely free. Register and get your free API Key.
