Scraping Wikipedia Pages with Node.js

Dec 6, 2023 · 6 min read

Wikipedia is a goldmine of structured data on people, places, events and more. However, this data is trapped in HTML markup across thousands of pages. Web scraping provides programmatic access to collect the relevant information from Wikipedia.

In this comprehensive guide, we will walk through a real-world example of scraping a Wikipedia page using Node.js. We will use the npm packages axios and cheerio to demonstrate a common web scraping pattern: sending HTTP requests and parsing DOM elements.

Use Case

Why would you want to scrape Wikipedia? Here are some examples:

  • Aggregate data from multiple article tables into a JSON/CSV dataset
  • Build a database of entities like people, books, films etc.
  • Power a custom search or Q&A application with Wikipedia knowledge

    The use cases are endless. In our case, we will scrape a table with data on all Presidents of the United States.

    Overview

    Here is an overview of the web scraping process:

    1. Send HTTP request to fetch the Wikipedia page HTML
    2. Parse the HTML content using cheerio
    3. Find the element(s) containing relevant data
    4. Extract and transform data into desired structure
    5. Output/store extracted data

    Let's go through each step to scrape president data.

    Setup

    We will use axios for HTTP requests and cheerio for DOM manipulation:

    const axios = require('axios');
    const cheerio = require('cheerio');
    
    Note: Make sure to install these libraries with npm install axios cheerio

    We define the Wikipedia URL to scrape:

    const url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    

    (Screenshot: the sortable table of presidents that we will be scraping.)

    And headers to mimic a browser request:

    const headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    };
    

    This helps avoid bot blocking from Wikipedia servers.

    Step 1: Send HTTP Request

    We use axios to send a GET request to the Wikipedia URL:

    axios.get(url, {headers})
      .then(response => {
        // request succeeded
      })
      .catch(error => {
         console.error("Request failed: ", error);
      });
    

    We handle both the fulfilled and rejected cases of the promise.
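
    If you prefer async/await, the same request can be written this way. This is a minimal sketch with equivalent error handling; the fetchPage name is just illustrative:

    async function fetchPage() {
      try {
        // Send the GET request with our browser-like headers
        const response = await axios.get(url, { headers });
        return response.data; // the raw HTML of the page
      } catch (error) {
        console.error("Request failed: ", error);
        return null;
      }
    }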

    Inside the fulfilled callback, we first check the status code:

    if (response.status === 200) {
      // Success!
    } else {
      console.log("Failed with status: ", response.status);
    }
    

    Status 200 means the request succeeded, and the entire Wikipedia page HTML is now stored in response.data, ready for parsing. Note that axios rejects the promise for non-2xx responses by default, so this check is mostly a safety net.

    Step 2: Parse HTML

    We use cheerio which provides a jQuery style DOM manipulation API.

    const $ = cheerio.load(response.data);
    

    This allows traversing the DOM using CSS selectors and methods like .find() and .text().
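
    For example, a quick sanity check after loading the HTML is to read the article's main heading. On Wikipedia pages this is typically the h1 element with the id firstHeading:

    // Print the page title from the parsed DOM
    const pageTitle = $("h1#firstHeading").text().trim();
    console.log(pageTitle); // e.g. "List of presidents of the United States"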

    Step 3: Find Relevant Data

    We want to extract the tabular data on presidents.

    Inspecting the page in the browser's developer tools, we can see that the table carries the classes wikitable and sortable:

    const table = $("table.wikitable.sortable");
    

    This selects every table on the page that has both classes, not just the first one.
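
    On this page the presidents table happens to be the first match. If you want to pin that down explicitly (a small safeguard, since page layouts change), you could write the selection this way instead:

    // Explicitly take the first matching table
    const table = $("table.wikitable.sortable").first();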

    Step 4: Extract and Transform Data

    We initialize an array to store extracted rows:

    const data = [];
    

    We loop through the table rows, slicing off the header row:

    table.find("tr").slice(1).each((index, row) => {
      // row logic
    });
    

    Inside, we map each td and th cell to its text content:

    const row_data = $(row).find("td, th").map((index, column) => {
      return $(column).text().trim();
    }).get();
    

    And append row data to results array:

    data.push(row_data);
    

    This gives a 2D array storing each row as an array of cell texts.
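
    If you prefer named fields over positional indexes, you can map each row array into an object. The column positions below match the ones printed in Step 5; they are an assumption about the table's current layout, which can change as editors update the page:

    // Assumed column layout: 0 = number, 2 = name, 3 = term,
    // 5 = party, 6 = election, 7 = vice president
    const presidents = data.map(row => ({
      number: row[0],
      name: row[2],
      term: row[3],
      party: row[5],
      election: row[6],
      vicePresident: row[7]
    }));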

    Step 5: Output Data

    Finally, we can print the extracted president data:

    data.forEach(president => {
      console.log("Number:", president[0]);
      console.log("Name:", president[2]);
      // other properties
    });
    

    And now you have a Node.js Wikipedia scraper!
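
    In practice you will usually want to persist the result rather than just log it. Here is a minimal sketch using Node's built-in fs module (the presidents.json filename is arbitrary):

    const fs = require('fs');

    // Write the extracted rows to disk as pretty-printed JSON
    fs.writeFileSync("presidents.json", JSON.stringify(data, null, 2));
    console.log(`Saved ${data.length} rows to presidents.json`);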

    Full Code

    const axios = require('axios');
    const cheerio = require('cheerio');
    
    // Define the URL of the Wikipedia page
    const url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States";
    
    // Define a user-agent header to simulate a browser request
    const headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    };
    
    // Send an HTTP GET request to the URL with the headers
    axios.get(url, { headers })
      .then(response => {
        // Check if the request was successful (status code 200)
        if (response.status === 200) {
          // Load the HTML content of the page using Cheerio
          const $ = cheerio.load(response.data);
    
          // Find the table with the specified class name
          const table = $("table.wikitable.sortable");
    
          // Initialize an empty array to store the table data
          const data = [];
    
          // Iterate through the rows of the table
          table.find("tr").slice(1).each((index, row) => {
            const columns = $(row).find("td, th");
    
            // Extract data from each column and push it to the data array
            const row_data = columns.map((index, column) => {
              return $(column).text().trim();
            }).get();
    
            data.push(row_data);
          });
    
          // Print the scraped data for all presidents
          data.forEach(president_data => {
            console.log("President Data:");
            console.log("Number:", president_data[0]);
            console.log("Name:", president_data[2]);
            console.log("Term:", president_data[3]);
            console.log("Party:", president_data[5]);
            console.log("Election:", president_data[6]);
            console.log("Vice President:", president_data[7]);
            console.log();
          });
        } else {
          console.log("Failed to retrieve the web page. Status code:", response.status);
        }
      })
      .catch(error => {
        console.error("An error occurred:", error);
      });

    The full dataset of presidents is now programmatically available to power further applications.

    Key Takeaways

  • Use axios for HTTP requests and cheerio for DOM parsing
  • Inspect target elements thoroughly before writing logic
  • Handle errors and HTTP statuses properly
  • Map/transform extracted data into desired structure
  • Respect robots.txt and avoid overloading servers
  • Scrape data responsibly!

    Going Further

    Some ways you can extend this:

  • Scrape multiple wiki tables (see the sketch after this list)
  • Store data in database/file formats
  • Enhance with image downloads, conflict handling etc.
  • Create API or custom search engine
  • In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
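
    For the first idea, cheerio makes it easy to iterate over every wikitable on a page rather than just one. A minimal sketch, assuming $ already holds the loaded page:

    // Extract rows from every wikitable on the page
    $("table.wikitable").each((tableIndex, tableEl) => {
      const rows = [];
      $(tableEl).find("tr").slice(1).each((i, row) => {
        rows.push($(row).find("td, th").map((j, cell) => $(cell).text().trim()).get());
      });
      console.log(`Table ${tableIndex} has ${rows.length} rows`);
    });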

    If we get a little more advanced, you will realize that the server can simply block your IP regardless of all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the running offer of 1000 free API calls, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
