Web Scraping All The Images From a Website in Node.js

Dec 13, 2023 · 6 min read

Web scraping can automate data collection from websites. In this comprehensive tutorial, we'll scrape a Wikimedia Commons page to extract dog breed information and images using Node.js.

This is the page we'll be working with: the List of dog breeds table on Wikimedia Commons.

Getting Set Up

We'll use these Node modules:

npm install axios cheerio

const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');

• axios makes HTTP requests
• cheerio parses HTML and provides jQuery-style querying of the DOM
• fs provides file system methods (it's built into Node, so it needs no install)

Defining the Target URL

We'll scrape the List of dog breeds page hosted on Wikimedia Commons:

const url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

We'll also define a User-Agent header to mimic a real browser request:

const headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
};

Making the HTTP Request

Let's use axios to fetch the page content:

axios.get(url, {headers})
  .then(response => {

    // Request succeeded, let's scrape!

  })
  .catch(error => {

    // Handle errors

  });

On success, the full HTML is in response.data.
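
If you prefer async/await to promise chains, the same request can be sketched like this (fetchPage is just an illustrative name, not part of the final script):

async function fetchPage() {
  try {
    // Same request as above; the full HTML ends up in response.data
    const response = await axios.get(url, { headers });
    return response.data;
  } catch (error) {
    console.error('Request failed:', error.message);
  }
}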

Loading and Parsing with Cheerio

To query and manipulate DOM elements, we first load the raw HTML into cheerio:

const $ = cheerio.load(response.data);

The $ gives us jQuery-style DOM selectors.
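
For example, you can sanity-check that the page loaded by printing its title (a throwaway line, not part of the final script):

// Quick sanity check: print the page's <title> text
console.log($('title').text());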

Extracting the Main Data Table

Inspecting the page

Using your browser's inspect tool (Chrome DevTools, for example), you can see that the data sits in a table element with the classes wikitable and sortable.

We can reference it like this:

const table = $('table.wikitable.sortable');

Initializing Data Arrays

Let's initialize arrays to store extracted data:

const names = [];
const groups = [];
const localNames = [];
const photographs = [];

We'll also create a folder to save images:

if (!fs.existsSync('dog_images')) {
  fs.mkdirSync('dog_images');
}
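
As a side note, newer versions of Node let you skip the existence check entirely by passing the recursive option, which makes mkdirSync a no-op when the folder already exists:

// Equivalent one-liner: creates the folder only if it doesn't exist yet
fs.mkdirSync('dog_images', { recursive: true });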

Scraping Row Data

We can loop through the rows and use selectors to extract cell data:

$('tr', table).each((index, row) => {

  const columns = $('td, th', row);

  // Process only rows with exactly four cells (this skips the header row)
  if (columns.length === 4) {

    const name = $('a', columns.eq(0)).text();
    const group = columns.eq(1).text();

    const localName = $('span', columns.eq(2)).text() || '';

    const img = columns.eq(3).find('img');
    const photograph = img.attr('src') || '';

    // Extract and store data

  }

});

Key selectors:

  • tr - gets all table rows
  • td, th - gets the cells in each row
  • .text() - extracts text content
  • .find() - searches descendants (here, for the image tag)
  • .attr() - reads an attribute (here, the image src URL)
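
One gotcha worth flagging: Wikimedia pages typically serve img src values as protocol-relative URLs (starting with //), which axios can't fetch on its own. A small normalization step, shown here as a sketch under that assumption, turns them into full URLs:

// Wikimedia img src attributes are often protocol-relative ("//upload.wikimedia.org/...")
// Prefix with "https:" so axios gets an absolute URL (assumes this page's markup)
const src = img.attr('src') || '';
const photograph = src.startsWith('//') ? 'https:' + src : src;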
Downloading and Saving Images

For each image link we find, we can download the image file:

if (photograph) {
  axios.get(photograph, { responseType: 'arraybuffer' })
    .then(response => {
      fs.writeFileSync(`dog_images/${name}.jpg`, response.data);
    })
    .catch(error => {
      console.error('Failed to download image:', error.message);
    });
}

So for each row, we've now extracted the key data and images into organized arrays and files!

From here you might:

  • Further process the data
  • Export to CSV or JSON (see the sketch below)
  • Insert it into a database
  • Train machine learning models
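
As one example of the export step, here's a minimal sketch that zips the parallel arrays into objects and writes them out as JSON (breeds.json is an illustrative filename):

// Combine the parallel arrays into an array of breed objects and save as JSON
const breeds = names.map((name, i) => ({
  name,
  group: groups[i],
  localName: localNames[i],
  photograph: photographs[i],
}));
fs.writeFileSync('breeds.json', JSON.stringify(breeds, null, 2));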
Here is the full code:

const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');

// URL of the Wikimedia Commons page
const url = 'https://commons.wikimedia.org/wiki/List_of_dog_breeds';

// Define a user-agent header to simulate a browser request
const headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
};

// Send an HTTP GET request to the URL with the headers
axios.get(url, { headers })
    .then(response => {
        if (response.status === 200) {
            // Load the HTML content of the page using cheerio
            const $ = cheerio.load(response.data);

            // Find the table with class 'wikitable sortable'
            const table = $('table.wikitable.sortable');

            // Initialize arrays to store the data
            const names = [];
            const groups = [];
            const localNames = [];
            const photographs = [];

            // Create a folder to save the images
            if (!fs.existsSync('dog_images')) {
                fs.mkdirSync('dog_images');
            }

            // Iterate through rows in the table (skip the header row)
            $('tr', table).each((index, row) => {
                const columns = $('td, th', row);
                if (columns.length === 4) {
                    // Extract data from each column
                    const name = $('a', columns.eq(0)).text().trim();
                    const group = columns.eq(1).text().trim();

                    // Check if the third column contains a span element
                    const spanTag = columns.eq(2).find('span');
                    const localName = spanTag.text().trim() || '';

                    // Check for an image tag within the fourth column;
                    // Wikimedia often serves protocol-relative src URLs ("//upload..."),
                    // so prefix them with "https:" before downloading
                    const imgTag = columns.eq(3).find('img');
                    const src = imgTag.attr('src') || '';
                    const photograph = src.startsWith('//') ? 'https:' + src : src;

                    // Download the image and save it to the folder
                    if (photograph) {
                        axios.get(photograph, { responseType: 'arraybuffer' })
                            .then(imageResponse => {
                                if (imageResponse.status === 200) {
                                    const imageFilename = `dog_images/${name}.jpg`;
                                    fs.writeFileSync(imageFilename, imageResponse.data);
                                }
                            })
                            .catch(error => {
                                console.error('Failed to download image:', error.message);
                            });
                    }

                    // Append data to the respective arrays
                    names.push(name);
                    groups.push(group);
                    localNames.push(localName);
                    photographs.push(photograph);
                }
            });

            // Print or process the extracted data as needed
            for (let i = 0; i < names.length; i++) {
                console.log("Name:", names[i]);
                console.log("FCI Group:", groups[i]);
                console.log("Local Name:", localNames[i]);
                console.log("Photograph:", photographs[i]);
                console.log();
            }
        } else {
            console.log("Failed to retrieve the web page. Status code:", response.status);
        }
    })
    .catch(error => {
        console.error("Error:", error);
    });
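
To try it out, save the script (scrape_dogs.js is just an illustrative filename) and run it with Node:

node scrape_dogs.js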

In more advanced implementations, you may even need to rotate the User-Agent string so the website can't tell the requests are coming from the same browser!
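
A minimal sketch of that idea, assuming you maintain your own pool of User-Agent strings, is to pick one at random per request:

// Pick a random User-Agent for each request
// (the strings below are examples; keep your own up-to-date pool)
const userAgents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
];
const randomHeaders = {
  "User-Agent": userAgents[Math.floor(Math.random() * userAgents.length)]
};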

If we get a little more advanced, you'll realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

Overcoming IP Blocks

Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of code to integrate, so it's hardly disruptive.

Our rotating proxy server, Proxies API, provides a simple API that can solve all IP-blocking problems instantly:

  • Millions of high-speed rotating proxies located all over the world
  • Automatic IP rotation
  • Automatic User-Agent string rotation (which simulates requests from different, valid web browsers and web browser versions)
  • Automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with this simple API.

The whole thing can be accessed through a simple API, like the one below, from any programming language:

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
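
The same call from Node with axios would look like this (API_KEY stays your placeholder key):

// Equivalent request from Node; the target URL is passed as a query parameter
axios.get('http://api.proxiesapi.com/', {
  params: { key: 'API_KEY', url: 'https://example.com' }
}).then(res => console.log(res.data));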

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
