Downloading Images from a Website with Rust and scraper

Oct 15, 2023 · 5 min read

In this article, we will learn how to use Rust and the reqwest and scraper crates to download all the images from a Wikipedia page.

---

Overview

The goal is to extract the names, breed groups, local names, and image URLs for all dog breeds listed on this Wikipedia page. We will store the image URLs, download the images and save them to a local folder.

Here are the key steps we will cover:

  1. Add required crates
  2. Send HTTP request to fetch the Wikipedia page
  3. Parse the page HTML using scraper
  4. Find the table with dog breed data
  5. Iterate through the table rows
  6. Extract data from each column
  7. Download images and save locally
  8. Print/process extracted data

Let's go through each of these steps in detail.

Crates

We need these crates:

use reqwest::blocking::{Client, Response};
use scraper::{Html, Selector};
use std::fs::File;
use std::io::Write;

  • reqwest - HTTP client (we use the blocking API, which lives behind a feature flag)
  • scraper - HTML parsing with CSS selectors
  • std::fs / std::io - filesystem handling and the Write trait used by write_all
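Since the blocking API is behind a feature flag, the Cargo.toml dependencies might look like this (the version numbers are assumptions; use whatever is current):

[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.17"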
Send HTTP Request

To download the web page:

let url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
let client = Client::new();

let response: Response = client.get(url)
    .header("User-Agent", "Mozilla/5.0")
    .send()
    .unwrap();

We create a reqwest::blocking::Client and make a GET request, setting a User-Agent header since Wikimedia expects clients to identify themselves.
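Calling unwrap() keeps the example short, but it panics on network failures and silently accepts 4xx/5xx responses. As a minimal sketch, a hypothetical fetch_page helper (not part of the original code) could propagate errors instead:

use std::error::Error;

fn fetch_page(client: &Client, url: &str) -> Result<String, Box<dyn Error>> {
    let body = client.get(url)
        .header("User-Agent", "Mozilla/5.0")
        .send()?              // connection or I/O error
        .error_for_status()?  // turn 4xx/5xx responses into an Err
        .text()?;             // body decoding error
    Ok(body)
}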

Parse HTML

To parse the HTML:

let html = Html::parse_document(&response.text().unwrap());

The scraper::Html struct parses the page content.
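parse_document is lenient: it does not return a Result, and it repairs malformed markup the way a browser would (parse errors are collected in the errors field rather than aborting). A quick sketch with unclosed tags:

let doc = Html::parse_document("<ul><li>one<li>two</ul>");
let li = Selector::parse("li").unwrap();
assert_eq!(doc.select(&li).count(), 2); // both items recovered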

Find Breed Table

We use a CSS selector to find the table:

let selector = Selector::parse("table.wikitable.sortable").unwrap();
let table = html.select(&selector).next().unwrap();

This selects the table element by its CSS classes.
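select() returns an iterator over matches and next() takes the first one, so unwrap() will panic if the page layout ever changes. A hedged variant, assuming we are inside main, could bail out with a message instead:

let table = match html.select(&selector).next() {
    Some(t) => t,
    None => {
        eprintln!("no table.wikitable.sortable found; the page layout may have changed");
        return;
    }
};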

Iterate Through Rows

We loop through the rows:

for row in table.select(&Selector::parse("tr").unwrap()) {
    // Extract data
}

We select all <tr> elements within the table.
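The first <tr> in a wikitable is typically the header row. If you only want data rows, one option (assuming a single header row) is to skip it:

let row_sel = Selector::parse("tr").unwrap();
for row in table.select(&row_sel).skip(1) {
    // data rows only
}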

Extract Column Data

Inside the loop, we extract the column data. Note that next() and nth() consume items from the iterator, so chained index arithmetic quickly goes wrong; collecting the cells into a Vec first keeps the indexing straightforward:

let cells: Vec<_> = row.select(&Selector::parse("td, th").unwrap()).collect();
if cells.len() < 4 {
    continue; // skip rows without enough cells
}

let name = cells[0].text().collect::<String>();
let group = cells[1].text().collect::<String>();

let local_name = cells[2]
    .select(&Selector::parse("span").unwrap())
    .next()
    .map(|e| e.text().collect::<String>())
    .unwrap_or_default();

let photograph = cells[3]
    .select(&Selector::parse("img").unwrap())
    .next()
    .and_then(|img| img.value().attr("src"))
    .unwrap_or_default()
    .to_string();

We use text() to extract text and attr() to read attributes. The src attribute lives on the <img> element inside the fourth cell, not on the cell itself, so we select the img first.
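Selector::parse compiles the selector on every call, so as a small cleanup you can build each selector once before the loop and reuse it. A sketch:

let row_sel = Selector::parse("tr").unwrap();
let cell_sel = Selector::parse("td, th").unwrap();
let span_sel = Selector::parse("span").unwrap();
let img_sel = Selector::parse("img").unwrap();

for row in table.select(&row_sel).skip(1) {
    let cells: Vec<_> = row.select(&cell_sel).collect();
    // ... extract columns as above, using span_sel and img_sel
}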

Download Images

To download and save images:

if !photograph.is_empty() {
    // Wikipedia image URLs are protocol-relative, e.g. "//upload.wikimedia.org/..."
    let image_url = if photograph.starts_with("//") {
        format!("https:{}", photograph)
    } else {
        photograph.clone()
    };

    let image_data = client.get(&image_url).send().unwrap().bytes().unwrap();

    std::fs::create_dir_all("dog_images").unwrap();
    let image_path = format!("dog_images/{}.jpg", name);
    let mut file = File::create(image_path).unwrap();
    file.write_all(&image_data).unwrap();
}

We reuse the reqwest Client to download the image bytes and write them to a file. create_dir_all ensures the dog_images folder exists before the first write.
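Breed names can contain spaces or slashes, which make awkward or invalid file paths. A hypothetical sanitize helper (an assumption for illustration, not part of the original code) might look like:

// Hypothetical helper: keep ASCII alphanumerics, replace everything else with '_'
fn sanitize(name: &str) -> String {
    name.chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect()
}

let image_path = format!("dog_images/{}.jpg", sanitize(name.trim()));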

Store Extracted Data

We store the extracted data in vectors that were declared before the loop:

// Store in vectors
names.push(name);
groups.push(group);
local_names.push(local_name);
photographs.push(photograph);

The vectors can then be processed as needed.
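Parallel vectors work, but collecting each row into a struct is more idiomatic Rust and harder to get out of sync. As a sketch, with a hypothetical DogBreed type (an assumption, not part of the original code):

// Hypothetical record type replacing the four parallel vectors
struct DogBreed {
    name: String,
    group: String,
    local_name: String,
    photograph: String,
}

// Declared before the loop:
let mut breeds: Vec<DogBreed> = Vec::new();

// Inside the loop:
breeds.push(DogBreed { name, group, local_name, photograph });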

And that's it! Here is the full code:

// Imports
use reqwest::blocking::{Client, Response};
use scraper::{Html, Selector};
use std::fs::File;
use std::io::Write;

fn main() {
    // Vectors to store data
    let mut names = Vec::new();
    let mut groups = Vec::new();
    let mut local_names = Vec::new();
    let mut photographs = Vec::new();

    // HTTP client, reused for the page and for every image
    let client = Client::new();

    // Fetch page
    let url = "https://commons.wikimedia.org/wiki/List_of_dog_breeds";
    let response: Response = client.get(url)
        .header("User-Agent", "Mozilla/5.0")
        .send()
        .unwrap();

    // Parse HTML
    let html = Html::parse_document(&response.text().unwrap());

    // Compile each selector once, outside the loop
    let table_sel = Selector::parse("table.wikitable.sortable").unwrap();
    let row_sel = Selector::parse("tr").unwrap();
    let cell_sel = Selector::parse("td, th").unwrap();
    let span_sel = Selector::parse("span").unwrap();
    let img_sel = Selector::parse("img").unwrap();

    // Find table
    let table = html.select(&table_sel).next().unwrap();

    // Make sure the output folder exists
    std::fs::create_dir_all("dog_images").unwrap();

    // Iterate rows, skipping the header row
    for row in table.select(&row_sel).skip(1) {
        // Get cells
        let cells: Vec<_> = row.select(&cell_sel).collect();
        if cells.len() < 4 {
            continue; // skip rows without enough cells
        }

        // Extract data (trim the name: cell text often ends with a newline)
        let name = cells[0].text().collect::<String>().trim().to_string();
        let group = cells[1].text().collect::<String>();

        let local_name = cells[2]
            .select(&span_sel)
            .next()
            .map(|e| e.text().collect::<String>())
            .unwrap_or_default();

        let photograph = cells[3]
            .select(&img_sel)
            .next()
            .and_then(|img| img.value().attr("src"))
            .unwrap_or_default()
            .to_string();

        // Download image; Wikipedia image URLs are protocol-relative
        if !photograph.is_empty() {
            let image_url = if photograph.starts_with("//") {
                format!("https:{}", photograph)
            } else {
                photograph.clone()
            };

            let image_data = client.get(&image_url).send().unwrap().bytes().unwrap();

            // Note: names containing '/' would need sanitizing first (see the sketch above)
            let image_path = format!("dog_images/{}.jpg", name);
            let mut file = File::create(image_path).unwrap();
            file.write_all(&image_data).unwrap();
        }

        // Store data
        names.push(name);
        groups.push(group);
        local_names.push(local_name);
        photographs.push(photograph);
    }

    println!("Extracted {} breeds", names.len());
}

This provides a complete Rust solution using reqwest and scraper to scrape data and images from HTML tables. The same approach can be applied to extract data from many other websites.

While these examples are great for learning, scraping production-level sites can pose challenges like CAPTCHAs, IP blocks, and bot detection. Rotating proxies and automated CAPTCHA solving can help.

Proxies API offers a simple API for rendering pages with built-in proxy rotation, CAPTCHA solving, and evasion of IP blocks. You can fetch rendered pages in any language without configuring browsers or proxies yourself.

This allows scraping at scale without the headaches of IP blocks. Proxies API has a free tier to get started. Check out the API and sign up for an API key to supercharge your web scraping.

With the power of Proxies API combined with Rust crates like reqwest and scraper, you can scrape data at scale without getting blocked.
