Scraping All Images from a Website with R

Dec 13, 2023 · 8 min read

The first step is to load the R libraries that we will need to perform the web scraping:

library(rvest)
library(httr)
library(stringr)

The key libraries are:

  • rvest: For parsing and extracting data from HTML and XML
  • httr: For sending HTTP requests to web pages
  • stringr: For handling strings
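
If any of these packages are missing on your machine, a one-time install takes care of it. A quick sketch:

# Install any of the three packages that are not already available
needed <- c("rvest", "httr", "stringr")
missing <- needed[!needed %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
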
Defining the URL and Headers

    Next we need to specify the URL of the web page that contains the images we want to scrape:

    url <- 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
    

    We are scraping images of dog breeds from a Wikipedia page.

    This is the page we are talking about: it lists dog breeds in a table, with a photograph for each breed.

    When scraping web pages, it is good practice to define a custom user agent header. This helps simulate a real browser request so the server will respond properly:

    headers <- c(
      `User-Agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    )
    

    Here we are setting a Chrome browser user agent.

    Sending the HTTP Request

    To download the web page content, we can send an HTTP GET request using the httr package:

    response <- httr::GET(url, httr::add_headers(headers))
    

    This will fetch the contents of the specified url and store the response in the response object.
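
    Before going further, it can help to take a quick look at what came back. As a small sketch, httr exposes the status code, content type, and headers of the response:

    # Peek at the response metadata
    httr::status_code(response)               # e.g. 200 on success
    httr::http_type(response)                 # should be "text/html" for a web page
    httr::headers(response)[["content-type"]]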

    Checking the Response Status

    It's good practice to check that the request succeeded before trying to parse the response. We can check the status code:

    if (httr::status_code(response) == 200) {
      # Request succeeded logic
    } else {
      # Failed request handling
    }
    

    A status code of 200 means the request was successful. Other codes indicate an error.
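
    If you prefer to fail fast instead of branching on the code yourself, httr also has helpers for this; a minimal sketch:

    # Stop with an informative error if the request did not succeed
    httr::stop_for_status(response)

    # Or get a human-readable summary of the status
    httr::http_status(response)$message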

    Parsing the HTML

    Since the request succeeded, we can parse the HTML content using rvest:

    page <- read_html(httr::content(response, "text"))
    

    The page object now contains the parsed HTML document.
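
    A quick sanity check, sketched below, is to pull the page title out of the parsed document and confirm it looks like the page we expect:

    # The <title> of the page confirms we parsed the right document
    page %>%
      html_node('title') %>%
      html_text()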

    Finding the Data Table

    Inspecting the page

    Using the Chrome inspect tool, you can see that the data is in a table element with the classes wikitable and sortable.

    We can use XPath to find that table element:

    table_node <- page %>%
      html_node(xpath = '//*[@class="wikitable sortable"]')

    rows <- html_nodes(table_node, 'tr')
    

    Let's break this down:

  • html_node() returns the first node matching the XPath selector
  • //*[@class="wikitable sortable"] selects elements whose class attribute is "wikitable sortable"
  • html_nodes(table_node, 'tr') collects the table's rows as HTML nodes

    We keep the rows as nodes rather than converting the table to a data frame with html_table(), because we need access to the <span> and <img> tags inside each cell. A quick check of what we found is sketched below.
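
    For example, the number of rows and the header cell text tell us whether we grabbed the right table (the exact column titles depend on the live page):

    # How many rows did we find? The first row is the header.
    length(rows)

    # The header cells show which column holds which field
    html_text(html_nodes(rows[[1]], 'th'), trim = TRUE)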

    Initializing Data Storage

    As we scrape data from the table, we need variables to accumulate the results:

    names <- character()
    groups <- character()
    local_names <- character()
    photographs <- character()
    

    Empty character vectors are created to store the breed names, groups, local names, and image URLs as we extract them.
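
    Growing vectors with c() inside a loop is fine for a table of this size. For much larger scrapes, a common alternative (sketched here, not part of the original script) is to collect one small data frame per row and bind them at the end:

    # Alternative pattern for large scrapes: collect rows in a list, bind once at the end
    results <- list()

    # inside the loop:
    #   results[[length(results) + 1]] <- data.frame(name, group, local_name, photograph)

    # after the loop:
    #   breeds <- do.call(rbind, results)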

    Iterating Through the Table Rows

    To scrape the data from each row, we can iterate through the table:

    for (i in 2:length(rows)) {
      cells <- html_nodes(rows[[i]], 'td')

      # Extract data for each dog breed
    }

    This skips the header row and processes each data row, storing the row's <td> cells in cells.
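
    To see what one of these rows actually contains, it can help to print the cell text of the first data row before writing the extraction logic; a quick sketch:

    # Peek at the text of each cell in the first data row
    html_text(html_nodes(rows[[2]], 'td'), trim = TRUE)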

    Extracting Data from Each Column

    Now for the most complex part, extracting each data field from the table cells:

    # Column 1: Name
    name <- html_text(cells[[1]], trim = TRUE)

    # Column 2: Group
    group <- html_text(cells[[2]], trim = TRUE)

    # Check column 3 for a <span> tag
    span_tag <- html_nodes(cells[[3]], 'span')
    local_name <- if (length(span_tag) > 0) html_text(span_tag[[1]], trim = TRUE) else ''

    # Check column 4 for an <img>
    img_tag <- html_nodes(cells[[4]], 'img')
    photograph <- if (length(img_tag) > 0) html_attr(img_tag[[1]], 'src') else ''
    

    As you can see, each column requires different logic to extract the text or attributes. Let's break it down:

    Name Column:

    The name is the text of the first cell. We grab it with:

    name <- html_text(cells[[1]], trim = TRUE)
    

    Group Column:

    The group is also plain text, taken from the second cell:

    group <- html_text(cells[[2]], trim = TRUE)
    

    Local Name Column:

    For local names, we first check whether the third cell contains a <span> tag:

    span_tag <- html_nodes(cells[[3]], 'span')
    

    If found, we extract its text:

    local_name <- if (length(span_tag) > 0) html_text(span_tag[[1]], trim = TRUE) else ''
    

    Photograph Column:

    Finally, for the photo we check whether the fourth cell contains an <img> tag:

    img_tag <- html_nodes(cells[[4]], 'img')
    

    If yes, we grab its source URL attribute:

    photograph <- if (length(img_tag) > 0) html_attr(img_tag[[1]], 'src') else ''
    

    This logic carefully handles all the edge cases that can appear when scraping semi-structured HTML.
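
    If you find yourself repeating this pattern, it can be wrapped in a small helper. The function below is a sketch (the name cell_attr is ours, not part of rvest) that returns an attribute of the first matching tag inside a cell, or an empty string if the tag is absent:

    # Hypothetical helper: attribute of the first matching tag inside a cell, or ''
    cell_attr <- function(cell, css, attr) {
      node <- html_nodes(cell, css)
      if (length(node) > 0) html_attr(node[[1]], attr) else ''
    }

    # Usage: photograph <- cell_attr(cells[[4]], 'img', 'src')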

    Downloading and Saving Images

    With the image URLs extracted, we can now download and save the photos:

    if (photograph != '') {
      # Download image
      # Save to file
    }
    

    The code checks that we have a valid photograph URL before proceeding.

    We won't walk through every line of the download code here; the complete version appears in the full script below.
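
    As a condensed sketch of that step: Wikipedia serves protocol-relative image URLs (starting with //), so we prepend https: before downloading and then write the raw bytes to disk (assuming a dog_images folder exists, as created in the full script):

    # Resolve the protocol-relative URL and save the image bytes
    image_url <- if (startsWith(photograph, '//')) paste0('https:', photograph) else photograph
    image_response <- httr::GET(image_url, httr::add_headers(headers))
    if (httr::status_code(image_response) == 200) {
      writeBin(httr::content(image_response, "raw"),
               file.path('dog_images', paste0(name, '.jpg')))
    }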

    Printing the Extracted Data

    Finally, to print out the scraped data:

    for (i in seq_along(names)) {
      cat("Name:", names[i], "\n")
      cat("FCI Group:", groups[i], "\n")
      cat("Local Name:", local_names[i], "\n")
      cat("Photograph:", photographs[i], "\n")
      cat("\n")
    }
    

    This iterates through each record and prints the extracted fields.
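
    Instead of (or in addition to) printing, the four parallel vectors can be combined into a single data frame, which is usually easier to work with afterwards; a small sketch:

    # Combine the parallel vectors into one table of results
    breeds <- data.frame(
      name = names,
      group = groups,
      local_name = local_names,
      photograph = photographs,
      stringsAsFactors = FALSE
    )
    head(breeds)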

    Handling Errors

    The code also contains logic to handle errors:

    } else {
      cat("Failed to retrieve the web page. Status code:", httr::status_code(response), "\n")
    }
    

    If the HTTP request failed, it prints an error message with the status code.
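
    Status codes only cover responses the server actually sends. Network-level failures (timeouts, DNS errors) surface as R errors instead, so a more defensive version wraps the request in tryCatch; a sketch:

    # Catch network-level errors (timeouts, DNS failures) as well as bad status codes
    response <- tryCatch(
      httr::GET(url, httr::add_headers(headers), httr::timeout(30)),
      error = function(e) {
        cat("Request failed:", conditionMessage(e), "\n")
        NULL
      }
    )
    if (!is.null(response) && httr::status_code(response) == 200) {
      # parse as before
    }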

    Full Code

    # Load the required libraries
    library(rvest)
    library(httr)
    library(stringr)
    
    # URL of the Wikipedia page
    url <- 'https://commons.wikimedia.org/wiki/List_of_dog_breeds'
    
    # Define a user-agent header to simulate a browser request
    headers <- c(
      `User-Agent` = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    )
    
    # Send an HTTP GET request to the URL with the headers
    response <- httr::GET(url, httr::add_headers(headers))
    
    # Check if the request was successful (status code 200)
    if (httr::status_code(response) == 200) {
      # Parse the HTML content of the page
      page <- read_html(httr::content(response, "text"))
    
      # Find the table with class 'wikitable sortable' and collect its rows
      table_node <- page %>%
        html_node(xpath = '//*[@class="wikitable sortable"]')
      rows <- html_nodes(table_node, 'tr')
    
      # Initialize character vectors to store the data
      names <- character()
      groups <- character()
      local_names <- character()
      photographs <- character()
    
      # Create a folder to save the images
      dir.create('dog_images', showWarnings = FALSE)
    
      # Iterate through the rows in the table (skip the header row)
      for (i in 2:length(rows)) {
        # Get the <td> cells of the current row
        cells <- html_nodes(rows[[i]], 'td')
        if (length(cells) < 4) next  # skip rows without the expected four columns

        # Extract data from each column
        name <- html_text(cells[[1]], trim = TRUE)
        group <- html_text(cells[[2]], trim = TRUE)

        # Check if the third column contains a span element
        span_tag <- html_nodes(cells[[3]], 'span')
        local_name <- if (length(span_tag) > 0) html_text(span_tag[[1]], trim = TRUE) else ''

        # Check for the existence of an image tag within the fourth column
        img_tag <- html_nodes(cells[[4]], 'img')
        photograph <- if (length(img_tag) > 0) html_attr(img_tag[[1]], 'src') else ''
        
        # Download the image and save it to the folder
        if (photograph != '') {
          # Image src values on Wikipedia are protocol-relative ('//...'), so add the scheme
          image_url <- if (startsWith(photograph, '//')) paste0('https:', photograph) else photograph
          image_response <- httr::GET(image_url, httr::add_headers(headers))
          if (httr::status_code(image_response) == 200) {
            # Build a file name from the breed name, dropping characters that are awkward in paths
            safe_name <- gsub('[^A-Za-z0-9 _-]', '', name)
            image_filename <- file.path('dog_images', paste0(safe_name, '.jpg'))
            writeBin(httr::content(image_response, "raw"), image_filename)
          }
        }
        
        # Append data to respective lists
        names <- c(names, name)
        groups <- c(groups, group)
        local_names <- c(local_names, local_name)
        photographs <- c(photographs, photograph)
      }
    
      # Print or process the extracted data as needed
      for (i in seq_along(names)) {
        cat("Name:", names[i], "\n")
        cat("FCI Group:", groups[i], "\n")
        cat("Local Name:", local_names[i], "\n")
        cat("Photograph:", photographs[i], "\n")
        cat("\n")
      }
    
    } else {
      cat("Failed to retrieve the web page. Status code:", httr::status_code(response), "\n")
    }

    In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser making the requests.
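
    A simple way to do that, sketched below, is to keep a small pool of User-Agent strings and pick one at random for each request (the strings here are just examples):

    # A small pool of example User-Agent strings to rotate through
    user_agents <- c(
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
      'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'
    )

    # Pick a random one per request
    headers <- c(`User-Agent` = sample(user_agents, 1))
    response <- httr::GET(url, httr::add_headers(headers))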

    Go a little further, though, and you will find that the server can simply block your IP address, ignoring all your other tricks. This is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a web scraping project that gets the job done consistently and one that never really works.

    Plus, with the current offer of 1000 free API calls, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of code to integrate, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

    Hundreds of our customers have solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
