Scraping All Images from a Website with Elixir

Dec 13, 2023 · 8 min read

Here is a step-by-step guide to scraping a website for images using Elixir. This article walks through code that scrapes dog breed information and images from a Wikimedia Commons page, to help beginners understand the key concepts.

This is the page we are talking about: https://commons.wikimedia.org/wiki/List_of_dog_breeds

Overview

The goal of this scraper is to extract dog breed names, details like categories and local names, and images from a Wikimedia Commons page listing hundreds of breeds.

It will:

  1. Retrieve the web page content
  2. Parse the page to extract information
  3. Download all images of dog breeds
  4. Save images and print extracted data

The code uses the Elixir programming language along with several libraries:

  • :httpc - Erlang's built-in HTTP client, for making HTTP requests to get web page content
  • :inets and :ssl - the OTP applications that must be started before :httpc can make HTTP(S) requests
  • Floki - to parse HTML and extract data
  • File - to save images and data to files
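
Floki is the only third-party dependency here; :httpc, :inets, :ssl, and File all ship with Erlang/OTP and Elixir. A minimal mix.exs sketch for pulling in Floki might look like this (the version constraint is just an example):

    # In your project's mix.exs (hypothetical version constraint)
    defp deps do
      [
        {:floki, "~> 0.34"}
      ]
    end

Then run mix deps.get before compiling the scraper.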
    Retrieving the Web Page

    The first step is to retrieve the content of the web page that contains the data we want to scrape.

    The get_page/2 function makes an HTTP GET request to the URL using Erlang's built-in :httpc module:

    defp get_page(url, headers) do
      # :httpc expects charlists for the URL and for header names/values
      charlist_headers =
        Enum.map(headers, fn {key, value} ->
          {String.to_charlist(key), String.to_charlist(value)}
        end)

      case :httpc.request(:get, {String.to_charlist(url), charlist_headers}, [], body_format: :binary) do
        {:ok, {{_, 200, _}, _, body}} ->
          {:ok, body}
        {:ok, {{_, status_code, _}, _, _}} ->
          {:error, status_code}
        {:error, reason} ->
          {:error, reason}
      end
    end
    

    This makes the request, checks the status code, and if a 200 OK response is received, returns the page body.

    The headers contain a user agent string to identify the scraper to the server.

    The start function calls this getter, handling any errors:

    case get_page(@url, headers) do
      {:ok, body} ->
        # parse page
      {:error, reason} ->
        IO.puts("Failed to retrieve the web page. Status code: #{reason}")
    end
    

    So at this point if successful, the body contains the full HTML of the web page.
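
    One detail worth knowing before trying this yourself: :httpc lives in the :inets application, and HTTPS requests also need :ssl to be started. A quick iex sketch of fetching the page directly (the output shown is illustrative):

    # :httpc is part of :inets; :ssl is required for https URLs
    iex> :inets.start()
    :ok
    iex> :ssl.start()
    :ok
    iex> url = ~c"https://commons.wikimedia.org/wiki/List_of_dog_breeds"
    iex> headers = [{~c"User-Agent", ~c"Mozilla/5.0"}]
    iex> {:ok, {{_, status, _}, _headers, _body}} = :httpc.request(:get, {url, headers}, [], [])
    iex> status
    200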

    Parsing the Page

    Inspecting the page

    You can see, when you use the Chrome inspect tool, that the data is in a table element with the classes wikitable and sortable.

    Selecting the Table

    We use the Floki.find/2 function to locate this table:

    table = Floki.find(document, "table.wikitable.sortable")
    

    The table variable now contains the HTML representation of the table we want to scrape data from.

    Iterating Through Rows

    Inside the table, data is organized in rows, with each row containing information about a specific dog breed. We use a loop to iterate through these rows and extract relevant data:

    for row <- tl(Floki.find(table, "tr")) do
      # Extract data from the row
    end
    

    The tl/1 function is used to skip the table header row, as it doesn't contain the data we need.

    Extracting Data from Columns

    Within each row, data is stored in columns. We use Floki.find/2 to locate and extract data from these columns. Each row contains four columns: Name, FCI Group, Local Name, and Photograph.

    columns = Floki.find(row, "td,th")

    name = columns |> hd() |> Floki.find("a") |> Floki.text() |> String.trim()
    group = columns |> Enum.at(1) |> Floki.text() |> String.trim()

    local_name =
      case Floki.find(Enum.at(columns, 2), "span") do
        [] -> ""
        [span | _] -> span |> Floki.text() |> String.trim()
      end

    img_tag = Floki.find(Enum.at(columns, 3), "img")

    photograph =
      case img_tag do
        [] -> ""
        [img | _] -> img |> Floki.attribute("src") |> List.first("")
      end
    

    Here's what each extraction step does:

  • name: Extracts the breed's name by locating an anchor tag in the first column and trimming any extra spaces.
  • group: Extracts the FCI Group from the second column and trims extra spaces.
  • local_name: Extracts the Local Name from the third column (if available) by targeting a <span> tag.
  • photograph: Extracts the Photograph URL from the fourth column by finding an <img> tag and retrieving its "src" attribute.
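
    Note that Floki.attribute/2 always returns a list of values, even for a single matching node, which is why the code above takes the first element. A quick illustrative example (the node here is made up):

    iex> img = {"img", [{"src", "//upload.wikimedia.org/example.jpg"}], []}
    iex> Floki.attribute(img, "src")
    ["//upload.wikimedia.org/example.jpg"]
    iex> Floki.attribute(img, "src") |> List.first("")
    "//upload.wikimedia.org/example.jpg"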
    Downloading Images

    After extracting image sources, we can download the actual image data:

    We reuse Erlang's :httpc module to fetch each image by URL.

    If successful, we write the image binary data to a file using the breed's name and the File module.

    The save_images/1 function coordinates calling this for every image URL extracted earlier.
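
    A condensed sketch of those two helpers (they appear again in the full listing below); one detail worth calling out is that image src attributes on Wikimedia Commons are protocol-relative, so we prepend https: before requesting them:

    defp download_image(photograph, name) do
      # src values look like "//upload.wikimedia.org/...", so add a scheme first
      url = if String.starts_with?(photograph, "//"), do: "https:" <> photograph, else: photograph

      case get_image(url) do
        {:ok, image_data} -> File.write("dog_images/#{name}.jpg", image_data)
        _ -> IO.puts("Failed to download image: #{photograph}")
      end
    end

    defp get_image(url) do
      # body_format: :binary returns the raw image bytes
      case :httpc.request(:get, {String.to_charlist(url), []}, [], body_format: :binary) do
        {:ok, {{_, 200, _}, _, body}} -> {:ok, body}
        _ -> {:error, "Failed to download image"}
      end
    end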

    Saving and Printing Output

    Finally, save_images/1 stores images while print_data/1 prints out all extracted breed data for debugging and verification.

    The full code can be seen below, showing how these pieces fit together into a complete scraper:

    defmodule DogBreedsScraper do
      @url "https://commons.wikimedia.org/wiki/List_of_dog_breeds"

      def start do
        # :httpc lives in the :inets application; :ssl is needed for https URLs
        :inets.start()
        :ssl.start()

        headers = [
          {"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
        ]

        case get_page(@url, headers) do
          {:ok, body} ->
            case parse_page(body) do
              {:ok, data} ->
                save_images(data)
                print_data(data)
              {:error, reason} ->
                IO.puts("Failed to parse the page: #{reason}")
            end
          {:error, reason} ->
            IO.puts("Failed to retrieve the web page. Status code: #{reason}")
        end
      end

      defp get_page(url, headers) do
        # :httpc expects charlists for the URL and for header names/values
        charlist_headers =
          Enum.map(headers, fn {key, value} ->
            {String.to_charlist(key), String.to_charlist(value)}
          end)

        case :httpc.request(:get, {String.to_charlist(url), charlist_headers}, [], body_format: :binary) do
          {:ok, {{_, 200, _}, _, body}} ->
            {:ok, body}
          {:ok, {{_, status_code, _}, _, _}} ->
            {:error, status_code}
          {:error, reason} ->
            {:error, reason}
        end
      end

      defp parse_page(body) do
        case Floki.parse_document(body) do
          {:ok, document} ->
            table = Floki.find(document, "table.wikitable.sortable")

            # Build one {name, group, local_name, photograph} tuple per data row,
            # skipping the header row and any row without four columns
            rows =
              for row <- tl(Floki.find(table, "tr")),
                  columns = Floki.find(row, "td,th"),
                  length(columns) == 4 do
                name = columns |> hd() |> Floki.find("a") |> Floki.text() |> String.trim()
                group = columns |> Enum.at(1) |> Floki.text() |> String.trim()

                local_name =
                  case Floki.find(Enum.at(columns, 2), "span") do
                    [] -> ""
                    [span | _] -> span |> Floki.text() |> String.trim()
                  end

                photograph =
                  case Floki.find(Enum.at(columns, 3), "img") do
                    [] -> ""
                    [img | _] -> img |> Floki.attribute("src") |> List.first("")
                  end

                {name, group, local_name, photograph}
              end

            names = Enum.map(rows, &elem(&1, 0))
            groups = Enum.map(rows, &elem(&1, 1))
            local_names = Enum.map(rows, &elem(&1, 2))
            photographs = Enum.map(rows, &elem(&1, 3))

            {:ok, {names, groups, local_names, photographs}}
          _ ->
            {:error, "Failed to parse the page"}
        end
      end

      defp download_image("", _name), do: :ok

      defp download_image(photograph, name) do
        # Wikimedia image URLs are protocol-relative ("//upload.wikimedia.org/..."),
        # so prepend a scheme before requesting them
        url =
          if String.starts_with?(photograph, "//"),
            do: "https:" <> photograph,
            else: photograph

        case get_image(url) do
          {:ok, image_data} ->
            image_filename = "dog_images/#{name}.jpg"
            File.write(image_filename, image_data)
          _ ->
            IO.puts("Failed to download image: #{photograph}")
        end
      end

      defp get_image(url) do
        # body_format: :binary returns the raw image bytes
        case :httpc.request(:get, {String.to_charlist(url), []}, [], body_format: :binary) do
          {:ok, {{_, 200, _}, _, body}} ->
            {:ok, body}
          _ ->
            {:error, "Failed to download image"}
        end
      end

      defp save_images({names, _groups, _local_names, photographs}) do
        File.mkdir_p("dog_images")

        Enum.zip(names, photographs)
        |> Enum.each(fn {name, photograph} -> download_image(photograph, name) end)
      end

      defp print_data({names, groups, local_names, photographs}) do
        [names, groups, local_names, photographs]
        |> Enum.zip()
        |> Enum.each(fn {name, group, local_name, photograph} ->
          IO.puts("Name: #{name}")
          IO.puts("FCI Group: #{group}")
          IO.puts("Local Name: #{local_name}")
          IO.puts("Photograph: #{photograph}")
          IO.puts("")
        end)
      end
    end
    
    # Start the scraping process
    DogBreedsScraper.start()
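
    To run this as a standalone script without a full Mix project, one option (assuming Elixir 1.12 or later) is to add a Mix.install call at the top of the file and run it with the elixir command:

    # Top of dog_breeds_scraper.exs (hypothetical filename), then run: elixir dog_breeds_scraper.exs
    Mix.install([{:floki, "~> 0.34"}])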

    In more advanced implementations, you will even need to rotate the User-Agent string so the website can't tell it's the same browser!
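
    A minimal sketch of what that could look like, assuming a hand-maintained list of agent strings (the strings and helper name below are just placeholders):

    # Hypothetical helper: pick a different User-Agent for each request
    @user_agents [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15"
    ]

    defp random_headers do
      [{"User-Agent", Enum.random(@user_agents)}]
    end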

    If we get a little bit more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and this is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can, most of the time, make the difference between a successful, headache-free web scraping project that gets the job done consistently and one that never really works.

    Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by using our rotating proxy and comparing notes. It only takes one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology

    Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed by a simple API like below in any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
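
    The same call from Elixir, using the :httpc setup shown earlier (API_KEY is a placeholder for your own key):

    :inets.start()
    :ssl.start()

    # Fetch a target page through the rotating proxy endpoint (API_KEY is a placeholder)
    target = ~c"http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"
    {:ok, {{_, 200, _}, _headers, body}} = :httpc.request(:get, {target, []}, [], [])
    IO.puts(body)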

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
