May 7th, 2020
10 Best Open Source Web Scraping Tools

Cheerio JS

Cheerio is ideal for programmers with experience in jQuery. You can deploy Cheerio JS on the server side to do web scraping easily using jQuery selectors.

Scraping can be as simple as

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')

$('h2.title').text('Hello there!')
$('h2').addClass('welcome')

$.html()
//=> <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>

Website: https://cheerio.js.org/

Beautiful Soup

One aftermath of the Internet Explorer era is how badly formed most HTML on the web is. It's one of the realities you are hit with when you start any web scraping project.

No library wrangles bad HTML as well as Beautiful Soup.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application. It also handles all encoding issues automatically.
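
To get a feel for the API, here is a minimal sketch (assuming the beautifulsoup4 package is installed; the HTML snippet below is made up for illustration) that searches the parse tree for a heading and navigates through the links:

# A minimal sketch, assuming beautifulsoup4 is installed (pip install beautifulsoup4)
# The HTML below is a made-up snippet for illustration
from bs4 import BeautifulSoup

html = """
<html><body>
  <h2 class="title">Hello world</h2>
  <a href="/page1">Page 1</a>
  <a href="/page2">Page 2</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Search the tree for a single element
print(soup.find("h2", class_="title").get_text())  # Hello world

# Navigate the tree and extract attributes from every matching link
for link in soup.find_all("a"):
    print(link["href"], "->", link.get_text())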

Link https://www.crummy.com/software/BeautifulSoup/

Kimura Framework

A brilliantly simple Ruby-based framework that can render JavaScript and comes with headless Chromium and Firefox support out of the box.

Here is how simple it is to work with infinite scroll web pages

# infinite_scroll_spider.rb
require 'kimurai'

class InfiniteScrollSpider < Kimurai::Base
  @name = "infinite_scroll_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://infinite-scroll.com/demo/full-page/"]

  def parse(response, url:, data: {})
    posts_headers_path = "//article/h2"
    count = response.xpath(posts_headers_path).count

    loop do
      browser.execute_script("window.scrollBy(0,10000)") ; sleep 2
      response = browser.current_response

      new_count = response.xpath(posts_headers_path).count
      if count == new_count
        logger.info "> Pagination is done" and break
      else
        count = new_count
        logger.info "> Continue scrolling, current count is #{count}..."
      end
    end

    posts_headers = response.xpath(posts_headers_path).map(&:text)
    logger.info "> All posts from page: #{posts_headers.join('; ')}"
  end
end

InfiniteScrollSpider.crawl!

Link https://github.com/vifreefly/kimuraframework

Import.io

Import.io is a popular enterprise-grade web scraping service.

They help you set up, maintain, and monitor your crawlers, and scrape the data you need.

They also help you visualize data with charts, graphs, and excellent reporting functions.

Link https://www.import.io/

Goutte

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

Requires PHP 7.1 or higher.

Example of submitting a form in Goutte

// Create the Goutte client (requires goutte via Composer)
$client = new Goutte\Client();

// Load the GitHub homepage, follow the "Sign in" link, and submit the login form
$crawler = $client->request('GET', 'https://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));

// Print any error messages shown after the login attempt
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});

Link https://github.com/FriendsOfPHP/Goutte


Scrapy

Scrapy is an extremely powerful crawling and scraping library written in Python.

Here is how easy it is to create a crawler that fetches multiple URLs concurrently and parses them all with a single callback.

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield {"title": h3}

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

For extracting data, it supports both XPath and CSS selectors.

>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

Link here http://doc.scrapy.org/en/latest/

MechanicalSoup

MechanicalSoup is a super simple library that helps you scrape, store and pass cookies, submit forms, and so on, but it doesn't support JavaScript rendering.

Here is an example of submitting a form and scraping the results on DuckDuckGo.

import mechanicalsoup

# Connect to duckduckgo
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://duckduckgo.com/")

# Fill-in the search form
browser.select_form('#search_form_homepage')
browser["q"] = "MechanicalSoup"
browser.submit_selected()

# Display the results
for link in browser.page.select('a.result__a'):
    print(link.text, '->', link.attrs['href'])

Link here https://github.com/MechanicalSoup/MechanicalSoup

PySpider

PySpider is useful if you want to crawl and spider at massive scale. It has a web UI to monitor crawling projects, supports DB integrations out of the box, uses message queues, and comes ready with support for a distributed architecture. This library is a beast.

You can do complex operations like...

Set priorities.

def index_page(self):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)

Set delayed crawls. This one schedules the crawl to run 30 minutes later using the exetime parameter.

import time
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time() + 30*60)

This one automatically recrawls a page every 5 hours.

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)

Link http://docs.pyspider.org/en/latest/

NodeCrawler

This powerful crawling and scraping package for Node.js gives you a server-side DOM with automatic jQuery injection, and has queueing support with configurable pool sizes, priority settings, and rate limit control.

It's great for working around bottlenecks like the rate limits that many websites impose.

Here is an example that does that.

var Crawler = require('crawler');

var c = new Crawler({
    rateLimit: 2000,
    maxConnections: 1,
    callback: function(error, res, done) {
        if(error) {
            console.log(error)
        } else {
            var $ = res.$;
            console.log($('title').text())
        }
        done();
    }
})

// if you want to crawl some website with 2000ms gap between requests
c.queue('http://www.somewebsite.com/page/1')
c.queue('http://www.somewebsite.com/page/2')
c.queue('http://www.somewebsite.com/page/3')

// if you want to crawl some website using proxy with 2000ms gap between requests for each proxy
c.queue({
    uri:'http://www.somewebsite.com/page/1',
    limiter:'proxy_1',
    proxy:'proxy_1'
})
c.queue({
    uri:'http://www.somewebsite.com/page/2',
    limiter:'proxy_2',
    proxy:'proxy_2'
})
c.queue({
    uri:'http://www.somewebsite.com/page/3',
    limiter:'proxy_3',
    proxy:'proxy_3'
})
c.queue({
    uri:'http://www.somewebsite.com/page/4',
    limiter:'proxy_1',
    proxy:'proxy_1'
})

Link http://nodecrawler.org/#basic-usage

Selenium WebDriver

Selenium was built for automating tasks on web browsers but is very effective in web scraping as well.

Here you are controlling the Firefox browser and automating a search query.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located

#This example requires Selenium WebDriver 3.13 or newer
with webdriver.Firefox() as driver:
    wait = WebDriverWait(driver, 10)
    driver.get("https://google.com/ncr")
    driver.find_element_by_name("q").send_keys("cheese" + Keys.RETURN)
    first_result = wait.until(presence_of_element_located((By.CSS_SELECTOR, "h3>div")))
    print(first_result.get_attribute("textContent"))

It's language agnostic, so here is the same thing accomplished using JavaScript.

const {Builder, By, Key, until} = require('selenium-webdriver');

(async function example() {
    let driver = await new Builder().forBrowser('firefox').build();
    try {
        // Navigate to Url
        await driver.get('https://www.google.com');

        // Enter text "cheese" and perform keyboard action "Enter"
        await driver.findElement(By.name('q')).sendKeys('cheese', Key.ENTER);

        let firstResult = await driver.wait(until.elementLocated(By.css('h3>div')), 10000);

        console.log(await firstResult.getAttribute('textContent'));
    }
    finally {
        await driver.quit();
    }
})();

Link https://selenium.dev/documentation/en/

Puppeteer

Puppeteer lives up to its name and comes closest to full-scale browser automation. It can do more or less everything that a human can do.

It can take screenshots, render JavaScript, submit forms, simulate keyboard input, and more.

This example saves the Hacker News (news.ycombinator.com) home page as a PDF in very few lines of code.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});

  await browser.close();
})();

Link https://github.com/puppeteer/puppeteer

Colly

Colly is a super fast, scalable, and extremely popular spider/scraper written in Go.

It supports web crawling, rate limiting, caching, parallel scraping, cookie and session handling, and distributed scraping.

Here is an example of fetching URLs in parallel using a request queue with two consumer threads.

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/queue"
)

func main() {
	url := "https://httpbin.org/delay/1"

	// Instantiate default collector
	c := colly.NewCollector(colly.AllowURLRevisit())

	// create a request queue with 2 consumer threads
	q, _ := queue.New(
		2, // Number of consumer threads
		&queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
	)

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL)
		if r.ID < 15 {
			r2, err := r.New("GET", fmt.Sprintf("%s?x=%v", url, r.ID), nil)
			if err == nil {
				q.AddRequest(r2)
			}
		}
	})

	for i := 0; i < 5; i++ {
		// Add URLs to the queue
		q.AddURL(fmt.Sprintf("%s?n=%d", url, i))
	}
	// Consume URLs
	q.Run(c)

}

Link https://github.com/gocolly/colly
