Web Scraping New York Times News Headlines in Ruby

Dec 6, 2023 · 5 min read

Web scraping allows you to automatically extract data from websites - it's useful for collecting large volumes of data for analysis. Here we'll scrape article titles and links from the New York Times homepage.

Prerequisites

Before scraping any site, we need:

  • Ruby installed - Ruby 3.0 or newer is recommended, since the Net::HTTP.get_response call below passes request headers, which is only supported from Ruby 3.0 onward
  • The nokogiri gem for parsing HTML - net/http, which we use for sending requests, ships with Ruby's standard library (install command shown after this list)
  • A good IDE like VS Code or Atom to write code
  • These can be installed on Windows, Linux, or macOS.
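
Installing nokogiri is a one-line command (net/http needs no installation, since it is part of the standard library):

    gem install nokogiri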

    Walkthrough

    Here's how the NYTimes scraper works:

    First we require the gems we need:

    require 'net/http'
    require 'nokogiri'
    

    Next we set the target URL to scrape:

    url = 'https://www.nytimes.com/'
    

    We define a user agent header that mimics a browser's request - this helps avoid getting blocked:

    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    

    Using Net::HTTP, we send a GET request to the URL and store the response:

    response = Net::HTTP.get_response(URI(url), headers)
    

    We check if the request was successful via the status code:

    if response.code == "200"
    

    If it was, we parse the HTML using Nokogiri:

    doc = Nokogiri::HTML(response.body)
    

    Inspecting the page

    We now use Chrome's Inspect Element tool to see how the page's HTML is structured.

    You can see that the articles are contained inside section tags with the class story-wrapper.

    We grab article sections using a CSS selector:

    article_sections = doc.css('section.story-wrapper')
    

    We then loop over each section and, within it, find the title and link via more specific selectors:

    title_element = article_section.at_css('h3.indicate-hover')
    
    link_element = article_section.at_css('a.css-9mylee')
    

    If both elements are found, we extract the title text and the link's href attribute, then store them:

    article_title = title_element.text.strip
    article_link = link_element['href']
    article_titles << article_title
    article_links << article_link
    

    Finally, we print the scraped data:

    article_titles.zip(article_links).each do |title, link|
      puts "Title: #{title}"
      puts "Link: #{link}"
    end
    

    And that's the gist of how this scraper works!

    Here is the full code

    require 'net/http'
    require 'nokogiri'
    
    # URL of The New York Times website
    url = 'https://www.nytimes.com/'
    
    # Define a user-agent header to simulate a browser request
    headers = {
      "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    }
    
    # Send an HTTP GET request to the URL
    response = Net::HTTP.get_response(URI(url), headers)
    
    # Check if the request was successful (status code 200)
    if response.code == "200" 
      # Parse the HTML content of the page
      doc = Nokogiri::HTML(response.body)
    
      # Find all article sections with class 'story-wrapper'
      article_sections = doc.css('section.story-wrapper')
    
      # Initialize lists to store the article titles and links
      article_titles = []
      article_links = []
    
      # Iterate through the article sections
      article_sections.each do |article_section|
        # Check if the article title element exists
        title_element = article_section.at_css('h3.indicate-hover')
        
        # Check if the article link element exists  
        link_element = article_section.at_css('a.css-9mylee')
    
        # If both title and link are found, extract and append 
        if title_element && link_element
          article_title = title_element.text.strip
          article_link = link_element['href']
          article_titles << article_title
          article_links << article_link
        end
      end
    
      # Print or process the extracted article titles and links
      article_titles.zip(article_links).each do |title, link|
        puts "Title: #{title}"
        puts "Link: #{link}"
        puts
      end
    
    else
      puts "Failed to retrieve the web page. Status code: #{response.code}"
    end

    Practical Considerations

    Handling errors - We check the status code and handle cases where the page wasn't retrieved properly.
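
    Beyond checking the status code, it also helps to rescue network-level exceptions so a DNS failure or timeout doesn't crash the script. A minimal sketch (the exception list is illustrative, not exhaustive):

    begin
      response = Net::HTTP.get_response(URI(url), headers)
    rescue SocketError, Net::OpenTimeout, Net::ReadTimeout => e
      # DNS failures, connection timeouts and read timeouts all land here
      puts "Request failed: #{e.class} - #{e.message}"
      exit 1
    end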

    Adaptability - The CSS selectors could be tweaked to scrape other parts of articles.
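
    For example, if the homepage marked up article summaries with a dedicated class, you could pull them the same way. The selector below is hypothetical - inspect the live page to find the real class name:

    # Hypothetical selector - 'p.summary-class' stands in for whatever class
    # the summary paragraphs actually use on the live page
    summary_element = article_section.at_css('p.summary-class')
    article_summary = summary_element.text.strip if summary_element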

    Blocking - Rotating user agents helps avoid getting blocked by sites.
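
    A simple way to rotate user agents is to pick one at random for each request. A sketch, using a small illustrative list of UA strings:

    # Illustrative list - in practice you'd maintain a larger pool of real UA strings
    user_agents = [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
      "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
    ]

    headers = { "User-Agent" => user_agents.sample }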

    Legalities - Do check a website's terms before scraping to avoid issues!
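
    A quick first check is the site's robots.txt, which lists the paths the site asks crawlers to avoid. It complements, but does not replace, reading the terms of service:

    require 'net/http'

    # Quick manual check: print the site's robots.txt
    puts Net::HTTP.get(URI('https://www.nytimes.com/robots.txt'))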

    Key Takeaways

  • Web scraping can extract data from sites programmatically, e.g. into CSV/JSON (see the CSV sketch after this list)
  • Tools like the Nokogiri parser and Net::HTTP make scraping straightforward in Ruby
  • Mimicking browsers and rotating user agents helps avoid blocks
  • CSS selectors identify the parts of the page to extract data from
  • Always handle errors and exceptions when scraping sites
  • In more advanced implementations you may even need to rotate the User-Agent string so the website can't tell it's the same browser!
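
    To illustrate the first point, here is a minimal sketch that writes the scraped titles and links to a CSV file (the filename headlines.csv is just an example):

    require 'csv'

    # Write the scraped data to a CSV file - filename is illustrative
    CSV.open('headlines.csv', 'w') do |csv|
      csv << ['Title', 'Link']
      article_titles.zip(article_links).each do |title, link|
        csv << [title, link]
      end
    end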

    Once you get a little more advanced, you will realize that the server can simply block your IP, ignoring all your other tricks. This is a bummer, and it is where most web crawling projects fail.

    Overcoming IP Blocks

    Investing in a private rotating proxy service like Proxies API can make the difference between a web scraping project that gets the job done consistently and without headaches, and one that never really works.

    Plus, with the 1000 free API calls we are currently offering, you have almost nothing to lose by trying our rotating proxy and comparing notes. It takes only one line of integration, so it's hardly disruptive.

    Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high speed rotating proxies located all over the world,
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA solving technology,
  • Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

    The whole thing can be accessed through a simple API, like the one below, from any programming language.

    curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

    We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
