Overcoming CAPTCHAs When Web Scraping with PHP

Feb 20, 2024 · 4 min read

Web scraping can be a useful technique for collecting data from websites. However, many sites use CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) to prevent scraping by bots. In this guide, we'll explore methods for handling CAPTCHAs when scraping with PHP.

What is a CAPTCHA and Why Do Sites Use Them?

A CAPTCHA is a type of challenge-response test used to determine whether a user is human. CAPTCHAs often involve reading distorted text or identifying images. They aim to block bots and scrapers while letting human users through.

Sites use CAPTCHAs to prevent abuse of their services, such as mass ticket purchases on a ticketing site or spam sent through a webmail provider. CAPTCHAs can be annoying for legitimate human users, but they remain an effective bot deterrent.

Approaches for Bypassing CAPTCHAs When Scraping

When you encounter CAPTCHAs while scraping, there are a few approaches you can try:

Use a CAPTCHA Solving Service

Several online services offer human and machine-learning powered CAPTCHA solving for a fee. They provide an API you can send the CAPTCHA image or audio to and get back the solved response. This allows you to incorporate CAPTCHA solving into your scraper code.

Some popular CAPTCHA solving services to check out include Anti-Captcha and DeathByCaptcha.

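Here's a sketch of how a call might look, using a hypothetical AntiCaptchaClient wrapper around Anti-Captcha's API:

// AntiCaptchaClient is a placeholder wrapper class, not an official SDK class;
// substitute the client library you actually use.
$client = new AntiCaptchaClient("your_api_key");

// download the CAPTCHA image and base64-encode the raw bytes for the API
$captcha_image_base64 = base64_encode(file_get_contents($captcha_url));

// send the image and get back the solved text
$solved_text = $client->solveCaptcha($captcha_image_base64);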
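
The snippet above leaves the client's internals unspecified. Most solving services work in two steps: you submit a task, then poll until a worker returns the answer. The following is a minimal sketch of that pattern, not a definitive implementation. The endpoint paths and field names (createTask, getTaskResult, clientKey, ImageToTextTask) reflect my reading of Anti-Captcha's JSON API and should be checked against the current documentation; the function names solveImageCaptcha and curlJsonPost are made up for illustration.

```php
<?php
// Sketch only: endpoint and field names are assumptions to verify against
// Anti-Captcha's current API documentation.

/**
 * $post is any callable that sends a JSON POST and returns the decoded
 * response as an array: function (string $url, array $payload): array
 */
function solveImageCaptcha(callable $post, string $apiKey, string $imageBytes,
                           int $maxPolls = 20, int $delaySeconds = 3): ?string
{
    $base = 'https://api.anti-captcha.com';

    // 1. Submit the base64-encoded image as a new task.
    $created = $post("$base/createTask", [
        'clientKey' => $apiKey,
        'task' => ['type' => 'ImageToTextTask', 'body' => base64_encode($imageBytes)],
    ]);
    if (($created['errorId'] ?? 1) !== 0) {
        return null; // the service rejected the task
    }

    // 2. Poll until the task is ready or we run out of attempts.
    for ($i = 0; $i < $maxPolls; $i++) {
        $result = $post("$base/getTaskResult", [
            'clientKey' => $apiKey,
            'taskId' => $created['taskId'],
        ]);
        if (($result['errorId'] ?? 1) !== 0) {
            return null; // the service reported an error
        }
        if (($result['status'] ?? '') === 'ready') {
            return $result['solution']['text'] ?? null;
        }
        sleep($delaySeconds);
    }
    return null; // gave up waiting
}

// A cURL-based transport for real use (hypothetical helper).
function curlJsonPost(string $url, array $payload): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => json_encode($payload),
        CURLOPT_HTTPHEADER => ['Content-Type: application/json'],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => 30,
    ]);
    $body = curl_exec($ch);
    // Return an empty array on network or JSON errors so callers see a failure.
    return is_string($body) ? (json_decode($body, true) ?? []) : [];
}
```

In a scraper this would be called as solveImageCaptcha('curlJsonPost', 'your_api_key', file_get_contents($captcha_url)). Passing the transport as a callable keeps the polling logic testable without network access.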

Use a Browser Automation Tool

Browser automation tools like Selenium allow you to programmatically drive a real web browser. This means CAPTCHAs can be solved manually or using image recognition within the browser session.

The approach would be:

  1. Load the target page in the browser using Selenium
  2. Detect when the CAPTCHA appears
  3. Bring it into focus and solve it manually or with image recognition
  4. Resume the scraping script after the CAPTCHA is solved

Here's some sample Selenium + PHP code:

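// assumes the php-webdriver/webdriver Composer package is installed
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

// start chrome browser via a Selenium server (assumed to run at localhost:4444)
$driver = RemoteWebDriver::create('http://localhost:4444', DesiredCapabilities::chrome());

// load target page
$driver->get('http://example.com');

// solve captcha logic would go here...

// get page source to scrape
$html = $driver->getPageSource();

// close the browser session
$driver->quit();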

The main downside to this method is it doesn't scale well unless you create a browser farm, which is complex to set up.
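
Steps 2 to 4 of the approach above (detect the CAPTCHA, wait for it to be solved, resume) can be sketched in plain PHP. This is a rough heuristic, not a robust detector: it assumes the CAPTCHA widget is recognizable by the CSS class names commonly used by reCAPTCHA, hCaptcha and Cloudflare Turnstile, and that the marker disappears from the page source once the challenge is passed. Some sites embed these widgets on every page, so adjust the check for your target. The function names are made up for illustration.

```php
<?php
// Heuristic markers: class names commonly used by reCAPTCHA, hCaptcha and
// Cloudflare Turnstile widgets. This list is an assumption; extend as needed.
const CAPTCHA_MARKERS = ['g-recaptcha', 'h-captcha', 'cf-turnstile'];

// Returns true if the HTML contains any known CAPTCHA marker.
function looksLikeCaptchaPage(string $html): bool
{
    foreach (CAPTCHA_MARKERS as $marker) {
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }
    return false;
}

/**
 * Polls $getHtml until the page no longer looks like a CAPTCHA page.
 * Returns the final HTML, or null if the timeout expires first.
 */
function waitUntilCaptchaSolved(callable $getHtml, int $timeoutSeconds = 300,
                                int $pollSeconds = 2): ?string
{
    $deadline = time() + $timeoutSeconds;
    do {
        $html = $getHtml();
        if (!looksLikeCaptchaPage($html)) {
            return $html; // CAPTCHA gone (or never shown): resume scraping
        }
        sleep($pollSeconds); // give the human or solver time to work
    } while (time() < $deadline);
    return null;
}
```

With a Selenium session, the callable would wrap the driver, for example function () use ($driver) { return $driver->getPageSource(); }, and the scraper would continue with the returned HTML or abort on null.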

Use a Proxy Service

Some proxy services rotate IP addresses with each request, so your traffic appears to come from many ordinary visitors rather than a single client. This lets you bypass sites that lock out a single IP after too many requests.

Scraping through proxies can result in getting around CAPTCHAs, but this tactic is becoming less effective over time as sites improve detection.
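
As a rough illustration of rotation, the sketch below cycles through a small proxy pool with PHP's cURL extension. The proxy addresses are placeholders from a documentation-only IP range, and pickProxy and fetchViaProxy are hypothetical helper names; a commercial service would supply its own endpoints and may handle rotation for you.

```php
<?php
// Placeholder proxy addresses (documentation-only range); replace with real ones.
$proxies = ['203.0.113.10:8080', '203.0.113.11:8080', '203.0.113.12:8080'];

// Round-robin: request N goes through proxy N modulo the pool size.
function pickProxy(array $proxies, int $requestIndex): string
{
    return $proxies[$requestIndex % count($proxies)];
}

// Fetch a URL through the given proxy; returns the body, or null on failure.
function fetchViaProxy(string $url, string $proxy): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_PROXY => $proxy,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT => 20,
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}
```

A scraping loop would then call fetchViaProxy($url, pickProxy($proxies, $n)) for the n-th request, ideally with a pause between requests.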

Ethical Considerations for CAPTCHA Solving

Defeating CAPTCHAs is technically possible, but it raises ethical concerns worth considering:

  • Respect site owner intent - If a site purposefully impedes scraping, workarounds undermine their wishes and controls.
  • Follow site terms of service - Make sure your scraping doesn't violate a site's ToS agreement.
  • Limit resource usage - Solving CAPTCHAs incurs additional costs for the site owner, so scrape responsibly.

I'd recommend reaching out to a site owner before scraping protected data at large scale. There may be an official API or data offering available instead.

Scraping public data at reasonable volumes is usually fine. Just keep your request rate low enough that you don't trigger IP locks or abuse alerts.

Key Takeaways for Handling CAPTCHAs When Scraping

Here are the main tips to remember:

  • Services like Anti-Captcha allow solving CAPTCHAs via API requests.
  • Browser automation using Selenium lets you manually or programmatically solve CAPTCHAs.
  • Rotating proxy services spread requests across many IPs, which can sometimes avoid triggering CAPTCHAs.
  • Consider site owner perspective, terms of service, and resource usage when solving CAPTCHAs.

With the right approach and precautions, it is possible to overcome CAPTCHAs for web scraping projects. The methods discussed provide options to responsibly scrape sites employing protections.
