The Ultimate Goutte Cheat Sheet for PHP

Oct 31, 2023 · 5 min read

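Goutte is a PHP web scraping library built on Symfony's BrowserKit and DomCrawler components. This reference covers its main capabilities. The examples assume Goutte 3.x, which uses Guzzle as its HTTP client; Goutte 4 replaced Guzzle with Symfony HttpClient, so the Guzzle-specific snippets do not apply to it.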

Installation

Composer:

composer require "fabpot/goutte:^3.2"

Client Configuration

Set user agent:

$client = new \Goutte\Client();
// Goutte sends HTTP_* server parameters as request headers.
$client->setServerParameter('HTTP_USER_AGENT', 'Firefox');

Set timeouts:

// Timeouts are Guzzle options, in seconds: connect_timeout (connection) and timeout (whole request).
$client->setClient(new \GuzzleHttp\Client(['connect_timeout' => 30, 'timeout' => 90]));

Handle cookies:

$client->getCookieJar()->set(new \Symfony\Component\BrowserKit\Cookie('session', 'foo'));

Custom client:

$stack = \GuzzleHttp\HandlerStack::create();
$client = new \GuzzleHttp\Client(['handler' => $stack]);
$goutteClient = new \Goutte\Client();
$goutteClient->setClient($client);

Making Requests

GET request:

$crawler = $client->request('GET', 'https://example.com/products'); // use absolute URLs; a relative first request resolves against localhost

POST request:

$crawler = $client->request('POST', 'https://example.com/login', ['username' => '', 'password' => '']);

Upload files:

$crawler = $client->request('POST', 'https://example.com/upload', [], ['photo' => $path]); // $path: local file to send

Attach session:

$client->getCookieJar()->set(\Symfony\Component\BrowserKit\Cookie::fromString($sessionCookie)); // $sessionCookie: a "name=value" cookie string

Follow redirects:

$client->followRedirects(true);
$crawler = $client->request('GET', $url); // follows redirects

Selecting Elements

CSS selector:

$els = $crawler->filter('div > span.title');

XPath expression:

$els = $crawler->filterXPath('//h1[@class="headline"]');

Combining CSS and XPath:

$crawler->filterXPath('//div')->filter('span.title');

Matching text:

$crawler->filterXPath('//p[contains(text(), "Hello")]');
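
One subtlety of the text-matching expression: in XPath 1.0, contains(text(), ...) tests only the element's first text node, so a match that sits after an inline child tag is missed, while contains(., ...) tests the element's full text. The sketch below illustrates the difference with PHP's bundled DOM extension rather than Goutte, so it runs without dependencies. The countMatches helper and the sample markup are made up for illustration.

```php
<?php
// Count the nodes in $html that match $xpath, using PHP's bundled DOM extension.
function countMatches(string $html, string $xpath): int
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml may raise for HTML fragments.
    @$doc->loadHTML($html);
    $nodes = (new DOMXPath($doc))->query($xpath);

    return $nodes === false ? 0 : $nodes->length;
}

// Hypothetical markup: the second paragraph has "Hello" after an inline <b> tag.
$html = '<p>Hello world</p><p>Say <b>it</b>: Hello</p><p>Bye</p>';

// text() is reduced to the first text node of each <p>.
$firstTextOnly = countMatches($html, '//p[contains(text(), "Hello")]');

// "." is the full string value of each <p>, including nested tags.
$anyDescendant = countMatches($html, '//p[contains(., "Hello")]');
```

The same two expressions can be passed to filterXPath(); the second form is usually the safer choice when paragraphs contain inline markup.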

Pagination links:

$link = $crawler->selectLink('Next Page')->link(); // selectLink() matches on the link text

Extracting Data

Get text:

$text = $el->text();

Get HTML:

$html = $el->html();

Get outer HTML:

$html = $el->outerHtml();

Get attribute:

$url = $el->attr('href');

Get raw response:

$response = $client->getResponse(); // the response lives on the client, not the crawler

Interacting with Pages

Click link:

$link = $crawler->selectLink('Next')->link();
$crawler = $client->click($link);

Submit form:

$form = $crawler->selectButton('Submit')->form();
$crawler = $client->submit($form);

Upload file:

$form = $crawler->selectButton('Upload')->form();
$form['file']->upload('/path/to/file');
$crawler = $client->submit($form);

Scroll page:

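// Goutte does not execute JavaScript, so it has no scroll API; use a browser driver such as Symfony Panther:
// $pantherClient->executeScript('window.scrollTo(0, document.body.scrollHeight)');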

Handling Responses

Check status code:

$statusCode = $client->getResponse()->getStatusCode();

if ($statusCode === 200) {
  // Success
}

Get response headers:

$headers = $client->getResponse()->getHeaders();

Get response body:

$html = $client->getResponse()->getContent();

Debugging and Logging

Debug client:

$client->setClient(new \GuzzleHttp\Client(['debug' => true])); // Guzzle's debug option prints verbose transfer details

Log requests:

$logger = new \Monolog\Logger('goutte');
$stack = \GuzzleHttp\HandlerStack::create();
$stack->push(\GuzzleHttp\Middleware::log($logger, new \GuzzleHttp\MessageFormatter()));
$client = new \Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $stack]));

Mocking Responses

Mock response:

use GuzzleHttp\Handler\MockHandler;

$mock = new MockHandler([
  new \GuzzleHttp\Psr7\Response(200, ['Content-Type' => 'text/html'], '<html>...</html>')
]);

$handler = \GuzzleHttp\HandlerStack::create($mock);
$client = new \Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $handler]));

Rate Limiting

Limit per second:

// Guzzle ships no throttle middleware; its delay option (milliseconds) pauses before each request.
$guzzle = new \GuzzleHttp\Client(['delay' => 100]); // 100 ms per request, roughly 10 requests per second at most
$client = new \Goutte\Client();
$client->setClient($guzzle);
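
When the limit should apply across every request a script makes, a small throttle around the request calls is another option. The following is a minimal sketch in plain PHP under stated assumptions: the Throttle class is hypothetical (it is not part of Goutte or Guzzle), and the budget of 10 requests per second and the paths are placeholders.

```php
<?php
// Hypothetical helper: enforces a minimum interval between successive calls to wait().
final class Throttle
{
    private float $minInterval;
    private float $lastCall = 0.0;

    public function __construct(float $requestsPerSecond)
    {
        $this->minInterval = 1.0 / $requestsPerSecond;
    }

    // Sleeps just long enough to respect the interval and returns the seconds slept.
    public function wait(): float
    {
        $now = microtime(true);
        $sleep = max(0.0, $this->lastCall + $this->minInterval - $now);
        if ($sleep > 0) {
            usleep((int) round($sleep * 1e6));
        }
        $this->lastCall = microtime(true);

        return $sleep;
    }
}

$throttle = new Throttle(10); // assumed budget: 10 requests per second
$slept = [];
foreach (['/a', '/b', '/c'] as $path) {
    $slept[] = $throttle->wait();
    // The real request would go here, for example:
    // $crawler = $client->request('GET', 'https://example.com' . $path);
}
```

The first call returns immediately; later calls sleep for whatever remains of the 100 ms interval.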

Dynamic throttling:

$stack = \GuzzleHttp\HandlerStack::create();
$stack->push(new \SomeProvider\DynamicThrottleMiddleware()); // placeholder: substitute your own middleware
$client = new \Goutte\Client();
$client->setClient(new \GuzzleHttp\Client(['handler' => $stack]));

Asynchronous Requests

Concurrent requests:

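use GuzzleHttp\Promise\Utils;

// requestAsync() is a Guzzle method; Goutte's own client is synchronous.
$guzzle = new \GuzzleHttp\Client();
$promises = [
  'page1' => $guzzle->requestAsync('GET', 'https://page1.com'),
  'page2' => $guzzle->requestAsync('GET', 'https://page2.com')
];

$results = Utils::unwrap($promises); // PSR-7 responses, keyed like $promises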

Real World Use Cases

  • Large scale web archiving
  • Dynamic scraping against modern JS sites
  • Cloud based web automation
  • Distributed scraping with multiple clients
  • Scrapers for research papers
  • Automated financial reports
  • Creating training datasets for ML
  • Regression testing UIs with visual diffs
  • Scraping data from web API responses
  • Migrating content between CMSs
  • Price monitoring and alerting
  • Tracking website changes
  • Comparing product prices
  • Monitoring domains for brand abuse
  • Building news aggregators
  • Public data mining and analysis
  • Processing HTML datasets
  • Scraping geospatial data for mapping

    Using with Other Libraries

    Goutte's request() already returns a Symfony DomCrawler instance. To filter other HTML, build a crawler from a string:

    $crawler = $client->request('GET', 'https://example.com');
    $domCrawler = new \Symfony\Component\DomCrawler\Crawler();
    $domCrawler->addHtmlContent($crawler->html());
    $filtered = $domCrawler->filter('div.content');
    

    Batching and Concurrency

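    Improve efficiency for large scrapes by batching requests. Goutte has no batch client, so this uses Guzzle's Pool:

    $guzzle = new \GuzzleHttp\Client();
    $requests = [
      new \GuzzleHttp\Psr7\Request('GET', 'https://page1.com'),
      new \GuzzleHttp\Psr7\Request('GET', 'https://page2.com'),
    ];
    $responses = \GuzzleHttp\Pool::batch($guzzle, $requests); // responses (or exceptions) in request order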
    

    Scrape in parallel for performance using Guzzle promises:

    $guzzle = new \GuzzleHttp\Client();
    $promises = [
      'page1' => $guzzle->requestAsync('GET', 'https://page1.com'),
      'page2' => $guzzle->requestAsync('GET', 'https://page2.com')
    ];
    
    $results = \GuzzleHttp\Promise\Utils::settle($promises)->wait();
    

    Best Practices

    Respect robots.txt:

    $client->getClient()->getConfig('handler')->push(RobotTxtMiddleware::create()); // RobotTxtMiddleware is not part of Goutte or Guzzle; supply your own
    

    Implement rate limiting:

    $guzzle = new \GuzzleHttp\Client(['delay' => 100]); // Guzzle's delay option: wait 100 ms before each request (about 10 rps)
    

    Avoid overloading servers:

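    // $guzzle is a Guzzle client and $requests an array of PSR-7 requests
    $responses = \GuzzleHttp\Pool::batch($guzzle, $requests, ['concurrency' => 10]); // at most 10 requests in flight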
    

    Scraping JavaScript Sites

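    Goutte cannot run JavaScript. Render the page in a headless browser first, for example with the PuPHPeteer bridge (nesk/puphpeteer):

    $puppeteer = new \Nesk\Puphpeteer\Puppeteer();
    $browser = $puppeteer->launch();
    $page = $browser->newPage();
    $page->goto('https://example.com');
    $html = $page->content();
    $browser->close();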
    

    Persisting Scraped Data

    Save to JSON file:

    $data = $crawler->filter('.listing')->each(function ($node) {
      return $node->text();
    });
    
    file_put_contents('listings.json', json_encode($data));
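
A slightly more defensive variant is sketched below. It assumes PHP 7.3 or later for JSON_THROW_ON_ERROR; the saveJson helper, the sample rows, and the temp-file location are illustrative choices, not part of Goutte. Writing to a temporary file and renaming it means a crash mid-write does not leave a truncated listings.json behind.

```php
<?php
// Hypothetical helper: encode $rows as readable JSON and replace $path in one step.
function saveJson(string $path, array $rows): void
{
    // Throw on encoding failures instead of silently writing "false".
    $json = json_encode(
        $rows,
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );

    $tmp = $path . '.tmp';
    if (file_put_contents($tmp, $json, LOCK_EX) === false) {
        throw new RuntimeException("Could not write $tmp");
    }
    if (!rename($tmp, $path)) {
        throw new RuntimeException("Could not move $tmp to $path");
    }
}

// Sample rows standing in for the text scraped from ".listing" nodes.
$file = sys_get_temp_dir() . '/listings.json';
saveJson($file, ['Café table', 'Oak chair']);

// Read the file back to confirm the data survived the round trip.
$roundTrip = json_decode(file_get_contents($file), true);
```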
    

    Debugging Tips

    Enable Guzzle debug logging:

    $stack->push(\GuzzleHttp\Middleware::log($logger, new \GuzzleHttp\MessageFormatter(\GuzzleHttp\MessageFormatter::DEBUG), \Psr\Log\LogLevel::DEBUG));
    

    Inspect headers and response codes:

    $response = $client->getResponse();
    $headers = $response->getHeaders();
    $statusCode = $response->getStatusCode();
    

    Proxy and User Agent Rotation

    Rotate user agents to avoid blocks:

    $agents = ['Firefox', 'Chrome', /* ... */];
    $client->setServerParameter('HTTP_USER_AGENT', $agents[array_rand($agents)]);
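
Random selection can pick the same agent several times in a row. If an even spread matters, a round-robin picker is one alternative. This is a minimal sketch: userAgentCycle is a hypothetical helper, and the agent strings are the same placeholders used above rather than real User-Agent values.

```php
<?php
// Hypothetical helper: returns a closure that yields the agents in order, wrapping around.
function userAgentCycle(array $agents): callable
{
    if ($agents === []) {
        throw new InvalidArgumentException('At least one user agent is required.');
    }

    $agents = array_values($agents); // normalise keys to 0..n-1
    $index = 0;

    return function () use ($agents, &$index): string {
        $agent = $agents[$index % count($agents)];
        $index++;

        return $agent;
    };
}

$nextAgent = userAgentCycle(['Firefox', 'Chrome']);

// Each call advances the cycle; pass the result to the client before a request.
$picked = [$nextAgent(), $nextAgent(), $nextAgent()];
```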
    

    Use proxies for IP rotation:

    $client = new \Goutte\Client();
    $client->setClient(new \GuzzleHttp\Client(['proxy' => '104.198.224.19'])); // Guzzle's proxy request option
    

    Useful Goutte Libraries

  • goutte-scraper - Scraper with batteries included
  • laravel-goutte - Laravel integration
  • guzzle-crawler - Powerful crawling framework

    Real World Examples

    Scrape pricing data:

    $prices = $crawler->filter('.price')->each(function ($node) {
      return $node->text();
    });
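
The text of a price node usually needs cleaning before it can be compared or stored as a number. The sketch below is one way to do that in plain PHP. parsePrice is a hypothetical helper, the sample strings are invented, and it assumes "," is a thousands separator and "." the decimal point, which does not hold for every locale.

```php
<?php
// Hypothetical helper: pull the first number out of a price string, or null if none.
function parsePrice(string $text): ?float
{
    // A digit, then digits or commas, then an optional decimal part.
    if (!preg_match('/\d[\d,]*(?:\.\d+)?/', $text, $match)) {
        return null;
    }

    // Drop thousands separators before converting.
    return (float) str_replace(',', '', $match[0]);
}

// Invented examples of what ->text() might return for ".price" nodes.
$rawPrices = ['$1,299.99', 'Price: 45 USD', 'Sold out'];
$parsedPrices = array_map('parsePrice', $rawPrices);
```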
    

    Extract contact info:

    $contacts = $crawler->filter('.contact-list')->each(function ($node) {
      return $node->filter('a')->each(function ($link) {
        return $link->text();
      });
    });
    

    Troubleshooting

    Handling captchas:

    // Option 1: Use a service like AntiCaptcha
    // Option 2: Rotate proxies and retry on detection
    

    Scraping paginated content:

    while (true) {
      // Scrape the current page here

      $next = $crawler->selectLink('Next');
      if ($next->count() === 0) {
        break; // no "Next" link: last page reached
      }
      $crawler = $client->click($next->link());
    }
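
A pagination loop never ends if a site's last page still links to "Next" or the links form a cycle. The sketch below shows one way to guard against that, using a page cap and a visited list. It is written in plain PHP with a stand-in for the network: crawlPages, the $fetch callback, and the fake site map are all hypothetical, and in a real scraper the callback would make the request and return the next URL or null.

```php
<?php
// Hypothetical helper: follow "next" pointers until none is left, a URL repeats,
// or $maxPages pages have been visited. $fetch returns the next URL or null.
function crawlPages(callable $fetch, string $startUrl, int $maxPages): array
{
    $visited = [];
    $url = $startUrl;

    while ($url !== null
        && count($visited) < $maxPages
        && !in_array($url, $visited, true)
    ) {
        $visited[] = $url;
        $url = $fetch($url);
    }

    return $visited;
}

// Stand-in for a site whose last page links back to the first one.
$fakeSite = ['/p1' => '/p2', '/p2' => '/p3', '/p3' => '/p1'];
$fetch = fn (string $url): ?string => $fakeSite[$url] ?? null;

$visited = crawlPages($fetch, '/p1', 10); // stops when /p1 comes up again
$capped = crawlPages($fetch, '/p1', 2);   // stops at the page cap
```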
    
