Building a Simple Proxy Rotator with Scala and Web Scraping

Oct 2, 2023 · 5 min read

When you are starting a web crawling project, or need to scale one to a few hundred requests, a simple proxy rotator that populates itself from free proxy pools can be useful.

We can use a site like https://sslproxies.org/ to fetch public proxies periodically and use them in Scala projects.

Let's walk through a simple tutorial to do this in Scala. We'll be using Scala.js and its DOM bindings to handle the scraping, which means the code is meant to run in a JavaScript environment such as a browser or Node.js.


If you check the HTML using the inspect tool, you will see that the full content is contained in a table with the id proxylisttable.

The IP and port are the first and second cells in each row.

We can use the following code to select the table and its rows, iterate over them, and pull out the first and second cells of each row.

Setup

First, make sure you have SBT installed and create a new Scala project:

sbt new scala/scala-seed.g8

Add the scalajs-dom dependency (this assumes a Scala.js project with the sbt-scalajs plugin enabled):

libraryDependencies += "org.scala-js" %%% "scalajs-dom" % "1.1.0"


Fetching Proxies

Let's define a simple function to fetch the HTML from the proxy site:

import org.scalajs.dom.ext.Ajax

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Ajax.get is asynchronous, so we return a Future of the response body
def fetchProxies(): Future[String] = {
  val url = "https://sslproxies.org/"
  Ajax.get(url).map(_.responseText)
}

This uses the Scala.js Ajax wrapper to send a GET request and returns a Future holding the HTML response. Note that in a browser this is a cross-origin request, so it will only succeed if the target site allows it via CORS (or if the code runs in an environment that doesn't enforce CORS, such as Node.js with an XMLHttpRequest polyfill).
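As a quick sanity check, here's a minimal usage sketch (assuming the imports above are in scope) that just logs how much HTML came back:

// Sketch only: fire the request and log the size of the response
fetchProxies().foreach { html =>
  println(s"Fetched ${html.length} characters of HTML")
}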

Parsing the Proxies

Next we can parse the HTML to extract the proxies. The proxies are in a table, so we can use the Scala.js DOM APIs to find the rows:

import org.scalajs.dom
import org.scalajs.dom.ext._ // Seq-like operations on NodeList

def parseProxies(html: String): List[Map[String, String]] = {

  // Parse the raw HTML string into a detached Document
  val doc = new dom.DOMParser().parseFromString(html, "text/html")

  // Only the body rows carry proxy data; the header row holds <th> labels
  val rows = doc.querySelectorAll("#proxylisttable tbody tr")

  rows.map { node =>
    val cells = node.asInstanceOf[dom.Element].children
    Map(
      "ip" -> cells(0).textContent,
      "port" -> cells(1).textContent
    )
  }.toList
}

This parses the HTML into a document, selects the data rows of the table (skipping the header), and extracts the IP and port from the first two cells of each row.
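To see what the parser produces, we can feed it a hand-written fragment that mirrors the page's structure; the IP and port below are made-up placeholder values:

val sample =
  """<table id="proxylisttable">
    |  <thead><tr><th>IP Address</th><th>Port</th></tr></thead>
    |  <tbody><tr><td>203.0.113.7</td><td>8080</td></tr></tbody>
    |</table>""".stripMargin

println(parseProxies(sample))
// Expected output: List(Map(ip -> 203.0.113.7, port -> 8080))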

Putting It Together

We can wrap this in a function to fetch and parse the proxies:

import scala.util.Random

def getProxies(): Future[List[Map[String, String]]] =
  fetchProxies().map(parseProxies)

def randomProxy(): Future[Map[String, String]] =
  getProxies().map(proxies => proxies(Random.nextInt(proxies.length)))

getProxies fetches the HTML and parses it. randomProxy calls it and picks a random proxy from the resulting list. Both return Futures because the underlying Ajax call is asynchronous.

To use a proxy:

randomProxy().foreach { proxy =>
  // proxy hostname
  println(proxy("ip"))

  // proxy port
  println(proxy("port"))
}

Scheduling Fetches

To repopulate the proxy list periodically, we need a scheduler. On the JVM you might reach for something like Akka's Source.tick, but Akka isn't available under Scala.js, so we can use the setInterval wrapper that ships with Scala.js instead:

import scala.concurrent.duration._
import scala.scalajs.js.timers._

// Latest proxy list, refreshed in the background
var proxyPool: List[Map[String, String]] = Nil

// Refresh the pool every 5 minutes
setInterval(5.minutes) {
  getProxies().foreach(proxies => proxyPool = proxies)
}

This fetches fresh proxies every 5 minutes and caches them for the rotator to draw from.
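A refresh can fail if the site is down or the network hiccups. One simple hedge, assuming the proxyPool variable defined above, is to keep the old pool on failure and just log the error:

import scala.util.{Failure, Success}

getProxies().onComplete {
  case Success(proxies) => proxyPool = proxies
  case Failure(err) => println(s"Proxy refresh failed, keeping old pool: $err")
}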

Summary

This gives a simple Scala proxy rotator using Scala.js for web scraping. For production use cases, a dedicated proxy service may be better for reliability and scale. But for small projects, this provides a handy way to avoid blocks with free proxy rotations.

The full code is:

import org.scalajs.dom
import org.scalajs.dom.ext._

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.scalajs.js.timers._
import scala.util.Random

object ProxyRotator {

  // Latest proxy list, refreshed in the background
  var proxyPool: List[Map[String, String]] = Nil

  def fetchProxies(): Future[String] = {
    val url = "https://sslproxies.org/"
    Ajax.get(url).map(_.responseText)
  }

  def parseProxies(html: String): List[Map[String, String]] = {
    val doc = new dom.DOMParser().parseFromString(html, "text/html")
    val rows = doc.querySelectorAll("#proxylisttable tbody tr")

    rows.map { node =>
      val cells = node.asInstanceOf[dom.Element].children
      Map(
        "ip" -> cells(0).textContent,
        "port" -> cells(1).textContent
      )
    }.toList
  }

  def getProxies(): Future[List[Map[String, String]]] =
    fetchProxies().map(parseProxies)

  def randomProxy(): Future[Map[String, String]] =
    getProxies().map(proxies => proxies(Random.nextInt(proxies.length)))

  // Refresh the pool once at startup and then every 5 minutes
  def start(): Unit = {
    getProxies().foreach(proxies => proxyPool = proxies)
    setInterval(5.minutes) {
      getProxies().foreach(proxies => proxyPool = proxies)
    }
  }
}

If you want to use this in production and scale to thousands of requests, you will find that many free proxies won't hold up to the speed and reliability requirements. In this scenario, using a rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

• With millions of high speed rotating proxies located all over the world
• With our automatic IP rotation
• With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
• With our automatic CAPTCHA solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

A simple API can access the whole thing like below in any programming language.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
