Scraping Reddit Posts in Kotlin

Jan 9, 2024 · 5 min read

In this beginner-friendly tutorial, we will scrape information from Reddit posts using a simple Kotlin script. We will send a request to Reddit, download the HTML content, parse it, and extract key data like the title, author, and score.

Here is the page we will be working with: the Reddit front page.

Importing Libraries

We need two external libraries for this script:

khttp - To send HTTP requests to the Reddit URL

Jsoup - To parse and process the HTML content

import khttp.get
import org.jsoup.Jsoup

No need to understand these libraries in depth right now. Just know that khttp gets web content and Jsoup processes HTML.
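If you want to follow along, you will need both libraries on your classpath. A minimal Gradle sketch follows; the exact coordinates and versions are assumptions on my part (khttp is no longer actively maintained and has been distributed via JitPack, so check each project's README for current instructions):

// build.gradle.kts - a minimal sketch; coordinates/versions are assumptions
repositories {
    mavenCentral()
    maven("https://jitpack.io") // khttp has been distributed via JitPack
}

dependencies {
    implementation("com.github.jkcclemens:khttp:0.1.0") // assumed coordinates/version
    implementation("org.jsoup:jsoup:1.17.2")            // assumed version
}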

Sending Request

We define the Reddit URL and a User-Agent header:

val redditUrl = "<https://www.reddit.com>"

val headers = mapOf(
   "User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
)

The User-Agent header makes our request look like it is coming from a regular browser, which makes Reddit less likely to block it.

Then we send a GET request and store the response:

val response = get(redditUrl, headers = headers)

This retrieves the HTML content of the Reddit front page.

Saving HTML Content

We check if our request was successful:

if (response.statusCode == 200) {
    // process response
} else {
    // request failed
}

Status code 200 means our GET request succeeded. We save the HTML text to a file:

val htmlContent = response.text
val filename = "reddit_page.html"
java.io.File(filename).writeText(htmlContent, Charsets.UTF_8)

The HTML of the Reddit front page is now saved locally for further processing.

Parsing HTML

To extract information, we need to parse the HTML content. Jsoup helps parse and traverse HTML documents:

val document = Jsoup.parse(htmlContent)

We now have a parsed representation of the entire Reddit front page HTML.
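As a quick sanity check, you can query the parsed document right away. For example, Jsoup can give you the page title and count elements (the output will vary with whatever HTML Reddit served you):

// Quick sanity checks on the parsed document
println(document.title())          // contents of the <title> tag
println(document.select("a").size) // number of anchor tags on the page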

Extracting Data

This is where we actually scrape information from the Reddit posts. We use selectors to find elements and extract data.

Understanding Selectors

Selectors let us query elements in the HTML document like a database. Some examples:

/* Tag and class */
div.post

/* Nested tags */
article > div.post-title

/* Attributes */
a[href^='/r/']

We use CSS-style selectors to target specific elements on the page. Here's how it works:

Tags and Classes

  • div.post - selects div tags with class post

Nesting

  • article > div - selects div tags that are direct children of article tags

Attributes

  • a[href^='/r/'] - selects anchor tags a whose href starts with /r/

This lets us precisely target the elements we want to extract data from; a short runnable demo follows below.
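Here is a small self-contained sketch of these selectors in action. The HTML snippet is made up purely for illustration:

import org.jsoup.Jsoup

fun main() {
    // A tiny hypothetical HTML snippet to demonstrate the selectors above
    val html = """
        <article>
            <div class="post-title">Hello</div>
            <div class="post">A post body</div>
        </article>
        <a href="/r/kotlin">r/kotlin</a>
        <a href="/user/someone">a user link</a>
    """.trimIndent()

    val doc = Jsoup.parse(html)

    println(doc.select("div.post").text())             // A post body
    println(doc.select("article > div").size)          // 2 (both divs are direct children)
    println(doc.select("a[href^='/r/']").attr("href")) // /r/kotlin
}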

Selecting Reddit Post Blocks

Inspecting the Elements

Upon inspecting the HTML in Chrome, you will see that each post is a custom shreddit-post element, decorated with a long list of styling classes specific to it.

In our code, we select the post blocks by the shreddit-post tag. (The styling classes include names like focus-within:bg-neutral-background-hover and xs:rounded-[16px]; the : and [ characters in them are rejected by Jsoup's CSS selector parser, so matching on the custom tag alone is both simpler and more reliable.)

val blocks = document.select("shreddit-post")

This targets the Reddit post blocks on the page. Let's break down the classes you will see on each block:

  • shreddit-post - the custom element Reddit uses for post blocks
  • block relative - layout classes applied to posts
  • bg-neutral-background - background styling
  • focus-within / hover variants - classes applied on hover or focus
  • xs:rounded-[16px] - rounded corners on small screens
  • p-md my-2xs - padding and margin classes
  • nd:visible - visibility class

So in simple terms, we are selecting post blocks by the shreddit-post tag; the other classes are purely presentational and don't need to appear in the selector.

Advanced selectors let us hone in on the exact set of elements we want. We could also use IDs or other attributes to target elements.
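For example, building on the attribute selector syntax shown earlier, you could narrow the selection to posts that link out to external content. This refinement is hypothetical, not part of the original script:

// Hypothetical refinement: only posts whose content-href is an absolute URL
val externalPosts = document.select("shreddit-post[content-href^=http]")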

Extracting Post Data

Inside the selected post blocks, we can extract information:

for (block in blocks) {
    val permalink = block.attr("permalink")
    val contentHref = block.attr("content-href")
    // extract other attributes..
}

The attr() method gets an attribute value from the element. For example, permalink contains the post URL and author contains the Reddit username.

Some key attributes we are extracting:

permalink - the post's URL
content-href - URL of the linked content
comment-count - number of comments
author - username of the poster
score - upvote count

The post title is not stored as an attribute; it lives in a nested element, so we pull it out with block.select("div[slot=title]").text() instead.
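One practical note: attr() returns attribute values as strings, and returns an empty string when the attribute is missing. For the numeric fields you may want a defensive conversion, for instance:

// Defensive parsing: attr() returns "" for missing attributes,
// so fall back to 0 when the value isn't a valid number
val commentCount = block.attr("comment-count").toIntOrNull() ?: 0
val score = block.attr("score").toIntOrNull() ?: 0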

And that's it! We have extracted the data we wanted from Reddit posts. The output prints this information for each post.

The full code again:

import khttp.get
import org.jsoup.Jsoup

fun main() {
    // Define the Reddit URL you want to download
    val redditUrl = "https://www.reddit.com"

    // Define a User-Agent header so the request looks like a browser's
    val headers = mapOf(
        "User-Agent" to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"
    )

    // Send a GET request to the URL with the User-Agent header
    val response = get(redditUrl, headers = headers)

    // Check if the request was successful (status code 200)
    if (response.statusCode == 200) {
        // Get the HTML content of the page
        val htmlContent = response.text

        // Specify the filename to save the HTML content
        val filename = "reddit_page.html"

        // Save the HTML content to a file
        java.io.File(filename).writeText(htmlContent, Charsets.UTF_8)

        println("Reddit page saved to $filename")

        // Parse the HTML content
        val document = Jsoup.parse(htmlContent)

        // Find all post blocks by their custom tag (the utility classes
        // contain : and [ characters that CSS selector parsing rejects)
        val blocks = document.select("shreddit-post")

        // Iterate through the blocks and extract information from each one
        for (block in blocks) {
            val permalink = block.attr("permalink")
            val contentHref = block.attr("content-href")
            val commentCount = block.attr("comment-count")
            val postTitle = block.select("div[slot=title]").text().trim()
            val author = block.attr("author")
            val score = block.attr("score")

            // Print the extracted information for each block
            println("Permalink: $permalink")
            println("Content Href: $contentHref")
            println("Comment Count: $commentCount")
            println("Post Title: $postTitle")
            println("Author: $author")
            println("Score: $score")
            println()
        }
    } else {
        println("Failed to download Reddit page (status code ${response.statusCode})")
    }
}

While real-world scrapers can get more complex (handling JavaScript rendering, cookies, rate limits, and so on), this covers the core concepts: sending requests, parsing HTML, and using selectors to extract data.
