The Ultimate HTML::TreeBuilder Cheatsheet in Perl

Oct 31, 2023 · 4 min read

HTML::TreeBuilder is a Perl module that parses HTML documents into a tree of HTML::Element objects. Once the tree is built, you can search it, modify it, and serialize it back to HTML.

Installation

To install HTML::TreeBuilder:

perl -MCPAN -e 'install HTML::TreeBuilder'

Or add it to your Perl project's cpanfile and run cpanm --installdeps . from the project directory:

requires 'HTML::TreeBuilder';

Basic Usage

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

$tree->parse_file("file.html");

my $root = $tree->root;

This parses the HTML file and stores the document tree in $tree. The tree object is itself the root <html> element, so $tree->root simply returns $tree.
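When the markup is already in a string rather than a file, the same setup can be sketched with parse and eof. The sample markup below is hypothetical and exists only for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, used only for illustration.
my $html = '<html><head><title>Demo</title></head>'
         . '<body><p class="intro">Hello</p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);   # feed markup; can be called repeatedly with chunks
$tree->eof;            # signal end of input so the tree is finalized

# The tree object is itself the root <html> element.
my $root_tag = $tree->tag;

# In scalar context, find_by_tag_name returns the first match.
my $title = $tree->find_by_tag_name('title')->as_text;

print "$root_tag: $title\n";

$tree->delete;         # release the tree explicitly when finished
```

Calling eof matters here: without it, the parser may still be holding buffered input that has not yet been added to the tree.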

Walking the Tree

To access child nodes:

my @children = $root->content_list;

To get a specific child by index (content_list returns a list, not an array reference):

my $child2 = ($root->content_list)[1];

Loop through children:

foreach my $child ($root->content_list) {
  next unless ref $child;  # text nodes are plain strings, not objects
  # do something with the element in $child
}

Navigate to parent:

my $parent = $node->parent;
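One detail worth seeing in action: a content list mixes element objects with plain strings for text. The sketch below, using a hypothetical one-line document, tells them apart with ref before calling any method.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup: a paragraph with text, an element, and more text.
my $tree = HTML::TreeBuilder->new_from_content(
    '<p>Plain text <b>bold</b> tail</p>'
);
my $p = $tree->look_down(_tag => 'p');

my @seen;
foreach my $child ($p->content_list) {
    if (ref $child) {
        # An HTML::Element object: methods such as tag are safe to call.
        push @seen, 'element:' . $child->tag;
    } else {
        # A plain string holding a text segment.
        push @seen, 'text';
    }
}

print join(', ', @seen), "\n";

# Navigating upward works from any element.
my $parent_tag = $p->parent->tag;

$tree->delete;
```

Calling tag on a text child without the ref check would die, because a plain string is not an object.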

Common Node Methods

tag

Get node's tag name:

my $tag = $node->tag;

as_text

Get node's inner text:

my $text = $node->as_text;

attr

Get attribute value by name:

my $class = $node->attr('class');

push_content

Add child node:

$parent->push_content($child);

unshift_content

Insert child at beginning:

$parent->unshift_content($newchild);

delete

Remove node and destroy it along with its descendants (use detach instead to remove it but keep it for reuse):

$node->delete;

replace_with

Replace node with new node:

$oldnode->replace_with($newnode);
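To show how these methods combine, here is a minimal sketch that builds a small list from scratch with HTML::Element->new and then edits it. The labels and the class name are made up for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # also loads HTML::Element

# Hypothetical helper that builds an <li> holding one text label.
sub make_item {
    my ($label) = @_;
    my $item = HTML::Element->new('li');
    $item->push_content($label);
    return $item;
}

my $list = HTML::Element->new('ul', class => 'menu');
$list->push_content(make_item($_)) for 'Home', 'About';

# Insert a new first child.
$list->unshift_content(make_item('Start'));

# Replace the last child; replace_with returns the node it replaced,
# which is then destroyed with delete.
my $last = ($list->content_list)[-1];
$last->replace_with(make_item('Contact'))->delete;

# An empty hash ref asks as_HTML to emit every end tag.
my $html = $list->as_HTML('<>&', undef, {});
print $html;

$list->delete;
```

Text is added as a plain string through push_content; there is no separate text-node class to construct.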

Searching the Tree

look_down

Find node recursively:

my $img = $root->look_down(_tag => 'img');

find_by_tag_name

Find all nodes by tag name:

my @divs = $root->find_by_tag_name('div');

find_by_attribute

Find nodes by attribute value:

my @figs = $root->find_by_attribute('class', 'figure');
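look_down also accepts several criteria at once, and all of them must match. The following sketch combines an exact attribute value, a regular expression, and a code reference; the markup and file names are invented for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup for illustration.
my $tree = HTML::TreeBuilder->new_from_content(<<'HTML');
<div class="figure"><img src="a.png" alt="A"></div>
<div class="note"><img src="b.png"></div>
<div class="figure"><img src="c.jpg" alt="C"></div>
HTML

# Tag name plus an exact attribute value.
my @figures = $tree->look_down(_tag => 'div', class => 'figure');

# Tag name plus a regex applied to an attribute value.
my @pngs = $tree->look_down(_tag => 'img', src => qr/\.png$/);

# Tag name plus a code ref that receives the element and returns true/false.
my @no_alt = $tree->look_down(
    _tag => 'img',
    sub { !defined $_[0]->attr('alt') },
);

printf "%d figures, %d PNGs, %d images without alt\n",
    scalar @figures, scalar @pngs, scalar @no_alt;

$tree->delete;
```

In list context look_down returns every match; in scalar context it returns only the first one, or undef when nothing matches.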

Modifying the Tree

tag

Change node's tag by passing the new name to tag:

$node->tag('div');

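delete_content + push_content

Set node's text content (there is no single setter, so clear the children and then add the text):

$node->delete_content;
$node->push_content("New text");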

attr

Set attribute value by passing a second argument to attr:

$node->attr(class => 'blue');

push_content

Add child to end:

$parent->push_content($child);

Outputting HTML

as_HTML

Serialize tree back to HTML:

print $tree->as_HTML;

as_text

Output text content only:

print $tree->as_text;
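as_HTML takes three optional arguments: the set of characters to entity-encode, an indent string, and a hash of tags whose end tags may be omitted. The sketch below, on a hypothetical list, passes an empty hash so that every end tag is written out.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup with omitted </li> tags and an entity.
my $tree = HTML::TreeBuilder->new_from_content(
    '<ul><li>One<li>Two &amp; three</ul>'
);

# Encode only <, > and &; indent with two spaces; emit all end tags.
my $html = $tree->as_HTML('<>&', '  ', {});

# as_text decodes entities and drops all markup.
my $text = $tree->look_down(_tag => 'ul')->as_text;

print $html;
print "$text\n";

$tree->delete;
```

Called with no arguments, as_HTML leaves out end tags that HTML treats as optional, such as the one for li, which can surprise tools that expect every element to be closed.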

Full Example

Here is an example script that loads HTML, finds all <img> tags, and sets their width to 100:

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file("index.html");

my @imgs = $tree->find_by_tag_name('img');

foreach my $img (@imgs) {
  $img->attr(width => 100);
}

print $tree->as_HTML;

Complex Tree Manipulation

More complex traversal and modification of the tree:

# Recursively find all <td> elements
my @cells = $root->look_down(sub {
  my $node = shift;
  return $node->tag eq 'td';
});

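# Prune a subtree (detach unlinks it from the tree but keeps it usable)
my $pruned = ($root->content_list)[2];
$pruned->detach if ref $pruned;

# Swap two node positions using a temporary placeholder element
my $placeholder = HTML::Element->new('span');
$n1->replace_with($placeholder);
$n2->replace_with($n1);
$placeholder->replace_with($n2);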

Custom Parsers and Handlers

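Tuning how the parser handles unusual markup:

# HTML::TreeBuilder is itself a subclass of HTML::Parser,
# so parsing options are set on the tree object before parsing
my $tree = HTML::TreeBuilder->new;
$tree->ignore_unknown(0);  # keep tags the parser does not recognize
$tree->store_comments(1);  # keep comments in the tree

$tree->parse($html);
$tree->eof;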

Performance and Memory Optimization

Avoid retaining the entire tree in memory:

# Extract the information you need, then discard the tree
my @info = map { $_->as_text } $tree->look_down(_tag => 'div');

$tree->delete;

Real-World Use Cases

Scraping content from HTML:

# Extract article content
# Note: call $tree->ignore_unknown(0) before parsing, otherwise
# HTML5 tags such as <article> may be dropped from the tree
my $article = $root->look_down(_tag => 'article');

my $text = $article ? $article->as_text : '';

Using HTML::TreeBuilder for templating:

# Template system

my $template = HTML::TreeBuilder->new;
$template->parse($html);
$template->eof;

# ... logic to fill template ...

my $main = $template->look_down(id => 'main');
$main->replace_with($content) if $main;

print $template->as_HTML;

Tips and Tricks

  • Check if a node has children:

    if ($node->content_list) {
      # has children
    }

  • Remove all children:

    $node->delete_content;

  • Get first/last child:

    my $first = ($node->content_list)[0];
    my $last  = ($node->content_list)[-1];

Comparison with Mojo::DOM

HTML::TreeBuilder                     | Mojo::DOM
Maintains parent/child relationships  | No persistent structure
Modifying original tree               | Parsed copy, original unchanged
Heavier memory usage                  | Lower memory footprint
Straightforward DOM interface         | CSS selector-based methods

Error Handling

# Wrap in eval block
eval {
  $tree->parse($html);
  $tree->eof;
  1;
} or die "Parse error: $@";
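The parser is lenient with malformed markup, so the failure seen more often in practice is a file that cannot be opened. As documented for HTML::Parser, from which HTML::TreeBuilder inherits parse_file, that case returns undef and leaves the reason in $!. A minimal sketch, using a hypothetical path chosen so the open fails:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical path, chosen so that opening it fails.
my $file = '/nonexistent/dir/missing.html';

my $tree = HTML::TreeBuilder->new;

# parse_file returns undef when the file cannot be opened.
my $ok = defined $tree->parse_file($file) ? 1 : 0;

if ($ok) {
    print $tree->as_text, "\n";
} else {
    warn "Could not open $file: $!\n";
}

$tree->delete;
```

Checking the return value this way avoids silently working with an empty tree.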
    
