Data Extraction with Hpricot

by Jonas Alves

Issue: Vol 2, Issue 1 - All Stuff, No Fluff

published in June 2010


Jonas Alves is a developer based in São Paulo, Brazil. He started with Ruby on Rails development early in 2008 and picked up other Ruby libraries later. Jonas is currently employed by WebGoal, where Ruby helps the team develop high-quality software quickly and with a high return on investment.

Collecting data from websites manually can be very time consuming and error-prone.

One of our customers at WebGoal had 10 employees working 10 hours a day collecting data from websites. The company’s leaders were complaining about the cost of this, so my team proposed to automate the task.

After a day spent testing tools in several languages (PHP, Java, C++, C# and Ruby), we found that Hpricot is the most powerful, yet simplest to use, tool of its kind.

The company used PHP in all of its internal systems. After reading our document about the advantages of Hpricot and Ruby, they agreed to use them.

It helped them collect more data in less time than before and with fewer people on the job.

What is Hpricot?

According to Hpricot’s wiki at GitHub, "Hpricot is a very flexible HTML parser, based on Tanaka Akira’s HTree and John Resig’s jQuery, but with the scanner recoded in C." You can use it to read, navigate and even modify any XML document.

Why should I choose Hpricot?

  • It’s simple to use.
    You can use CSS or XPath selectors. Any CSS selector that works in jQuery should work in Hpricot too, because Hpricot is based on it.
  • It’s fast.
    Hpricot’s scanner was written in the C programming language.
  • It’s less verbose.
    See for yourself:

Scenario: Extracting the team members’ names from the Rails Magazine website

Ruby + Hpricot

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://railsmagazine.com/team"))
team = []
doc.search(".article-content td:nth(1) a").each do |a|
  team << a.inner_text
end
puts team.join("\n")

PHP + DOM Document


$doc = new DOMDocument();
$doc->loadHTMLFile("http://railsmagazine.com/team");
$team = array();
$trs = $doc->getElementsByTagName('tr');
foreach($trs as $tr) {
  $a = $tr->getElementsByTagName('a')->item(0);
  $team[] = $a->nodeValue;
}
print(implode("\n", $team));


A similar comparison was included in the document we composed to convince our customer to use Ruby and Hpricot.

Look at the search methods. Hpricot shines with CSS selectors, while PHP's DOM Document can search by only one tag or id at a time. With Hpricot's CSS selectors it's possible to find the desired elements with a single search.

  • It’s smart.
    Hpricot tries to fix XHTML errors. In the PHP example, the DOM Document library shows 7 warnings about errors in the document. Hpricot doesn’t.
  • It’s Ruby! :)

Let’s code!

The example above is very simple. It loads the /team page on the Rails Magazine website and searches for the team members' names.

In real-life data extraction you will probably have to deal with pagination and authentication, search a page for something like ids, urls or names, then use that data to load another page, and so on.
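As a rough sketch of that flow in plain Ruby: fetch_page below is a hypothetical stand-in for the real Hpricot(open(url)) call, and the urls are made up for illustration.

```ruby
# A sketch of a typical multi-step extraction:
# listing pages -> item urls found on them -> item details.
# fetch_page is a hypothetical stand-in for a real HTTP request.
fetch_page = lambda do |url|
  case url
  when %r{/page/(\d+)\z}
    # pretend every listing page links to two items
    n = $1.to_i
    ["/item/#{n}a", "/item/#{n}b"]
  when %r{/item/(\w+)\z}
    "details for item #{$1}"
  end
end

items = []
1.upto(2) do |page_number|            # pagination
  item_urls = fetch_page.call("/page/#{page_number}")
  item_urls.each do |item_url|        # follow each url found on the page
    items << fetch_page.call(item_url)
  end
end

puts items  # details for items 1a, 1b, 2a and 2b
```

The shape is always the same: load a page, pull out the data you need to reach the next page, repeat.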

To show Hpricot's basic functionality, we are going to extract Ruby Inside's blog posts and their comments. The data we will retrieve includes each post's title, author name and text, plus its comments, each with its sender and text.

Let's start by creating classes to hold the blog post and comment data:

class BlogPost
  attr_accessor :title, :author, :text, :comments
end

class Comment
  attr_accessor :sender, :text
end


These are simple classes with some accessible (read and write) attributes.
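For instance, once the classes are loaded, the attributes can be set and read freely. The class definitions are repeated here so the snippet runs on its own; the sample values are made up.

```ruby
class BlogPost
  attr_accessor :title, :author, :text, :comments
end

class Comment
  attr_accessor :sender, :text
end

# attr_accessor generates both reader and writer methods
post = BlogPost.new
post.title = "Hello Hpricot"
post.comments = []

comment = Comment.new
comment.sender = "Jonas"
comment.text = "Nice post!"
post.comments << comment

puts post.title                  # Hello Hpricot
puts post.comments.first.sender  # Jonas
```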

We will also create a class named RubyInsideExtractor, which will be responsible for retrieving the data from the blog:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'blog_post'
require 'comment'

class RubyInsideExtractor
  attr_reader :blog_posts

  @@web_address = "http://www.rubyinside.com/"

  def initialize
    @blog_posts = []
  end

  def import!
    puts "not implemented"
  end
end

The @blog_posts array will hold all the blog posts. @@web_address has the blog address, so we don't need to repeat it.

The import! method is where we will do the extraction.

After that, we will need a script to call the extraction and show the results; let's call it main.rb:

#!/usr/bin/env ruby
require 'ruby_inside_extractor'

ri_extractor = RubyInsideExtractor.new
ri_extractor.import!

ri_extractor.blog_posts.each do |post|
  puts post.title
  puts '=' * post.title.size
  puts 'by ' + post.author
  puts post.text
  post.comments.each do |comment|
    puts '~' * 10
    puts comment.sender + ' says:'
    puts comment.text
  end
end
After instantiating the extractor class and calling the import! method, this script prints each of the blog posts, including author and comments.

The very first thing we have to do is find out how many pages there are in the blog:

def page_count
  doc = Hpricot(open(@@web_address))

  # the number of the last page is in the penultimate
  # link inside the div with the class "pagebar":
  # return doc.search("div.pagebar a")[-2].inner_text.to_i

  # I suggest forcing a low number, because it would
  # take long to extract all of the ~1060 posts
  return 3
end

The page_count method loads the blog's homepage and finds the last page number, located in the penultimate link inside the div containing pagination stuff, div.pagebar.
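The penultimate-link trick is just Ruby's negative array indexing: search returns an array of elements, and [-2] picks the second-to-last one. With plain strings standing in for the page bar links (hypothetical contents):

```ruby
# the text of the links in a typical page bar:
# « Previous 1 2 3 ... 107 Next »
pagebar_links = ["« Previous", "1", "2", "3", "107", "Next »"]

# [-2] is the penultimate element: the last page number
last_page = pagebar_links[-2].to_i
puts last_page  # 107
```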

For this example the most important line is commented out, because it would take a while to extract all of the (currently) 107 pages.

The Hpricot method loads a document and the search method returns an Array containing all the occurrences of the given selector.

Now, we’re going to load the posts page once for each page. Change your import! method:

def import!
  1.upto(page_count) do |page_number|
    page_doc = Hpricot(open(@@web_address +
      'page/' + page_number.to_s))
  end
end

This will load an Hpricot document for each of the blog pages. For instance, the address for the 5th page is http://www.rubyinside.com/page/5.
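Since the page address is plain string concatenation, a small helper makes it easy to check. page_url is a hypothetical helper written just for this illustration, not part of the extractor.

```ruby
WEB_ADDRESS = "http://www.rubyinside.com/"

# builds the address of a given listing page,
# mirroring the concatenation used in import!
def page_url(page_number)
  WEB_ADDRESS + 'page/' + page_number.to_s
end

puts page_url(5)  # http://www.rubyinside.com/page/5
```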

Let’s search for the url that leads to the page with the complete text and comments for each post:

def import!
  1.upto(page_count) do |page_number|
    page_doc = Hpricot(open(@@web_address +
      'page/' + page_number.to_s))
    page_doc.search('.post.teaser').each do |entry_div|
      # we can access an element's attributes
      # as if it were a Hash
      post_url = entry_div.at('h2 > a')['href']
      @blog_posts << extract_blog_post(post_url)
    end
  end
end

If you look at the Ruby Inside HTML code, you'll find that each blog post teaser is inside a div with the post and teaser classes. The import! method iterates over each of these divs and retrieves the url of the full post with comments. This url is found in the link inside the post title.

After that, it calls the extract_blog_post method, which we will create next, and adds its returning value to the @blog_posts array.

The at method searches for and returns the first occurrence of the selector.

Now, with this url in hand, we can load the page that holds the post title, full text and comments:

def extract_blog_post(post_url)
  blog_post = BlogPost.new
  post_doc = Hpricot(open(post_url))
end



Now, let's collect the post title, author and text:

def extract_blog_post(post_url)
  blog_post = BlogPost.new
  post_doc = Hpricot(open(post_url))

  blog_post.title = post_doc.at('.entryheader h1').inner_text
  blog_post.author = post_doc.at('p.byline a').inner_text

  text_div = post_doc.at('.entrytext')
  # removing unwanted elements
  text_div.search('noscript').remove
  blog_post.text = text_div.inner_text.strip

  # the comments live in a list with the commentlist
  # class (WordPress's default markup)
  blog_post.comments = extract_comments(post_doc.at('.commentlist'))
  blog_post
end

After retrieving the blog title, author and text, we also called the extract_comments method. This method, which we will create next, will return an array of comments.

The remove method removes elements from the document. We're using it because there is a <noscript> tag with unwanted text inside the div with the entrytext class.

Finally, we'll retrieve the post’s comments:

def extract_comments(comments_doc)
  comments = []
  comments_doc.search('li').each { |comment_doc|
    comment = Comment.new
    # the sender name is inside a <cite> tag
    # (WordPress's default comment markup)
    comment.sender = comment_doc.at('cite').inner_text
    comment.text = comment_doc.at('p').inner_text
    comments << comment
  } rescue nil  # covers posts with no comments (comments_doc is nil)
  comments
end

After extracting every post and its comments, the Ruby Inside extractor is ready. Run your main.rb to see the result. :)