play/type blog



we’re all gearing up for an ultra-portable internet where your data is not locked up in silos, but as free and careless as ringo starr. until that day comes along, coders will continue to scrape address books, unsexy 2007 stylee.

until two weeks ago, the best way of doing this in ruby was the contacts gem. the new kid on the block is the blackbook gem. it’s based on mechanize and hpricot, and comes with yahoo, gmail, hotmail, and aol scrapers ready to go. it’s much nicer than contacts, and pretty much exactly what we had in mind when we developed thief. we won’t continue developing thief, as blackbook is perfectly good for the job. instead, we’ll port the gmx, web.de and freenet providers back to blackbook.

blackbook is easy to use. first install the gem.

gem install blackbook --include-dependencies

then extract addresses like this:

contacts = Blackbook.get(:username => "bill@hotmail.com", :password => "secret")
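
judging by the provider code further down, each contact comes back as a hash with :name and :email keys. assuming that shape, here is a rough sketch of turning the result into address lines. the sample array is made up and stands in for a real Blackbook.get call, and format_contact is our own helper, not part of blackbook:

```ruby
# stand-in for the result of Blackbook.get; the shape (an array of
# hashes with :name and :email) is an assumption based on the gmx provider
contacts = [
  { :name => "Bill Gates", :email => "bill@hotmail.com" },
  { :name => "",           :email => "steve@example.com" }
]

# hypothetical helper: format a contact as "Name <email>",
# or just the email if no name is known
def format_contact(contact)
  name = contact[:name].to_s.strip
  name.empty? ? contact[:email] : "#{name} <#{contact[:email]}>"
end

lines = contacts.map { |contact| format_contact(contact) }
puts lines
```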

Creating new scrapers for blackbook

let’s see how to implement a new provider for gmx.de. the completed provider looks like this:

require 'blackbook/importer/page_scraper'

class Blackbook::Importer::GMX < Blackbook::Importer::PageScraper
  LOGIN_URL = "https://www.gmx.net/"

  def =~( options )
    options && options[:username] =~ /@gmx\.de$/i
  end

  def login
    username, password = options[:username], options[:password]

    begin
      page = agent.get LOGIN_URL

      form = page.forms.with.name("login").first
      form.id = username
      form.p = password

      page = form.submit

      if (continue_link = page.links.select { |link| link.text =~ /E-Mail/ }.first and 
          page.at("div.index").inner_html != "Ordnerwahl")
        page = continue_link.click
      end

      @next = page
    rescue
      raise Blackbook::BadCredentialsError.new
    end
  end

  def prepare
    login
  end

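  def scrape_contacts
    page = @next

    # visit each folder in turn; find_contacts returns one array per folder,
    # so flatten the result into a single list of contact hashes
    contacts = [/Posteingang/, /Archiv/, /Gesendet/].map do |folder|
      page = page.links.select { |link| link.text =~ folder }.first.click
      find_contacts(page)
    end.flatten

    # drop duplicate contacts, keeping the first occurrence
    contacts.inject([]) do |memo, contact|
      memo << contact unless memo.include? contact
      memo
    end
  end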

  protected

    def find_contacts(page)
      links = page.search("form#MI a").select { |link| link.attributes["title"] =~ /@/ }
      links.map do |link|
        recp = link.attributes["title"].gsub(/\n/, "").split(/\s/)
        email = recp.pop.gsub(/[<>]/, "")
        fullname = recp.join(" ")

        { :name => fullname, :email => email }
      end
    end

  Blackbook.register :gmx, self
end

here’s how it’s built:

  1. create a class that extends Blackbook::Importer::PageScraper.
  2. provide a =~ method which tests if the email address can be handled by this provider.
  3. write a login method. in here you have access to a mechanize agent. use it to navigate your target page. you’ll find detailed documentation here.
  4. create a scrape_contacts method. return an array of contact hashes, each with a :name and an :email key.
  5. call Blackbook.register(:gmx, self) so that blackbook can find your provider.
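
the data-wrangling part of step 4 can be tried without a mail account. the sketch below copies the parsing and de-duplication logic from the gmx provider into two standalone helpers. the names parse_recipient and unique_contacts are ours, and the sample titles are invented; what gmx really puts into the title attribute may look different:

```ruby
# split a recipient string such as "Bill Gates <bill@hotmail.com>"
# into a name and an email address, the same way find_contacts does
def parse_recipient(title)
  parts = title.gsub(/\n/, "").split(/\s/)
  email = parts.pop.gsub(/[<>]/, "")
  { :name => parts.join(" "), :email => email }
end

# drop duplicates while keeping the first occurrence of each contact
def unique_contacts(contacts)
  contacts.inject([]) do |memo, contact|
    memo << contact unless memo.include?(contact)
    memo
  end
end

# made-up sample data; the first and last entries are the same person
titles = [
  "Bill Gates <bill@hotmail.com>",
  "steve@example.com",
  "Bill Gates <bill@hotmail.com>"
]

contacts = unique_contacts(titles.map { |title| parse_recipient(title) })
contacts.each { |contact| puts contact.inspect }
```

the last word of the title is treated as the address and everything before it as the name, so a bare address simply ends up with an empty name.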

Adding gmx, freenet and web.de scrapers

basically, requiring your provider and making sure it calls register is enough to get blackbook to notice you.
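
to get a feel for why register plus =~ is all it takes, here is a toy version of the idea. this is not blackbook’s actual code: ToyBook, ToyBook.importer_for and ToyGMX are invented for illustration, and we are only assuming that blackbook does something roughly similar when it picks a provider:

```ruby
# NOTE: a toy registry, not blackbook's real implementation.
# every name in here is made up.
module ToyBook
  @importers = {}

  # remember a provider class under a short name
  def self.register(name, klass)
    @importers[name] = klass
  end

  # return an instance of the first registered provider
  # whose =~ method accepts the given options, or nil
  def self.importer_for(options)
    @importers.values.map { |klass| klass.new }.find { |importer| importer =~ options }
  end
end

class ToyGMX
  # same test as in the gmx provider: does the username end in @gmx.de?
  def =~(options)
    options && options[:username] =~ /@gmx\.de$/i
  end

  # registering at class-definition time means a plain require is enough
  ToyBook.register :gmx, self
end

importer = ToyBook.importer_for(:username => "anna@gmx.de")
puts importer.class
```

because the register call runs as soon as the class body is loaded, there is nothing else to wire up once the file has been required.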

if you can’t be bothered and just want the goods here and now, then you can install a blackbook_extensions plugin with gmx, web.de and freenet like this:

./script/plugin install http://svn.playtype.net/plugins/schwarzesbuch/

go crazy!