we’re all gearing up for an ultra-portable internet where your data is not locked up in silos, but as free and careless as ringo starr. until that day comes along, coders will continue to scrape addressbooks, unsexy 2007 stylee.
until two weeks ago, the best way of doing this in ruby was the contacts gem. the new kid on the block is the blackbook gem. it’s based on mechanize and hpricot, and comes with yahoo, gmail, hotmail, and aol scrapers ready to go. it’s much nicer than contacts, and pretty much exactly what we had in mind when we developed thief. we won’t continue developing thief, as blackbook is perfectly good for the job. instead, we’ll port the gmx, web.de and freenet providers back to blackbook.
blackbook is easy to use. first install the gem.
gem install blackbook --include-dependencies
then extract addresses like this:
contacts = Blackbook.get(:username => "bill@hotmail.com", :password => "secret")
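what you do with the result is up to you. as a rough sketch, assuming each contact comes back as a hash with :name and :email keys (that is the shape the gmx provider further down produces), you could turn the list into printable lines like this. the sample data is made up and format_contact is a hypothetical helper, not part of blackbook:

```ruby
# sample data standing in for the result of Blackbook.get (made up for illustration)
contacts = [
  { :name => "Ringo Starr", :email => "ringo@example.com" },
  { :name => "",            :email => "paul@example.com" }
]

# format_contact is a hypothetical helper, not part of blackbook.
# it falls back to the bare address when no name was scraped.
def format_contact(contact)
  name = contact[:name].to_s.strip
  name.empty? ? contact[:email] : "#{name} <#{contact[:email]}>"
end

lines = contacts.map { |c| format_contact(c) }
puts lines
```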
Creating new scrapers for blackbook
let’s see how to implement a new provider for gmx.de. the completed provider looks like this:
require 'blackbook/importer/page_scraper'

class Blackbook::Importer::GMX < Blackbook::Importer::PageScraper
  LOGIN_URL = "https://www.gmx.net/"

  # matches if the username is a gmx.de address
  def =~( options )
    options && options[:username] =~ /@gmx\.de$/i
  end

  def login
    username, password = options[:username], options[:password]
    begin
      page = agent.get LOGIN_URL
      form = page.forms.with.name("login").first
      form.id = username
      form.p = password
      page = form.submit
      # if we did not land on the folder overview ("Ordnerwahl"), follow the E-Mail link
      if (continue_link = page.links.select { |link| link.text =~ /E-Mail/ }.first and
          page.at("div.index").inner_html != "Ordnerwahl")
        page = continue_link.click
      end
      @next = page
    rescue
      raise Blackbook::BadCredentialsError.new
    end
  end

  def prepare
    login
  end

  def scrape_contacts
    page = @next
    # visit each folder and collect the addresses listed there
    contacts = [/Posteingang/, /Archiv/, /Gesendet/].map do |folder|
      page = page.links.select { |link| link.text =~ folder }.first.click
      find_contacts(page)
    end
    # find_contacts returns one array per folder, so flatten before removing duplicates
    contacts.flatten.inject([]) do |memo, contact|
      memo << contact unless memo.include? contact
      memo
    end
  end

  protected

  # the title attribute looks like "Full Name <address>", so the last token is the email
  def find_contacts(page)
    links = page.search("form#MI a").select { |link| link.attributes["title"] =~ /@/ }
    links.map do |link|
      recp = link.attributes["title"].gsub(/\n/, "").split(/\s/)
      email = recp.pop.gsub(/[<>]/, "")
      fullname = recp.join(" ")
      { :name => fullname, :email => email }
    end
  end

  Blackbook.register :gmx, self
end
here’s how it’s built:
- create a class that extends Blackbook::Importer::PageScraper.
- provide a =~ method which tests if the email address can be handled by this provider.
- write a login method and call it from prepare. in here you have access to a mechanize agent. use it to navigate your target page. the mechanize documentation covers the details.
- create a scrape_contacts method. return an array of contact hashes, each with :name and :email keys.
- call Blackbook.register(:gmx, self) so that blackbook can find your provider.
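the fiddly part of scrape_contacts is turning scraped strings into contacts and merging the results from several folders. here is a standalone sketch of that step with no mechanize involved. parse_recipient is a made-up name that mirrors the string handling in find_contacts, and the sample titles are invented:

```ruby
# parse_recipient mirrors the string handling in find_contacts;
# the method name and the sample titles are made up for this sketch
def parse_recipient(title)
  parts = title.gsub(/\n/, "").split(/\s/)
  email = parts.pop.gsub(/[<>]/, "")   # last token is the address
  { :name => parts.join(" "), :email => email }
end

# one list of title attributes per folder, as scrape_contacts would see them
folders = [
  ["Hans Wurst <hans@gmx.de>", "lisa@web.de"],
  ["Hans Wurst <hans@gmx.de>"]
]

contacts = folders.map { |titles| titles.map { |t| parse_recipient(t) } }

# merge the per-folder lists into one, then drop duplicates
unique = contacts.flatten.inject([]) do |memo, contact|
  memo << contact unless memo.include?(contact)
  memo
end

unique.each { |c| puts c.inspect }
```

hans shows up in both folders but only once in the result, and an address without a name gets an empty :name.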
Adding gmx, freenet and web.de scrapers
basically, requiring your provider and making sure it calls register is enough to get blackbook to notice you.
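to see why that is enough, here is a toy version of the idea. this is not blackbook’s real code, just a sketch of how a registry and the =~ method could work together. ToyBook, ToyGmx, ToyWebDe and importer_for are all made-up names:

```ruby
# toy registry, NOT blackbook's real implementation
module ToyBook
  @importers = {}

  # called by each provider while its class body is being loaded
  def self.register(name, klass)
    @importers[name] = klass
  end

  # ask each registered importer whether it handles these options
  def self.importer_for(options)
    @importers.values.map { |klass| klass.new }.find { |imp| imp =~ options }
  end
end

class ToyGmx
  def =~(options)
    # !! turns the match position (or nil) into true/false
    !!(options && options[:username] =~ /@gmx\.de$/i)
  end
  ToyBook.register :gmx, self
end

class ToyWebDe
  def =~(options)
    !!(options && options[:username] =~ /@web\.de$/i)
  end
  ToyBook.register :webde, self
end

picked = ToyBook.importer_for(:username => "hans@gmx.de")
puts picked.class   # prints ToyGmx
```

because register runs as soon as the class body is evaluated, loading the file is all it takes for the provider to show up in the lookup.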
if you can’t be bothered and just want the goods here and now, then you can install a blackbook_extensions plugin with gmx, web.de and freenet like this:
./script/plugin install http://svn.playtype.net/plugins/schwarzesbuch/
go crazy!



