Feb

scraping web.de, gmx and freenet addressbooks with the blackbook gem
we’re all gearing up for an ultra-portable internet where your data is not locked up in silos, but as free and careless as ringo starr. until that day comes along, coders will continue to scrape addressbooks, unsexy 2007 stylee.
until two weeks ago, the best way of doing this in ruby was the contacts gem. the new kid on the block is the blackbook gem. it’s based on mechanize and hpricot, and comes with yahoo, gmail, hotmail, and aol scrapers ready to go. it’s much nicer than contacts, and pretty much exactly what we had in mind when we developed thief. we won’t continue developing thief, as blackbook is perfectly good for the job. instead, we’ll port the gmx, web.de and freenet providers back to blackbook.
blackbook is easy to use. first install the gem.
gem install blackbook --include-dependencies
then extract addresses like this:
contacts = Blackbook.get(:username => "bill@hotmail.com", :password => "secret")
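a minimal sketch of what you might do with the result, assuming it is an array of hashes with :name and :email keys (the shape the gmx provider further down builds). the sample entries and the format_contact helper are made up for illustration and are not part of blackbook:

```ruby
# Stand-in for the value returned by Blackbook.get. The shape is an
# assumption based on the hashes the gmx provider builds; the entries
# themselves are invented.
contacts = [
  { :name => "Bill Example", :email => "bill@example.com" },
  { :name => "",             :email => "ann@example.com" }
]

# Hypothetical helper: render "Name <email>", falling back to the bare
# address when no name was scraped.
def format_contact(contact)
  name = contact[:name].to_s.strip
  name.empty? ? contact[:email] : "#{name} <#{contact[:email]}>"
end

lines = contacts.map { |contact| format_contact(contact) }
puts lines
```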
Creating new scrapers for blackbook
let’s see how to implement a new provider for gmx.de. the completed provider looks like this:
require 'blackbook/importer/page_scraper'

class Blackbook::Importer::GMX < Blackbook::Importer::PageScraper
  LOGIN_URL = "https://www.gmx.net/"

  # true if this provider can handle the given email address
  def =~( options )
    options && options[:username] =~ /@gmx\.de$/i
  end

  def login
    username, password = options[:username], options[:password]
    begin
      page = agent.get LOGIN_URL
      form = page.forms.with.name("login").first
      form.id = username
      form.p = password
      page = form.submit
      # follow the "E-Mail" link unless we already landed on the folder overview
      if (continue_link = page.links.select { |link| link.text =~ /E-Mail/ }.first and
          page.at("div.index").inner_html != "Ordnerwahl")
        page = continue_link.click
      end
      @next = page
    rescue
      raise Blackbook::BadCredentialsError.new
    end
  end

  def prepare
    login
  end

  def scrape_contacts
    page = @next
    # walk the inbox, archive and sent folders
    contacts = [/Posteingang/, /Archiv/, /Gesendet/].map do |folder|
      page = page.links.select { |link| link.text =~ folder }.first.click
      find_contacts(page)
    end
    # find_contacts returns one array per folder, so flatten before removing duplicates
    contacts.flatten.inject([]) do |memo, contact|
      memo << contact unless memo.include? contact
      memo
    end
  end

  protected

  # collect name/email pairs from the title attributes of the message links
  def find_contacts(page)
    links = page.search("form#MI a").select { |link| link.attributes["title"] =~ /@/ }
    links.map do |link|
      recp = link.attributes["title"].gsub(/\n/, "").split(/\s/)
      email = recp.pop.gsub(/[<>]/, "")
      fullname = recp.join(" ")
      { :name => fullname, :email => email }
    end
  end

  Blackbook.register :gmx, self
end
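the parsing and de-duplication steps can be tried without logging in anywhere. this is only a sketch: parse_recipient is a hypothetical helper that mirrors the string handling in find_contacts, and the recipient strings are invented. since every folder yields its own array, the sketch flattens before dropping duplicates.

```ruby
# Hypothetical helper mirroring find_contacts: split a recipient string
# such as "Max Mustermann <max@gmx.de>" into a contact hash.
def parse_recipient(title)
  parts = title.gsub(/\n/, "").split(/\s/)
  email = parts.pop.gsub(/[<>]/, "")
  { :name => parts.join(" "), :email => email }
end

# Invented sample data: one array of recipient strings per mail folder.
per_folder = [
  ["Max Mustermann <max@gmx.de>", "erika@web.de"],
  ["Max Mustermann <max@gmx.de>"]
].map { |titles| titles.map { |title| parse_recipient(title) } }

# One array per folder, so flatten first, then remove duplicates.
unique = per_folder.flatten.uniq
```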
here’s how it’s built:
- create a class that extends Blackbook::Importer::PageScraper.
- provide a =~ method which tests if the email address can be handled by this provider.
- write a login method. in here you have access to a mechanize agent. use it to navigate your target page. you’ll find detailed documentation here.
- create a scrape_contacts method. return an array of contact hashes, each with a :name and an :email key.
- call Blackbook.register(:gmx, self) so that blackbook can find your provider.
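to see how the =~ test and the register call could fit together, here is a self-contained toy. TinyBook and importer_for are hypothetical names, and this is not blackbook’s actual code, just one plausible way a registry might pick a provider for a given address:

```ruby
# Hypothetical stand-in for a provider registry; NOT the gem's implementation.
module TinyBook
  @importers = {}

  class << self
    # Store one instance of the provider class under the given name.
    def register(name, klass)
      @importers[name] = klass.new
    end

    # Return the name of the first provider whose =~ accepts the options,
    # or nil if none does.
    def importer_for(options)
      found = @importers.find { |_name, importer| importer =~ options }
      found && found.first
    end
  end

  class GMX
    # Same address test as the gmx provider in this post.
    def =~(options)
      options && options[:username] =~ /@gmx\.de$/i
    end

    # Registering at the end of the class body means a plain require
    # is enough to make the provider known.
    TinyBook.register :gmx, self
  end
end

puts TinyBook.importer_for(:username => "tom@gmx.de").inspect
```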
Adding gmx, freenet and web.de scrapers
basically, requiring your provider and making sure it calls register is enough to get blackbook to notice you.
if you can’t be bothered and just want the goods here and now, then you can install a blackbook_extensions plugin with gmx, web.de and freenet like this:
./script/plugin install http://svn.playtype.net/plugins/schwarzesbuch/
go crazy!



Comments
Exactly what I need for a soon to be started project. Thanks a lot!
Rany – thanks for writing this blog post about Blackbook and how you’re using it.
I’d happily integrate your providers into Blackbook. Just send me a patch!
Hi, my name is Joseariel,
I’m working on a social network on rails and was wondering if you could post a similar article on how to create a facebook provider for blackbook. This would really help me big time as I am just getting my feet wet with rails and social networks.
Thanks,
Joseariel
Hey guys,
cool stuff!
One question/request: could you make this work for gmx.NET and gmx.CH addresses as well?
That would be really awesome!
Tom
I just tested Freenet and Web.de:
- Freenet: only picks up the addresses from the mail folders, not from the address book
- Web.de: there seems to be at least one bug in it.
I would be very glad about a fix or some feedback!
Thanks,
Tom
Nice post! I was originally planning to write a basic web scraper for my web app in PHP or RoR, but then I came across Feedity ( http://feedity.com ) which made things a lot easier. Feedity generates custom RSS feeds from webpages, and now I just consume the resulting RSS feed in my application. Simple and straight! Check it out sometime!
Hi. To use web.de you need to change the name of the login form to ‘fm’ instead of ‘login’.
Also, I needed to remove the brackets from the emails for my application. The gmx.de importer removed these by default.
Thanks for the awesome work. Any other german email clients in the pipes?
@ivor: i’ve moved the project to github. highly appreciated if any changes are made there!
Hi, I get a Blackbook::BadCredentialsError while importing contacts from Gmail.
contacts = Blackbook.get :username => 'shiv.hemanth@gmail.com', :password => 'xxxxxx', :as => :xml
Blackbook::BadCredentialsError: Must be authenticated to access contacts.
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook/importer/gmail.rb:58:in `scrape_contacts'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook/importer/page_scraper.rb:30:in `fetch_contacts!'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook/importer/base.rb:29:in `import'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook.rb:32:in `export'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook.rb:56:in `get'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook.rb:16:in `get'