play/type blog

We are creating Germany's juiciest event platform, boomloop.com. Because we love the Internet more than our own mothers. See for yourself: check out boomloop.com.


14 Feb

scraping web.de, gmx and freenet addressbooks with the blackbook gem

we’re all gearing up for an ultra-portable internet where your data is not locked up in silos, but as free and careless as ringo starr. until that day comes along, coders will continue to scrape addressbooks, unsexy 2007 stylee.

until two weeks ago, the best way of doing this in ruby was the contacts gem. the new kid on the block is the blackbook gem. it’s based on mechanize and hpricot, and comes with yahoo, gmail, hotmail, and aol scrapers ready to go. it’s much nicer than contacts, and pretty much exactly what we had in mind when we developed thief. we won’t continue developing thief, as blackbook is perfectly good for the job. instead, we’ll port the gmx, web.de and freenet providers back to blackbook.

blackbook is easy to use. first install the gem.

gem install blackbook --include-dependencies

then extract addresses like this:

contacts = Blackbook.get(:username => "bill@hotmail.com", :password => "secret")
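a quick sketch of what you might do with the result. it assumes the contacts come back as an array of hashes with :name and :email keys, which is the shape the gmx provider further down builds. the sample data and the format_contacts helper are made up, so the snippet runs without a login.

```ruby
# stand-in for the result of Blackbook.get; assumed shape: an array of hashes
# with :name and :email keys (sample data, not real output)
contacts = [
  { :name => "Ringo Starr", :email => "ringo@example.com" },
  { :name => "",            :email => "nobody@example.com" }
]

# hypothetical helper: turn each contact into a "Name <email>" line,
# falling back to the bare address when no name was scraped
def format_contacts(contacts)
  contacts.map do |contact|
    name = contact[:name].to_s.strip
    name.empty? ? contact[:email] : "#{name} <#{contact[:email]}>"
  end
end

puts format_contacts(contacts)
```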

Creating new scrapers for blackbook

let’s see how to implement a new provider for gmx.de. the completed provider looks like this:

require 'blackbook/importer/page_scraper'

class Blackbook::Importer::GMX < Blackbook::Importer::PageScraper
  LOGIN_URL = "https://www.gmx.net/"

  def =~( options )
    options && options[:username] =~ /@gmx\.de$/i
  end

  def login
    username, password = options[:username], options[:password]

    begin
      page = agent.get LOGIN_URL

      form = page.forms.with.name("login").first
      form.id = username
      form.p = password

      page = form.submit

      if (continue_link = page.links.select { |link| link.text =~ /E-Mail/ }.first and 
          page.at("div.index").inner_html != "Ordnerwahl")
        page = continue_link.click
      end

      @next = page
    rescue
      raise Blackbook::BadCredentialsError.new
    end
  end

  def prepare
    login
  end

  def scrape_contacts
    page = @next

    contacts = [/Posteingang/, /Archiv/, /Gesendet/].map do |folder|
      page = page.links.select { |link| link.text =~ folder }.first.click
      find_contacts(page)
    end

    # each folder yields its own array, so flatten before removing duplicates
    contacts.flatten.inject([]) do |memo, contact|
      memo << contact unless memo.include? contact
      memo
    end
  end

  protected

    def find_contacts(page)
      links = page.search("form#MI a").select { |link| link.attributes["title"] =~ /@/ }
      links.map do |link|
        recp = link.attributes["title"].gsub(/\n/, "").split(/\s/)
        email = recp.pop.gsub(/[<>]/, "")
        fullname = recp.join(" ")

        { :name => fullname, :email => email }
      end
    end

  Blackbook.register :gmx, self
end

here’s how it’s built:

  1. create a class that extends Blackbook::Importer::PageScraper.
  2. provide a =~ method which tests if the email address can be handled by this provider.
  3. write a login method and call it from prepare. in here you have access to a mechanize agent. use it to navigate your target page. the mechanize documentation covers the details.
  4. create a scrape_contacts method. return an array of contact hashes, each with a :name and an :email.
  5. call Blackbook.register(:gmx, self) so that blackbook can find your provider.
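steps 2 and 4 can be tried offline before you point mechanize at a real inbox. the sketch below lifts the matching, title parsing and de-duplication out of the provider into standalone methods. the method names gmx_address?, parse_recipient and merge_contacts are hypothetical, and the sample titles are invented.

```ruby
# hypothetical standalone versions of the provider's pure-ruby pieces,
# so they can be tried in irb without mechanize or a gmx account

# step 2: does this provider handle the address? (same pattern as the provider)
def gmx_address?(options)
  !!(options && options[:username] =~ /@gmx\.de$/i)
end

# step 4: split a link title like "Full Name <mail@host>" into a contact hash
def parse_recipient(title)
  parts = title.gsub(/\n/, "").split(/\s/)
  email = parts.pop.gsub(/[<>]/, "")
  { :name => parts.join(" "), :email => email }
end

# merge the per-folder lists into one and drop repeated contacts
def merge_contacts(per_folder)
  per_folder.flatten.inject([]) do |memo, contact|
    memo << contact unless memo.include?(contact)
    memo
  end
end

inbox = [parse_recipient("Ringo Starr <ringo@gmx.de>")]
sent  = [parse_recipient("Ringo Starr <ringo@gmx.de>"),
         parse_recipient("George <george@gmx.de>")]

p gmx_address?({ :username => "ringo@gmx.de" })
p merge_contacts([inbox, sent])
```

the de-duplication compares whole hashes, so the same address scraped with two different display names would still show up twice.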

Adding gmx, freenet and web.de scrapers

basically, requiring your provider and making sure it calls register is enough to get blackbook to notice you.
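to make the idea concrete, here is a toy registry. it is not blackbook's actual internals: ToyBook, importer_for and everything else in it are hypothetical names. it only illustrates how a register call plus a =~ method on each provider could be enough to pick the right importer for an address.

```ruby
# toy registry, NOT blackbook's real code: all names here are hypothetical
module ToyBook
  @importers = {}

  class << self
    attr_reader :importers

    # what a register call might boil down to: remember the class under a name
    def register(name, importer_class)
      @importers[name.to_sym] = importer_class
    end

    # ask each registered importer whether it accepts the options,
    # relying on every importer defining =~
    def importer_for(options)
      found = @importers.values.find { |klass| klass.new =~ options }
      found || raise(ArgumentError, "no importer for #{options[:username]}")
    end
  end

  class GMX
    def =~(options)
      options && options[:username] =~ /@gmx\.de$/i
    end

    # registering at the end of the class body means a plain require is enough
    ToyBook.register :gmx, self
  end
end

p ToyBook.importer_for({ :username => "ringo@gmx.de" })
```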

if you can’t be bothered and just want the goods here and now, then you can install a blackbook_extensions plugin with gmx, web.de and freenet like this:

./script/plugin install http://svn.playtype.net/plugins/schwarzesbuch/

go crazy!

Comments


Exactly what I need for a soon to be started project. Thanks a lot!

Rany – thanks for writing this blog post about Blackbook and how you’re using it.

I’d happily integrate your providers into Blackbook. Just send me a patch!

May 08, 2008 at 05:44 PM by Joseariel

Hi, my name is Joseariel,

I’m working on a social network on rails and was wondering if you could post a similar article on how to create a facebook provider for blackbook. This would really help me big time as I am just getting my feet wet with rails and social networks.

Thanks,

Joseariel

Hey guys,

cool stuff!

One question/request: could you make this work for gmx.NET and gmx.CH addresses as well?

That would be awesome!
Tom

Just tested Freenet and Web.de:
- Freenet: only picks up the addresses from the mail folders, not from the address book
- Web.de: there seems to be at least one bug in it.

I'd be very glad about a fix or some feedback!
Thanks,
Tom

Nice post! I was originally planning to write a basic web scraper for my web app in PHP or RoR, but then I came across Feedity ( http://feedity.com ) which made things a lot easier. Feedity generates custom RSS feeds from webpages, and now I just consume the resulting RSS feed in my application. Simple and straight! Check it out sometime!

Hi. To use web.de you need to change the name of the login form to ‘fm’ instead of ‘login’.
Also, I needed to remove the brackets from the emails for my application. The gmx.de importer removed these by default.
Thanks for the awesome work. Any other german email clients in the pipes?

@ivor: i’ve moved the project to github. it would be highly appreciated if any changes were made there!

Hi, I get a Blackbook::BadCredentialsError while importing contacts from Gmail.

contacts = Blackbook.get :username => 'shiv.hemanth@gmail.com', :password => 'xxxxxx', :as => :xml
Blackbook::BadCredentialsError: Must be authenticated to access contacts.
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook/importer/gmail.rb:58:in `scrape_contacts'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook/importer/page_scraper.rb:30:in `fetch_contacts!'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook/importer/base.rb:29:in `import'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook.rb:32:in `export'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook.rb:56:in `get'
from /home/hemanth/5latest/2_0_branch/vendor/plugins/cheald-blackbook/lib/blackbook.rb:16:in `get'
