About writing scrapers

We need to get data into EveryPolitician, and scrapers are a big part of that process. We’re busy writing more, but we’d be very happy if you wrote one for us.

Maybe you’ll write the scraper that pulls in the data for your own country. Or maybe you’ll just pick one that we need (see under: “Scraper needed”) simply because you’re a kind-hearted, helpful developer. Either way: we love you!

The basic task

The problem your scraper is solving is to convert the source data (typically, but not always, a website or page) into a format we can easily import.

Also, note that your scraper will not be part of the EveryPolitician codebase. It’s supporting technology, certainly, but EveryPolitician is really only interested in the data it provides. EveryPolitician doesn’t run scrapers; it just consumes their data.

Using morph.io

It’s certainly not mandatory, but if you don’t already have somewhere to run and host your scraper, we encourage you to use morph.io. It’s a platform that not only hosts and runs your scraper, but also exposes the data it yields over an API. This is great for EveryPolitician — it means we can grab the output from your scraper easily just by knowing its URL.

But it’s also helpful for anyone else who could use that data directly. (This makes sense when you consider that your scraper might be collecting some fields which EveryPolitician discards: we ignore fields that we’re not interested in storing, because we have a mission to keep the data consistent and useful across all the countries, but some of the data we choose to skip might be useful to someone else.)
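
To make "knowing its URL" concrete, here is a minimal sketch of how a data URL for a morph.io scraper might be built. It assumes morph.io serves data at api.morph.io/<owner>/<scraper>/data.json with key and query parameters; the owner and scraper names are borrowed from the GitHub example later on this page and may not match the morph.io names, and morph_data_url is a made-up helper, so check the API details on your own scraper's morph.io page.

```ruby
require 'uri'

# Placeholder values: substitute your own morph.io account, scraper and API key.
OWNER   = 'tmtmtmtm'
SCRAPER = 'us-virgin-islands-legislature'
API_KEY = 'YOUR_MORPH_API_KEY'

# Hypothetical helper: builds the URL for a scraper's data as JSON.
# Assumption: the endpoint is api.morph.io/<owner>/<scraper>/data.json and it
# accepts the API key and an SQL query as URL parameters.
def morph_data_url(owner, scraper, key, sql = 'SELECT * FROM data')
  query = URI.encode_www_form(key: key, query: sql)
  URI::HTTPS.build(host: 'api.morph.io',
                   path: "/#{owner}/#{scraper}/data.json",
                   query: query)
end

url = morph_data_url(OWNER, SCRAPER, API_KEY)
puts url
```

The default query selects everything from the data table, which is the table morph.io writes to unless you name another one.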

Multiple sources: we merge!

It turns out that it’s not uncommon for a country’s politician data to come from more than one source. That’s OK: EveryPolitician expects to have to merge them. A simple example of how this might come about is if the politicians’ membership information (that is: who’s in which political party) is available on one site, but their date-of-birth and gender data is held on another.

In such cases, simply write separate scrapers, and let EveryPolitician deal with joining them together. Of course you’ll need to provide some common field or fields that allow the separate sources to be mapped together: often this is the politician’s name. Then we can merge the different data on that. And, yes, we can ease some of the potential pain by using some fuzzy name-matching if the spelling or format is a little wobbly on the different sources.
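
As a rough illustration of merging on a name, the sketch below joins two small sets of rows after normalising case, punctuation and spacing. This is not EveryPolitician's actual merge code, and simple normalisation is only a stand-in for real fuzzy matching; the names, field names and values are all invented for the example.

```ruby
# Invented rows, shaped as they might come out of two separate scrapers.
memberships = [
  { name: 'Jane Q. Example', party: 'Party A' },
  { name: 'JOHN  SAMPLE',    party: 'Party B' },
]
biographies = [
  { name: 'jane q example', gender: 'female', birth_date: '1970-01-01' },
  { name: 'John Sample',    gender: 'male',   birth_date: '1965-06-15' },
]

# Reduce a name to a comparison key: lower case, no punctuation, single spaces.
def name_key(name)
  name.downcase.gsub(/[[:punct:]]/, '').split.join(' ')
end

# Attach the secondary source's fields to each primary row with the same key.
# Where both sources have a field (such as :name), the primary value is kept.
def merge_on_name(primary, secondary)
  lookup = secondary.map { |row| [name_key(row[:name]), row] }.to_h
  primary.map do |row|
    extra = lookup.fetch(name_key(row[:name]), {})
    extra.merge(row)
  end
end

merged = merge_on_name(memberships, biographies)
merged.each { |row| puts row.inspect }
```

Rows with no match in the second source pass through unchanged, so nothing from the primary source is lost.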

An example

Since we’ve written a lot of scrapers for this project, we’ve got into a groove, and most of the time we use the same preamble to load the things we know make this work easier. You can write your scraper in whatever language you prefer, but we write almost all of ours in Ruby. So in many cases our scraper scripts start like this:

#!/usr/bin/env ruby
# encoding: utf-8

require 'scraperwiki'
require 'nokogiri'
require 'pry'
require 'open-uri/cached'

The scraperwiki gem provides access to all the magic of morph.io, including the implicit database that is available for every scraper. The nokogiri gem provides powerful HTML parsing tools.

There are often two routines in each scraper. It won’t always be the case, but more often than not it’s worked for us:

  • scrape_list(url) — parses the list of politicians, isolating each one by name

  • scrape_person(url) — parses the data for each individual politician

Here’s an example of a scraper we’re using to get the data for the US Virgin Islands: github.com/tmtmtmtm/us-virgin-islands-legislature/blob/master/scraper.rb.

That’s hitting www.legvi.org/index.php/senator-marvin-blyden and you can see on the left-hand side of that page a list of senators. If you look at the underlying HTML, you’ll see that list is in <div class="mod-inner">, specifically the <a> tags within the <li> tags within that. This is isolated with a single call to nokogiri:

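def scrape_list(url)
  # noko_for (not shown here) fetches the page and returns a Nokogiri document
  noko = noko_for(url)
  # Each senator is an <a> within an <li> within .mod-inner
  noko.css('.mod-inner li a').each do |a|
    mp_url = URI.join url, a.attr('href')
    scrape_person(a.text, mp_url)
  end
end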

The data is collected in a hash called data, which is then saved to the database, using the :id field to distinguish each record. (As we’re only scraping information for a single period here, that will be unique.) Sometimes there will be an obvious value on the page that can be used as an identifier, but here we construct it from the URL:

def scrape_person(name, url)
  noko = noko_for(url)
  data = {
    id: url.to_s.split("/").last,
    name: name.sub('Senator ', ''),
    image: noko.css('img[src*="/Senators/"]/@src').text,
    source: url.to_s,
  }
  # Turn a relative image path into an absolute URL, if an image was found
  data[:image] = URI.join(url, data[:image]).to_s unless data[:image].to_s.empty?
  ScraperWiki.save_sqlite([:id], data)
end

(You might be wondering why we’re gathering so little information here. Didn’t we say that we needed things like area, political party, etc. too? Well, that’s because we’re getting that information from a separate scraper, using the Wikipedia page of the election results, and then merging the two together as described above.)
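
To see what the id, name and image handling in scrape_person produces, here is a small sketch that applies the same string operations to plain values, so it runs without fetching or parsing a page. The person_record helper, the http:// scheme and the image path are assumptions made for the example; only the page address and the "Senator " prefix come from the walkthrough above.

```ruby
require 'uri'

# Hypothetical helper: builds one record the way scrape_person does, but from
# plain strings rather than a parsed page.
def person_record(link_text, page_url, image_src)
  data = {
    id:     page_url.split('/').last,       # last path segment of the URL
    name:   link_text.sub('Senator ', ''),  # drop the title from the link text
    image:  image_src.to_s,
    source: page_url,
  }
  # Resolve a relative image path against the page URL; leave it blank if absent.
  data[:image] = URI.join(page_url, data[:image]).to_s unless data[:image].empty?
  data
end

page = 'http://www.legvi.org/index.php/senator-marvin-blyden' # scheme assumed

with_image    = person_record('Senator Marvin Blyden', page, '/images/Senators/example.jpg')
without_image = person_record('Senator Marvin Blyden', page, nil)

puts with_image.inspect
puts without_image.inspect
```

Deriving the id from the URL is only safe while each politician has exactly one page address, which is why a stable identifier from the page itself is preferable when one exists.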

Data and terms

The politician’s data is being stored in a table called data (actually this is the default behaviour for morph.io).

But EveryPolitician also needs to know the names and dates of the legislative period(s) / political term(s) that this data applies to. You could simply tell us that, but the best approach is to store them in a terms table alongside the data table, and then we can read it automatically.

This is easy to do. Just pass the table name as a third parameter to the call to save_sqlite:

ScraperWiki.save_sqlite([:id], term_data, 'terms')

You can see an explicit example of this in the scraper for Tuvalu.
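
As a sketch of what term_data might contain, the example below builds one term record and checks its dates before saving. The field names (id, name, start_date, end_date), the values and the valid_term? helper are assumptions for illustration, not a confirmed EveryPolitician schema; the save call is guarded so the snippet also runs where the scraperwiki gem is not loaded.

```ruby
require 'date'

# Hypothetical term record: field names and values are illustrative only.
term_data = {
  id:         'term-1',
  name:       'Example Term',
  start_date: '2015-01-01',
  end_date:   '2018-12-31',
}

# Hypothetical check: both dates must be ISO 8601 and in the right order.
def valid_term?(term)
  Date.iso8601(term[:start_date]) <= Date.iso8601(term[:end_date])
rescue ArgumentError, TypeError
  false
end

if valid_term?(term_data)
  # Same call as above; skipped when the scraperwiki gem has not been required.
  ScraperWiki.save_sqlite([:id], term_data, 'terms') if defined?(ScraperWiki)
  puts "Term ready to save: #{term_data[:name]}"
else
  warn "Bad term dates: #{term_data.inspect}"
end
```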

Finally, EveryPolitician collects the data

When you’ve done all the work and your scraper is written, EveryPolitician retrieves the data you’ve created by pulling it down from the URL you provide. The actual mechanism is driven by Rake tasks, and doesn’t really affect the way you choose to write your scraper. However, if you look at the instructions.json file that is used to direct EveryPolitician when it collects and merges data from different sources, you can see how it pulls data in from scrapers that effectively map to the different sources. Here’s the example for Luxembourg.