Geocoding Using the Google Maps Geocoder via OpenRefine

On my to do list in the data education game is an open-ended activity based on collecting examples of how to do common tasks using a variety of different tools and techniques. One such task is geo-coding, getting latitude and longitude co-ordinates for a location based on it’s address or postcode. A couple of recent posts on the School of Data blog provide an Introduction to Geocoding and a howto on geocoding in Google spreadsheets. Here’s another approach, geocoding using OpenRefine [download and install OpenRefine] (see also Paul Bradshaw’s related post from a couple of years ago: Getting Full Addresses for Data from an FOI Response (Using APIs)).

First, we need to grab some data. I’m going to use a data file from the Food Standards Agency that contains food standards ratings for food outlets within a particular local council area. This data file is a relatively simple XML format that OpenRefine should have no trouble handling… Let’s get the data directly from the data file’s web address:

GetSomeDataIntoRefine

OpenRefine will now download the file and give us a preview. What we need to do now is highlight an example of a record element – OpenRefine will create one row for each such record in the original file.

Highlight the record element

(For the purposes of this example, it might have been easier to just select the postcode element within each establishment record.)

Click “Create Project” in the top right corner, and the data will be imported into a tabular view:

wehavedataIncluding postcodes

If we look along the columns, we notice a postcode element- we can use that as the basis for geocoding. (Yes, I know, I know, this dataset already has latitude and longitude data, so we don’t really need to get it for ourselves;-) But I thought this datafile was a handy source of postcodes;-) What we’re going to do now is ask OpenRefine to create a new column that will be populated with data from the Google maps geocoding API that gives the latitude and longitude for each postcode:

weshalladdacolumn

We can pass a postcode to the Google maps geocoder using a URL (web address) constructed as follows:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE"

(We could force the locale – that the postcode is in the UK – by using "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE,UK".)

*IF YOU WANT OPEN GEO-CODED DATA, GOOGLE MAPS WON’T CUT IT… *

Note also that we need to “escape” the postcode so that it works in the URL (this essentially means handling spaces in the postcode). To do this, we will construct a URL using the follow GREL programming language statement:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")

or alternatively:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")+",UK"

What this does is build up the URL from the value of the cells in the column from which our new column is derived. The escape(value,"url") command makes sure that the postcode value is correctly encoded for use in a url.

generate the geocoding URL

We can now let OpenRefine construct a URL containing the postcode for each row in the data file and load the requested data in from the URL (the throttle delay sets the rate at which OpenRefine will make repeated calls to the API – many APIs have a limit on both the rate at which you can call them (once a second, for example), as well as a limit on the total number of calls per hour, or day). The returned data will be added to the appropriate row in the new column:

ecample geo

The looks a bit confusing, doesn’t it? However, it does have some structure to it. Here’s how the data unpacks if we lay it out a bit better:

{ "results" : [
   { "address_components" : [
     { "long_name" : "NG4 4DL",
       "short_name" : "NG4 4DL",
       "types" : [ "postal_code" ] 
     },
     { "long_name" : "Gedling",
       "short_name" : "Gedling",
       "types" : [ "locality", "political" ]
     },
     { "long_name" : "Nottinghamshire",
       "short_name" : "Notts",
       "types" : [ "administrative_area_level_2", "political" ]
     },
     { "long_name" : "United Kingdom",
       "short_name" : "GB", "types" : [ "country", "political" ] 
     },
     { "long_name" : "Nottingham",
       "short_name" : "Nottingham",
       "types" : [ "postal_town" ] 
     }
     ],
     "formatted_address" : "Gedling, Nottinghamshire NG4 4DL, UK",
     "geometry" : {
         "bounds" : {
             "northeast" : { "lat" : 52.97819320, "lng" : -1.09096230 },
             "southwest" : { "lat" : 52.9770690, "lng" : -1.0924360 } 
         },
         "location" : {
           "lat" : 52.97764640,
           "lng" : -1.09182150 
         },
         "location_type" : "APPROXIMATE",
         "viewport" : {
           "northeast" : { "lat" : 52.97898008029150, "lng" : -1.090350169708498 },
           "southwest" : { "lat" : 52.97628211970850, "lng" : -1.093048130291502 }
         }
       },
       "types" : [ "postal_code" ] 
    }
 ], "status" : "OK" }

(For a little more insight into the structure of data like this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure.)

So OpenRefine has pulled back some data – we can now generate another column or two to extract the latitude and longitude the Google Maps geocoding API sent back:

colfromcol

We need to find a way of extracting the data element we want from the data we got back from the geocoding API – we can use another GREL programming language statement to do this:

openrefine jsonparse

Here’s how to construct that statement (try to read the following with reference to the laid out data structure shown above…):

  • value.parseJson() parses the data text (the value of the cell from the original column) into a data element we can work with programmatically
  • value.parseJson().results[0] gets the the data in the first “result” (computer programmes generally start a count with a 0 rather than a 1…)
  • value.parseJson().results[0][“geometry”] says get the “geometry” element from the first result;
  • value.parseJson().results[0][“geometry”][“location”] says get the location element from within that geometry element;;
  • value.parseJson().results[0][“geometry”][“location”][“lat”] says get the latitude element from within the corresponding location element.

Here’s what we get – a new column containing the latitude, as provided by the geocoder API from the postcode:

newcolgeo

Hopefully you can see how we would be able to create a new column containing the longitude: go back to the Googlegeo column, select Edit column – Add column based on this column for a second time, and construct the appropriate parsing statement (value.parseJson().results[0][“geometry”][“location”][“lng”]).

If OpenRefine is new to you, it’s worth reviewing what steps are involved in this recipe:

– loading in an XML data file from a web address/URL;
– generating a set of URLs that call a remote API/data source based on the contents of a particular column, and loading data from those URLs into a new column;
– parsing the data returned from the API to extract a particular data element and adding it to a new column.

Many of these steps are reusable in other contexts, of course…;-) (Can you think of any?!)

PS Tom Morris pointed out to me (on the School of Data mailing list) that the Google Maps geocoding API terms of use only allow the use of the API for geocoding points in order to locate them on a Google Map… Which is probably not what you’d want this data for…;-) However, the recipe is a general one – if you can find another geocoding service that accepts an escaped address term as part of a URL and returns geo-data in a JSON format, you should be able to work out how to use that API instead using this self-same recipe… (Just build a different URL and parse the returned JSON appropriately…)

I’ll try to keep a list of example API URLs for more open geocoders here:

http://open.mapquestapi.com/nominatim/v1/?format=json&q=London (no postcodes?)
http://mapit.mysociety.org/postcode/MK7%206AA (UK postcode geocoder)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

4 thoughts on “Geocoding Using the Google Maps Geocoder via OpenRefine”

  1. Hello Tony, Thanks for a great post. Very clear and with just a bit more help with getting my JSON output formatted to be able to discern the elements and objects — I was able to accomplish getting latitude and longitude teased out of a similar geolocation web service result set. Thanks, Deb.

    1. Hi Deb – Glad to hear the method worked for you:-) Was there anything I should have mentioned/explained (but didn’t) that would have made it clearer? Were there any particularly ambiguous bits? Out of interest, what was the other geolocation service you used?

  2. Hi Tony, thanks very much – found this really helpful. I keep hitting a wall though when trying to extract postcodes. I’ve worked out I need this…

    value.parseJson().results[0][“address_components”][5][“long_name”]

    …but the problem is, 5 sometimes needs to be changed to 6, depending on the source address. For some records when I use 5, something else is extracted, such as the town or country. For these records, changing the 5 to 6 seems to fix the problem.

    Is there a way to force extraction of JUST the postcode, regardless of whether it’s “5” or “6”?

    I hope that’s clear…?

    1. Hi Henry – is there another crib you can use? For example, is the postcode always the last element, or the last but one element? If so – and I havenlt tried this – maybe you could index from the back of the list, using [-1] for the last element, [-2] for the last but one element. (Not sure if this will definitely work, but many languages do support this sort of indexing;-)

Comments are closed.