One of the most frequently encountered ways of sharing small datasets is in the form of Excel spreadsheet (.xls) files, notwithstanding all that can be said In Praise of CSV;-) The natural application for opening these files is Microsoft Excel, but what if you don’t have a copy of Excel available?
There are other desktop office suites that can open spreadsheet files, of course, such as Open Office. As long as they’re not too big, spreadsheet files can also be uploaded to and then opened using a variety of online services, such as Google Spreadsheets, Google Fusion Tables or Zoho Sheet. But spreadsheet applications aren’t the only data wrangling tools that can be used to open xls files… Here are a couple more that should be part of every data wrangler’s toolbox…
(If you want to play along, the file I’m going to play with is a spreadsheet containing the names and locations of GP practices in England. The file can be found on the NHS Indicators portal – here’s the actual spreadsheet.)
Firstly, Google Refine. Google Refine is a cross-platform, browser based tool that helps with many of the chores relating to getting a dataset tidied up so that you can use it elsewhere, as well as helping out with data reconcilation or augmenting rows with annotations provided by separate online services. You can also use it as a quick-and-dirty tool for opening an xls spreadsheet from a URL, knocking the data into shape, and dumping it to a CSV file that you can use elsewhere. To start with, choose the option to create a project by importing a file from a web address (the XLS spreadsheet URL):
Once loaded, you get a preview view..
You can tidy up the data that you are going to use in your project via the preview panel. In this case, I’m going to ignore the leading lines and just generate a dataset that I can export directly as a CSV file once I’ve got the data into my project.
If I then create a project around this dataset, I can trivially export it again using a format of my own preference:
So that’s one way of using Google Refine as a simple file converter service that allows you to preview and to a certain extent shape the data in XLS spreadsheet, as well as converting it to other file types.
The second approach I want to mention is to use a really handy Python software library (xlrd – Excel Reader) in Scraperwiki. The Scraperwiki tutorial on Excel scraping gives a great example of how to get started, which I cribbed wholesale to produce the following snippet.
import scraperwiki import xlrd #cribbing https://scraperwiki.com/docs/python/python_excel_guide/ def cellval(cell): if cell.ctype == xlrd.XL_CELL_EMPTY: return None return cell.value def dropper(table): if table!='': try: scraperwiki.sqlite.execute('drop table "'+table+'"') except: pass def reGrabber(): #dropper('GPpracticeLookup') url = 'https://indicators.ic.nhs.uk/download/GP%20Practice%20data/summaries/demography/Practice%20Addresses%20Final.xls' xlbin = scraperwiki.scrape(url) book = xlrd.open_workbook(file_contents=xlbin) sheet = book.sheet_by_index(0) keys = sheet.row_values(8) keys = keys.replace('.', '') print keys for rownumber in range(9, sheet.nrows): # create dictionary of the row values values = [ cellval(c) for c in sheet.row(rownumber) ] data = dict(zip(keys, values)) #print data scraperwiki.sqlite.save(table_name='GPpracticeLookup',unique_keys=['Practice Code'], data=data) #Uncomment the next line if you want to regrab the data from the original spreadsheet reGrabber()
You can find my scraper here: UK NHS GP Practices Lookup. What’s handy about this approach is that having scraped the spreadsheet data into a Scraperwiki database, I can now query it as database data via the Scraperwiki API.
(Note that the Google Visualisation API query language would also let me treat the spreadsheet data as a database if I uploaded it to Google Spreadsheets.)
So, if you find yourself with an Excel spreadsheet, but no Microsoft Office to hand, fear not… There are plenty of other tools other there you can appropriate to help you get the data out of the file and into a form you can work with:-)
PS R is capable of importing Excel files, I think, but the libraries I found don’t seem to compile onto Max OS/X?
PPS ***DATA HEALTH WARNING*** I haven’t done much testing of either of these approaches using spreadsheets containing multiple workbooks, complex linked formulae or macros. They may or may not be appropriate in such cases… but for simple spreadsheets, they’re fine…