Getting to Grips with – Part 1: Files on the Web

Part 1 in a possible series of posts about things you probably need to understand in order to make any sense of the other posts that appear on this blog…

A lot of the mashup recipes I post involve taking stuff from one application and using it in another. So one of the first things you need to realise about the web is that you can have different sorts of documents on it.

You may not realise it, but you probably already have a notion of filetypes on a computer. That is, when you look at a file in a folder or on the desktop, you know from the icon used to represent it what application it will open in when you (double) click on it:

For example, in the above image I know that clicking on the html file will open that document in my (Flock) browser, that the PDF document will open in Acrobat, that the .dot file is a graph description for Graphviz, that the .ics file will open in iCal, and so on.

I know that, I just know it… You probably have a similar implicit understanding of what happens when you click on different icons on your computer too.

Something you may (or may not) know is that the same file can often be opened in different applications. For example, a text (.txt) file of the sort that normally opens in the Notepad application on Windows computers can also be viewed in a browser. (I also happen to know that if I click on and drag a file such as a .txt file, or an image file, from a folder and drop it onto the icon on my desktop for my web browser, or drop it onto a web page displayed in my browser, the browser will display that file/image. If you didn’t know that, try it now… I can wait…)

So what does that tell us? Well for a start, it tells us that the same file can be opened in different applications, and also that any given application may be able to open or display several different file types. Applications can also often save files in different file types (from the “Save As…” or “Export” option on the File menu, usually), or publish them to a particular URL. So for example, I can save a file in Microsoft Word as a Word document, as text (without the formatting), as HTML, and so on.
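A minimal Python sketch of that "same content, different file types" idea. The sample rows and the function names `to_csv` and `to_html` are made up for illustration; the point is only that one set of data can be written out in more than one format.

```python
import csv
import io
from html import escape

# Hypothetical sample data: the same rows are written out as two file types.
ROWS = [["name", "city"], ["Ada", "London"], ["Grace", "New York"]]


def to_csv(rows):
    """Serialise rows as CSV text, much as a 'Save As… CSV' option would."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def to_html(rows):
    """Serialise the same rows as a bare HTML table."""
    header, *body = rows
    lines = ["<table>"]
    lines.append("<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        lines.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(to_csv(ROWS))
print(to_html(ROWS))
```

The CSV version would open in a spreadsheet and the HTML version in a browser, even though both carry the same information.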

The online office applications work in a similar way – they can import files of particular file types from the desktop and also save files (using different file types) to the desktop, as the following menu options in Google Spreadsheets show:

But online applications (and increasingly, desktop applications too) can also open or import files from the web.

So for example, if you upload a Word document to the web somewhere, and give me the URL, then I can give that URL to Google docs and it will fetch the document from that URL, save it as a new Google document, and show it to me. The same is true for Google spreadsheets and Excel (.xls) or .csv documents. Upload the file to the web somewhere, give me the URL it can be downloaded from, and I can then either download that file myself, or give the URL to Google spreadsheets, which will create a new file and import a copy of the contents of your file-on-the-web into it. And if you know how, you can also republish that document, maybe using another filetype, to another URL on the web.
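To make "importing from a URL" a little more concrete, here is a rough sketch of what an application does behind the scenes, using only the Python standard library. The function name `fetch_csv_rows` and the example address are hypothetical, and the sketch assumes the file at the URL is UTF-8 encoded CSV.

```python
import csv
import io
from urllib.request import urlopen


def fetch_csv_rows(url, encoding="utf-8"):
    """Download the document at `url` and parse it as a list of CSV rows."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    return list(csv.reader(io.StringIO(text)))


# Example call (hypothetical address, not a real file):
# rows = fetch_csv_rows("https://example.org/data/report.csv")
```

Once the rows have been pulled in, the importing application holds its own copy of the data and can do what it likes with it, including publishing it again somewhere else.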

Which is where the first thing you need to know in order to understand some of the things I write about comes in: documents come as different types; different applications (whether on the desktop or the web) know how to open (or import) certain document types; and different applications can save, or export, documents in a variety of different types. And what’s more, web applications can often publish documents ‘virtually’ to a unique URL that anyone can grab the document from, often in a variety of formats (e.g. the same spreadsheet may be published to slightly different URLs for the CSV version, the XLS version, the HTML table preview version, and so on).
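As an illustration of "slightly different URLs for the same document", the sketch below builds one URL per format. The base address and the `format` query parameter are invented for the example; each real service has its own URL scheme, which you would need to look up.

```python
from urllib.parse import urlencode

# Hypothetical publishing endpoint; real services use their own URL schemes.
BASE_URL = "https://example.org/spreadsheet/abc123/export"


def export_url(fmt, base_url=BASE_URL):
    """Return the URL for one published representation of the document."""
    return f"{base_url}?{urlencode({'format': fmt})}"


# One document, three addresses, one per file type.
urls = {fmt: export_url(fmt) for fmt in ("csv", "xls", "html")}
for fmt, url in urls.items():
    print(fmt, url)
```

Whichever URL you hand to another application determines which representation of the document it receives.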

Know this, and you can start to see how information can flow around the web, being passed via a URI published by one application and then pulled in to another application. URIs become like jackplugs in an old analogue synth, allowing you to take documents published by one application and wire them into another.

“Modular Gear” by synthasy2000

Why’s this important? Well, if you know what filetypes a web application can import from a URL, what it can do with those files, and what different filetypes it can then publish to another URL, you can start to wire different applications together and pass documents automatically between them.

So for example, in my first (only?!) ‘hit’ mashup – Data Scraping Wikipedia with Google Spreadsheets – I knew that I could import some HTML into a Google spreadsheet, I knew that I could get CSV out of the spreadsheet, I knew that Yahoo Pipes could accept that CSV, I knew that I could use Yahoo Pipes to add geographical co-ordinate data to the information contained in the CSV file, I knew that Yahoo Pipes would then let me export the data in the KML format, and I knew that I could take the URL for the KML version of that file and paste it (and hence display it) in Google maps.
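The last leg of a chain like that, going from CSV to KML, can be sketched in a few lines of Python. This is not how Yahoo Pipes does it; it is just an illustration of one format being turned into another. It assumes the CSV already has `name`, `lat` and `lon` columns (so the geocoding step is skipped), and the sample row is made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def csv_to_kml(csv_text):
    """Turn CSV rows with name, lat, lon columns into a KML document string."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in csv.DictReader(io.StringIO(csv_text)):
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = row["name"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML lists longitude first, then latitude.
        coordinates = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coordinates.text = f"{row['lon']},{row['lat']}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical sample data standing in for the CSV published by a spreadsheet.
sample = "name,lat,lon\nMilton Keynes,52.04,-0.76\n"
print(csv_to_kml(sample))
```

Publish the resulting KML at a URL and a mapping application that accepts KML can pick it up from there.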

(As to how I knew that, well that’s another story…).

Or take another example: I know that Many Eyes Wikified can import live CSV data from a URL into a wiki data page, and that I can then (trivially) visualise that data in an interactive way. So this means that I know that if I can get data into a CSV form (by whatever means necessary), I should be able to visualise it in Many Eyes Wikified.

So to say it again, lesson 1 is simply this: different applications can open different filetypes from URLs, and can publish documents using different file types to different URLs. By taking the URL a document is published to, and using it in another web application as the address to import the file from, we can wire different applications together according to what we want to do to whatever was contained in the original file.

Sometimes I use the applications simply to transform a document from one filetype to another (for example, importing an XLS file into Google spreadsheets so that I can then get a CSV filetype/representation of it out that I can then use in another application). Sometimes it’s because I actually want to modify the file, or automatically annotate part of it in some way, using the functionality of a particular web application (such as adding geolocation co-ordinates to a data set in a Yahoo pipe).

But that’s as maybe… the *really* important thing (not the academic, abstract knowledge stuff…!) is knowing what formats play with what applications, and what those applications can do to, and with, the data you’re passing round…

…which is where I should probably include a matrix showing what applications work with what, and what those applications may be good for… But that’s valuable ‘off-the-top-of-MY-head’ knowledge, right, so why should I just give it away, rather than force you to spend hundreds of hours learning what works with what, or poring through the entrails of this blog looking for different web recipes, application combinations, and mashup patterns…?;-)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

3 thoughts on “Getting to Grips with – Part 1: Files on the Web”

  1. Nice one. Just the kind of approach I was hoping for. I like the jack plug analogy. I didn’t know that the data scraping wikipedia post was when you hit the big time. That’s also the first OUseful post I ever read (and dutifully followed and cloned with some self-satisfaction).
