Archive for April 2010
A recent post on the journalism.co.uk site asks: How much computer science does a journalist really need?, commenting that whilst coding skills may undoubtedly be useful for journalists, knowing what can be achieved easily in a computational way may be more important, because there are techies around who can do the coding for you… (For another take on this, see Charles Arthur’s If I had one piece of advice to a journalist starting out now, it would be: learn to code, and this response to it: Learning to Think Like A Programmer.)
Picking up on a few thoughts that came to mind around a presentation I gave yesterday (Web Lego And Format Glue, aka Get Yer Mashup On), here’s a slightly different take on it, based on the idea that programming doesn’t necessarily mean writing arcane computer code.
Note that a lot of what follows I’d apply to librarians as well as journalists… (So for example, see Infoskills for the Future – If You Can’t Handle Information, Get Out of the Library for infoskills that I think librarians as information professionals should at least be aware of (and these probably apply to journalists too…); Data Handling in Action is also relevant – it describes some of the practical skills involved in taking a “dirty” data set and getting it into a form where it can be easily visualised…)
So here we go…. An idea I’ve started working on recently as an explanatory device is the notion of feed oriented programming. I appreciate that this probably already sounds scary geeky, but it’s a made up phrase and I’ll try to explain it. A feed is something like an RSS feed. (If you don’t know what an RSS feed is, this isn’t a remedial class, okay… go and find out… this old post should get you started: We Ignore RSS at OUr Peril.)
Typically, an RSS feed will contain a set of items, such as a set of blog posts, news stories, or bookmarks. Each item has the same structure in terms of how it is represented on a computer. Typically, the content of the feed will change over time – a blog feed represents the most recent posts on a blog, for example. That is, the publisher of the feed makes sure that the feed has current content in it – as a “programmer” you don’t really need to do anything to get the fresh content in the feed – you just need to look at the feed to see if there is new content in it – or let your feed reader show you that new content when it arrives. The feed is accessed via a web address/URL.
Some RSS feeds might not change over time. On WriteToReply, where we republish public documents, it’s possible to get hold of an RSS version of the document. The document RSS feed doesn’t change (because the content of the document doesn’t change), although the content of the comment feeds might change as people comment on the document.
A nice thing about RSS is that lots of things publish it, and lots of things can import it. Importing an RSS feed into an application such as Google Reader simply means pasting the web address of the feed into a “Subscribe to feed” box in the application. Although it can do other things too, like supporting search, Google Reader is primarily a display application. It takes in RSS feeds and presents them to the user in an easy to read way. Google Maps and Google Earth are other display applications – they display geographical information in an appropriate way, a way that we can readily make sense of.
So what do we learn from this? Information can be represented in a standard way, such as RSS, and displayed in a visual way by an application that accepts RSS as an input. By subscribing to an RSS feed, which we identify by a fixed/permanent web address, we can get new content into our reader without doing anything. Subscribing is just a matter of copying a web address from the publisher’s web site and pasting it into our reader application. Cut and paste. No coding required. The feed publisher is responsible for putting new content into the feed, and our reader application is responsible for pulling that new content out and displaying it to us.
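For anyone curious about what "each item has the same structure" looks like in practice, here is a minimal Python sketch of a reader pulling items out of a feed. The feed text is a made-up stand-in for what a publisher would serve at the feed's web address, and `read_items` is just an illustrative name; a real reader application does rather more than this.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 feed (titles and links are purely illustrative).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>More news</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return each feed item as a dict with the same three keys."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in read_items(SAMPLE_RSS):
    print(entry["title"], "->", entry["link"])
```

Because every item has the same shape, the display side never needs to know who published the feed or what it is about.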
One of the tools I use a lot is Yahoo Pipes. Yahoo Pipes can take in RSS feeds and do stuff with them; it can take in a list of blog posts as an RSS feed and filter them so that you only get posts out that do – or don’t – mention cats, for example. And the output is in the form of an RSS feed…
What this means is that if we have a Yahoo pipe that does something we want in computational terms to an RSS feed, all we have to do is give it the web address of the feed we want to process, and then grab the RSS output web address from the Pipe. Cut and paste the original feed web address into the Pipe’s input. Cut and paste the web address of the RSS output from the pipe into our feed reader. No coding required.
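The point is that you don't have to write the filter yourself, but if you want a feel for what a "mentions cats" filter is doing, it is roughly the following. This is a sketch in Python, not how Yahoo Pipes is implemented; the items are made up and `filter_items` is a hypothetical helper name.

```python
def filter_items(items, keyword, keep_matches=True):
    """Keep (or drop) the items whose title or description mentions keyword."""
    keyword = keyword.lower()
    kept = []
    for item in items:
        text = (item.get("title", "") + " " + item.get("description", "")).lower()
        mentions = keyword in text
        if mentions == keep_matches:
            kept.append(item)
    return kept


# Made-up items standing in for a parsed blog feed.
posts = [
    {"title": "My cat learns RSS", "description": "Feeds for felines"},
    {"title": "Election data", "description": "Treemaps of votes"},
    {"title": "Weekend links", "description": "Cats, maps and pipes"},
]

print([p["title"] for p in filter_items(posts, "cat")])
print([p["title"] for p in filter_items(posts, "cat", keep_matches=False)])
```

Note that this is a plain substring match, so "cat" would also match "category"; a stricter filter would need to match whole words.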
Another couple of tools I use are Google Spreadsheets (a spreadsheet application) and Many Eyes Wikified (an interactive visualisation application). If you publish a spreadsheet on Google docs, you can get a web address/URL that points to a CSV (comma separated values) version of the selected sheet. A CSV file is a simple text file where each spreadsheet row is represented as a row in the CSV structured text file; and the value of each cell along a row in the original spreadsheet is represented as the same value in the text file, separated from the previous value by a comma. But you don’t need to know that… All you do need to know is that you can think of it as a feed… With a web address… And in a particular format…
Going to the “Share” options in the spreadsheet, you can publish the sheet and generate a web address that points to a range of cells in the spreadsheet (eg: B1:D120) represented as a CSV file. If we now turn to Many Eyes Wikified, I can provide it with the web address of a CSV file and it will go and fetch the data for me. At the click of a button I can then generate an interactive visualisation of the data in the spreadsheet. Cut and paste the web address of the CSV version of the data in a spreadsheet that Google Docs will generate for me into Many Eyes Wikified, and I can then create an interactive visualisation using the spreadsheet at the click of a button. Cut and paste a URL/web address that is generated for me. No coding required.
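If you do want to peek under the hood, this is roughly what an application does with the CSV it fetches. It is a minimal Python sketch: the CSV text is made up and stands in for whatever would come back from the published sheet's web address.

```python
import csv
import io

# Made-up CSV text: a header row, then one line per spreadsheet row.
CSV_TEXT = "name,colour,count\napple,red,3\nplum,purple,5\n"

# DictReader uses the first row as column names for every later row.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

for row in rows:
    print(row["name"], row["colour"], row["count"])
```

With a real published sheet, the text would be fetched from the sheet's URL first (for example with `urllib.request.urlopen`) rather than typed in as a string; note that every value arrives as text, so numbers need converting before you do sums with them.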
As to where the data in the spreadsheet came from? Well maybe it came from somewhere else on the web, via a URL? Like this, maybe?
So the model I’m working towards with feed oriented programming is the idea that you can get the web address of a feed which a publisher will publish current content or data to, and paste that address in an application that will render, or display the content (e.g. Google Reader, Many Eyes Wikified) or process/transform that data on your behalf.
So for example, Google Spreadsheets can transform an HTML table to CSV for you (Google Spreadsheets also lets you do all the normal spreadsheet things, so you could generate one sheet from another sheet using whatever spreadsheet formulae you like, and publish the CSV representation of that second sheet). Or in Yahoo Pipes, you can process an RSS feed by filtering its contents so that you only see posts that mention cats.
Yahoo Pipes offers other sorts of transformation as well. For example, in my original Wikipedia scraping demo, I took the feed from a Google spreadsheet and passed it to Yahoo Pipes where I geocoded city names and let pipes generate a map friendly feed (known as a KML feed) for me. Copying the web address of the KML feed output from the pipe and pasting it into Google Maps means I can generate an embeddable Google map view of data originally pulled from Wikipedia:
Once you start to think of the world in this way:
- where the web contains data and information that is represented in various standard ways and made available via a unique and persistent web address,
- where web applications can accept data and content that is represented in a standard way given the web address of that data,
- where web applications can transform data represented at one web address in one particular way and republish it in another standard format at another web address,
- or where web applications can take data represented in a particular way from one web address and provide you with the tools to then visualise or display that data,
then the world is your toolbox. Have URL, will travel. All you need to know is which applications can import what format data, and how they can republish that data for you, whether in a different format, such as Google spreadsheets taking an HTML table from Wikipedia and publishing it as a CSV file, or as a visualisation/human friendly display (Many Eyes Wikified, Google Reader). And if you need to do “proper” programmer type things, then you might be able to do it using a spreadsheet formula or a Yahoo Pipe (no coding required…;-)
What OU Facebook App would you like to see? Here’s a chance to get it made…
Two and a half years ago, as part of an informally convened skunkworks team, we released a couple of Open University related Facebook apps inspired, in part, by observing student behaviour in online course forums.
Current privacy fears aside (?!;-), the apps are getting another push as part of an announcement about a “Design an OU Facebook App” Competition (Share your Facebook app ideas for chance to win OU vouchers):
The rules are simple: tell us what new app you’d like to see us build on Facebook.
And in return? For the winner of our competition, which runs until June 8, 2010, there’s £100 course vouchers, and, even more exciting perhaps, the chance to see your app built, with your name on the developer credits.
Although there are only a handful of ideas posted so far, it’s interesting to see how some of them already tally with ideas we had for additional apps at the time Course Profiles and My OU Story were built. (Liam, maybe we should dig out the old email exchanges we had bouncing round new app ideas, and submit them to the current competition?! Heh heh;-)
The competition is being managed through an online suggestion-and-voting system that appeared on the OU Platform site earlier this year and which is being used to solicit ideas for new courses from anyone who registers on the site (Platform is open to all, not just OU students, staff, and alumni).
The Platform team seem to have really got into the idea of competitions, so presumably it works as a marketing exercise. Just out of interest, are there any other HEIs out there that run competitions in a similar way?
Another week, another presentation… I dunno about death by Powerpoint on the audience side, but even though I’ve started finding ways of reusing slides, it still takes me forever and a day (well, 4-6 hrs), to put a slide deck together… One day – one day – I’ll have to produce a presentation I can just give over and over again… ;-)
Anyway, here are slides for a presentation I’m due to give tomorrow (Thursday) at the University of Portsmouth. The plan is for a 1 hr “lecture”, and a 1 hr hands-on workshop session. The slides are for the talk – but also set the scene for the practical activity…
So what’s the practical? (For anyone reading this in advance of attending the session, I suggest you get yourself sorted with accounts for Google/Google Spreadsheets, Yahoo/Yahoo Pipes and IBM/Many Eyes and Many Eyes Wikified.) As time is tight, I suggest the best way in is to just try recreating some of the demos shown in the presentation above, and then going from there…. A good alternative would be to start working through this intro to Yahoo Pipes:
For the more adventurously minded, looking through the pipework category on this blog might provide a little more inspiration…
If it’s Google Spreadsheets hacks you’re after, searching for Google spreadsheet import formula should turn up some example posts…
Many Eyes and Many Eyes Wikified demos can be found by searching for “Many Eyes”.
I was hoping to put together a couple of rather better structured self-paced workshop examples, but I’m afraid I’ve run out of time for today…:-(
A couple of weeks ago I did a phone interview for the OU’s DISCO project – OU Digital Scholarship Portal. From what I remember of the call, it rambled over many and varied topics, including possible metrics that might be taken into account when putting together promotion cases that include a demonstration of excellence in digital scholarship (whatever that is…).
Anyway, today I wasted a day – a whole day – updating my CV and writing stuff that seems to be the wrong stuff for an OU promotion case. Ever the reflective sort(?!), here are some observations I came away with:
- Slideshare is my presentation memory; I need to get in the habit of recording the date and event a presentation is for when I upload it to make it easier to list the presentations I’ve given. Alternatively, it might make sense to use a calendar to record the dates and events I’ve spoken at and then use the iCal feed to display the result;
- not writing formal academic papers means I have nothing to cite that t’committee would accept as credible. However, I have given quite a few interviews over the last couple of years to folk writing formal reports, doing research projects, or writing books. I’ve also participated in a few Delphi exercises and attended invitation only workshops and brainstorming sessions, as well as being invited to speak at events folk pay money to attend. Here’s part of what I wrote on this topic in my draft case: I have all but given up on formal academic publishing, in favour of short-form informal blog posts, occasional articles, and interviews for people who are writing long-form pieces (books, reports) which typically offer a greater or more immediate reach than scholarly articles in refereed journals, or benefit from a greater impact or better targeted audience than I could personally reach. The problem? That whilst I regularly participate in interviews and conversations with people writing official reports, books, etc as well as participating in Delphi Exercises[,] I’m not very good at keeping records of these or tracking down citations…
What occurs to me, then, is that I am more interested in direct or immediate communications of ideas as part of an ongoing process of learning and discovery (as part of a conversation, to use that well-worn and increasingly pointless phrase…) rather than archiving ideas for the record. (This also reflects my cynical attitude that the majority of stuff that appears in the formal record is not, to my mind, a contribution to anything other than the bulk of a journal sold for profit…)
If I’m going into the archive, someone else can put me there… But for the promotion case, acknowledgements are the lowest of the low in terms of academic credibility, rivaled only by (pers comm). Which is a shame – because one of the quotes I carry with me (but unfortunately can’t credit because I can’t for the life of me remember who said it, except that it was someone from outside the OU giving a seminar in the OU) is that the whole point of being an academic is to have interesting conversations.
Anyway, the reason why I started to write this post is this: if the digital scholarship folks want metrics around how effective a scholar’s online activities are, it may be worth looking at tangible outcomes in the real world – such as invitations (e.g. to speak at seminars and workshops) and acknowledgements (e.g. in books, articles and reports). This conversion from informal online activity to a formal request in physical space is where the “citation” is evidenced.
And as Stephen Downes writes in a recent Half an Hour post:
By sharing my work freely, people around the world are able to see it, and they willingly pay for me to come and speak to them. I do not collect speaker fees, but I do require that they pay my expenses, because otherwise I could not afford to travel to their cities. We both benefit, because I then use these trips to produce work that we share with other people around the world, and the cycle continues.
You might think, it’s not a very good deal for some organization to pay several thousand dollars to fly me to their city. But consider the cost were they to buy books from me instead. They could get maybe 30 or 40 copies of an academic text for the same amount. This way, they get all my content I ever create for free, as many copies as they would ever need. [Paying For Art]
If the point of publishing is to communicate ideas, then presentations count. And if the refereeing process is to guarantee quality, then being given an invitation to speak also reflects reputation brownie points and an element of trust on the part of the person responsible for extending the invitation, even if they are not explicitly evaluating the actual content of a presentation a priori.
As to the benefits accruing to Stephen’s employer: “[t]hey get the reputation from sponsoring my work” as well as influencing whatever he is working on.
I’m not sure what metrics Stephen uses if he goes through an annual staff development/appraisal cycle (I thought I’d read something he’d written on this before, but I can’t find it if he did…?) but it would be interesting to see them…
PS today has been a crap day. The only enjoyable part has been this bit – thinking about how I might be able to build a living CV… Paraphrasing Fermat, if I didn’t have to walk the dog just now, I’d have been able to build the neatest little demonstration site for this, which would include parsing the events out of my CV into a spreadsheet, and then using my Maintaining a Google Calendar from a Google Spreadsheet recipe to get them into a calendar;-)
Over the weekend, a bit of random googling with filetype:csv OR filetype:xls limits turned up an archive of what looked like a complete set of UK general election results from 2005 on the Electoral Commission website.
(By the by, I also came across the National Digital Archive of Datasets (NDAD), which looks like it could be interesting: “The National Digital Archive of Datasets (NDAD) preserves and provides online access to archived digital datasets and documents from UK central government departments. Our collection spans 40 years of recent history, with the earliest available dataset dating back to about 1963.”)
Anyway, the election results data was presented in a CSV file with a format that looked like:
The first thing I wanted to try to do with it was to produce an election votes treemap (e.g. along similar lines to Work In Progress – Voting Treemaps). The target visualisation tool was Many Eyes, simply because it’s quick and easy to use (I think a little bit of setting up is still required to use the wrapper I’ve placed around the JIT treemaps…:-(
In order to use the Many Eyes treemap to its full effect, I needed each row to start with the constituency name. Looking at the above screengrab of the Electoral Commission data, we see that the constituency is only explicitly stated for the winning candidate. So what’s the best way of filling in the blanks?
In a couple of recent talks, I’ve made the claim that library folk should be in the business of teaching effective data handling skills, and this is a good case in point. So what would you do?
The naive response is to go into the spreadsheet and by hand click and drag each separate constituency name into the blanks. With over six hundred constituencies, that should take you how long..?!
A few rather more efficient ways came to my mind. Firstly, use a Python script to read in the CSV file, check to see if the constituency field is populated, and if it isn’t, populate it with the constituency name from the previous row:
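    import csv

    filein = 'candidates_49576.csv'
    fileout = 'election2005.csv'

    last = ''  # most recent constituency name seen so far

    # newline='' lets the csv module handle line endings itself (Python 3)
    with open(filein, newline='') as file_obj, open(fileout, 'w', newline='') as fout:
        csv_file = csv.writer(fout)
        reader = csv.reader(file_obj,
                            delimiter=',',          # Your custom delimiter.
                            skipinitialspace=True)  # Strips whitespace after delimiter.
        for line in reader:
            if line:  # Make sure there's at least one entry.
                # Assumes the constituency name is in the first column.
                if line[0] == '':
                    line[0] = last  # blank: copy the name down from the row above
                else:
                    last = line[0]
                csv_file.writerow(line)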
As ever, most of the above code isn’t mine, it’s stolen (in this case, from Stack Overflow).
The second approach I thought of was to use a Google Apps script… (this exercise is left to the reader… feel free to post a copy of the script, or a link to a spreadsheet containing it, in the comments;-)
The third approach was to use a spreadsheet formula and create a new, fully populated constituency column. Here’s one formula that I thought of:
That is, if the cell to the left is empty, copy the contents of the cell above…
So now I had a copy of the data in the form I needed, I could just copy and paste it from a spreadsheet into Many Eyes
You can see a copy of a derived treemap here. The view below uses colour to show the constituencies where the winning candidate won more than half of the valid votes cast in that constituency:
The next thing I wanted to do was to look at the actual number of votes a winning candidate beat the second placed candidate by, and also view this as a percentage of the total number of valid votes cast in that constituency. To do this, I created three new columns:
- one containing the total number of valid votes cast;
- one containing the number of votes the winning candidate beat the second placed candidate by;
- one containing the number of votes the winning candidate beat the second placed candidate by divided by the total number of valid votes cast, given as a percentage.
So how would you calculate those items?
Go on – think about it – how would you do it?
Here’s how I did it…
To count the total number of votes cast in a constituency, I used the formula:
(Note that before yesterday, I’d never heard of this formula… I just went googling for help on something like excel sum similar items…;-)
Select and drag this cell to populate the column…
For the vote majority formula, I only wanted this to apply to the winning candidate, so I put a trap in to see whether the constituency from the previous row is the same as the current row. If it isn’t, then because I assume the first candidate is the winning candidate, I can find the difference between the current row’s votes and that of the second placed candidate, which is immediately below:
Once again, select and drag this cell to populate the column…
I guess I really should also trap just to check that the row below is in the same constituency in case someone (the Speaker?) stands unopposed:
The percentage value is found simply by dividing the absolute number of votes majority by the total number of valid votes cast and multiplying by 100.
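For comparison, here is how the same three derived values might be computed in a few lines of Python rather than with spreadsheet formulae. This is only a sketch: the rows and numbers are invented, and it assumes (as I did in the spreadsheet) that the winning candidate is listed first within each constituency, with the second placed candidate immediately below.

```python
from collections import defaultdict

# Invented rows: (constituency, candidate, votes), winner first in each group.
rows = [
    ("Anytown", "A", 20000),
    ("Anytown", "B", 15000),
    ("Anytown", "C", 5000),
    ("Otherville", "D", 12000),
    ("Otherville", "E", 11000),
]

# Total valid votes cast in each constituency.
totals = defaultdict(int)
for constituency, _, votes in rows:
    totals[constituency] += votes

results = []
for i, (constituency, candidate, votes) in enumerate(rows):
    # A winner is the first row of a new constituency.
    is_winner = i == 0 or rows[i - 1][0] != constituency
    # Trap the unopposed case: the row below must be in the same constituency.
    has_runner_up = i + 1 < len(rows) and rows[i + 1][0] == constituency
    if is_winner and has_runner_up:
        majority = votes - rows[i + 1][2]
        pct = 100.0 * majority / totals[constituency]
        results.append((constituency, candidate, totals[constituency], majority, pct))

for r in results:
    print(r)
```

An unopposed candidate simply produces no majority row here, which is one way of handling that trap; you might prefer to report them separately.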
I uploaded the data to Many Eyes here.
So now we can look to see the number of votes majority an individual had, for example, compared to the turn out:
This chart brings two other things to mind. Firstly it shows all the candidates – it would be more convenient in this case to just show the data for the winning candidates. Secondly, it suggests that we might also plot the number of votes behind the winning candidate each of the other candidates was… but that’s for another day…
What I will do though is consider the first point, specifically how we might just look at the data for the winning candidates… One way of achieving this is to use Google Spreadsheets as a database using the Google spreadsheets API as described in First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground.
The query also generates a link to a CSV version of the data. It’s not hard then to pop the data into Many Eyes so now we can visualise the winning MP majorities more easily – here we see the votes ahead of the second placed candidate vs. the valid number of votes cast:
We can also use the Datastore explorer to interrogate the database. For example, which MPs are defending a slim majority in terms of absolute number of votes?
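The question being asked of the data here is essentially "select the winners whose majority is below some limit, smallest first". As a rough illustration of that logic in Python (not the Datastore explorer's own query, and with an invented winners table and a hypothetical `slimmest` helper):

```python
# Invented winners table: (constituency, winner, majority in votes).
winners = [
    ("Anytown", "A", 5000),
    ("Otherville", "D", 1000),
    ("Thirdham", "F", 37),
]


def slimmest(winners, limit):
    """Return the winners whose majority is below limit, smallest first."""
    slim = [w for w in winners if w[2] < limit]
    return sorted(slim, key=lambda w: w[2])


for constituency, winner, majority in slimmest(winners, 2000):
    print(constituency, winner, majority)
```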
Okay, that’s more than enough for now – this was only supposed to be a half hour post… What I hope this post has done, though, is show how I mix and match tools and techniques as I wander around a data set, as well as happily trawling the net to look for solutions to problems that I’m sure someone else must have solved before (such as summing a number of separate items labeled the same way in an Excel spreadsheet).
PS Somewhere along the line, I think when looking at majorities, a couple of points had negative values – the visualisation had turned up missing rows, like this one:
One of the things I experimented with a long time ago was the ability to use an RSS feed to power a presentation (e.g. Feedshow Link Presenter – Testing Audience Synch). The idea was that you should be able to bookmark a set of webpages using a social bookmarking tool, and then take the RSS feed of those bookmarked links and use it to drive a presentation; the presentation itself would be a walkthrough of the bookmarked pages.
Anyway, not so long ago, the Delicious social bookmarking service started to offer a similar service: Browse These Bookmarks, so that’s all well and good..:-)
One of the things that I had on the Feedshow Presenter to do list (and I’m not sure whether I ever coded this up or not) was to be able to display any description text saved with a bookmark before showing the contents of the bookmarked page. This in turn suggests another way of using a feed powered presentation tool – as a vehicle simply for displaying text elements from an RSS feed in a presentation like format.
Now I know that death by Powerpoint is often brought on by text heavy slides, but sometimes you may need to chat around text; and sometimes, splashing the text you want to talk around on a screen might be a handy thing to do…
So what I’m thinking is, if you want to talk around a document, then maybe talking around the document at a paragraph level is a handy thing to be able to do.
And a good place to find paragraph level feeds is from something like WriteToReply…
So for example, if you look at the Digital Economy Act on WriteToReply, and go to a particular page such as http://writetoreply.org/deact/subscriber-appeals/, you can get an RSS feed from that page by constructing a URL of the form http://writetoreply.org/deact/feed/paragraphlevel/subscriber-appeals/
(Note there’s a minor glitch at the moment – the title of the feed itself is incorrect…)
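The URL rewriting involved is simple enough to sketch in a few lines of Python. The function name `paragraph_feed_url` is made up for illustration, and it assumes the page address always has the shape /document/page/ shown in the example above; other WriteToReply URLs may not follow that pattern.

```python
from urllib.parse import urlsplit, urlunsplit


def paragraph_feed_url(page_url):
    """Turn a WriteToReply page URL into its paragraph level feed URL.

    Assumes the path has the shape /<document>/<page>/, as in the example.
    """
    parts = urlsplit(page_url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 2:
        raise ValueError("expected a URL of the form /<document>/<page>/")
    document, page = segments
    feed_path = "/{}/feed/paragraphlevel/{}/".format(document, page)
    return urlunsplit((parts.scheme, parts.netloc, feed_path, "", ""))


print(paragraph_feed_url("http://writetoreply.org/deact/subscriber-appeals/"))
```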
So to discuss each paragraph in turn from that page, all we need to do is view the feed in Google Reader Play.
To make things easier, I’ve created a couple of bookmarklets (bootstrapped from my “Get Current URL” pattern bookmarklet generator).
Firstly, given any RSS feed, here’s a bookmarklet for viewing it in Google Reader Play:
Secondly, given a page from a document hosted on WriteToReply, (not the RSS feed – the actual page; such as this one) this bookmarklet will construct the paragraph level page feed and pass it to Google Reader Play:
So there you have it, another way of supporting discussion around documents hosted on WriteToReply :-)