How To Create Wordcloud from a Twitter Hashtag Search Feed in a Few Easy Steps

So I was struggling for a quick hit blog post to publish today (busy:-(, but then I got a tweet from @paulbradshaw asking “Any ideas how you could make mashup showing the frequency of certain words in hashtagged tweets – e.g. tagcloud.”

Hmm – like this maybe?

create word cloud from hashtag feed

:-)

[NOTE – you need to encode the hashtag as %23 in the feed URI.]

I call this technique a screencaptutorial… (err….?!)

[UPDATE: I don’t think this hack works any more, at least not directly (I don’t think a link to the RSS feed is provided any more from the search results page). You can however construct a URL that will search for the 100 most recent tweets containing your search term(s): http://search.twitter.com/search.atom?q=SEARCHTERM&rpp=100&result_type=recent The rpp argument specifies the number of results per page, and the result_type argument gets you the most recent, rather than “most popular”, tweets. Note that the SEARCHTERM needs escaping if it’s a multi-word search phrase, for example. In this case, space characters get encoded as %20, and punctuation may also need encoding. Use this encoder set to “encodeURI” to encode the URL for you…]
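
A minimal Python sketch of that encoding step, for illustration only (the search.twitter.com endpoint above has long since been retired, and the example search terms are made up):

from urllib.parse import quote

# Build the old-style Twitter search feed URL, percent-encoding the search term;
# quote() turns "#" into %23 and spaces into %20.
def search_feed_url(term, rpp=100):
    return ("http://search.twitter.com/search.atom?q=" + quote(term, safe="")
            + "&rpp=" + str(rpp) + "&result_type=recent")

print(search_feed_url("#examplehashtag"))      # hashtag encoded as %23examplehashtag
print(search_feed_url("open university"))      # space encoded as %20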

The screen capture was made using Jing, and the white background comes from an empty text editor document exploded to fill the screen.

For more info on manipulating Twitter search feeds, see Twitter Powered Subtitles for Conference Audio/Videos on Youtube.

PS I’m not sure whether the wordle app generates a static word cloud from a feed, or a more dynamic one? (That is, does it just grab the feed contents at the time the word cloud is created and use those to generate a one-hit word cloud, or does it keep sampling the feed?) If you want a live word cloud, then a better way is to import the feed into a Google spreadsheet, publish the spreadsheet, take a CSV output from it and drop it into Many Eyes Wikified. Or create a web page of your own and generate the word cloud from the feed (maybe pulling it into the page as JSON via a Yahoo pipe, so you can get around having to use a proxy to pull the feed into the page) using a word cloud javascript library such as Dynacloud, Cloudinizr or Cloudy.
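
If you did want to roll your own, the word counting behind a live(ish) cloud is simple enough; here’s a rough Python sketch (it assumes the third-party feedparser library and the old Twitter search Atom feed, so treat it as illustrative rather than something that will still run against Twitter):

import re
from collections import Counter
import feedparser   # third-party library, assumed to be installed

# Grab the search feed (historical URL) and count word frequencies across the tweets.
feed = feedparser.parse("http://search.twitter.com/search.atom?q=%23SEARCHTERM&rpp=100")
words = []
for entry in feed.entries:
    words.extend(re.findall(r"[a-z']+", entry.title.lower()))

# The most common words, with counts - the raw material for a word cloud.
print(Counter(words).most_common(20))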

Last Week’s Football Reports from the Guardian Content Store API (with a little dash of SPARQL)

A big :-) from me today – at last I think I’ve started to get my head round this mashup malarkey properly… forget the re-presentation stuff, the real power comes from using one information source to enrich another… but as map demos are the sine qua non of mashup demos, I’ll show you what I mean with a map demo…

So to start, here’s a simple query on the Guardian content store API for football match reports:

http://api.guardianapis.com/content/search?
filter=/football&filter=/global/matchreports&after=20090314&api_key=MYSECRETACTIVATEDKEY

It’s easy enough to construct the query URI using a relative date in the Yahoo pipe, so the query will always return the most recent match reports (in this case, match reports since “last saturday”):
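
For what it’s worth, here’s a rough Python sketch of the same relative date trick, outside of Pipes (the endpoint and filter arguments are as in the query above; the API key is just the placeholder from the example):

import datetime

# Work out the date of the most recent Saturday (Monday=0 ... Saturday=5).
def last_saturday(today=None):
    today = today or datetime.date.today()
    return today - datetime.timedelta(days=(today.weekday() - 5) % 7)

url = ("http://api.guardianapis.com/content/search?filter=/football"
       "&filter=/global/matchreports&after=" + last_saturday().strftime("%Y%m%d")
       + "&api_key=MYSECRETACTIVATEDKEY")
print(url)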

It’s easy enough to use these results to generate an RSS feed of the most recent match reports:

Pulling the images in as Media RSS (eg media:group) elements means that things like the Google Ajax slide show control and the Pipes previewer can automatically generate a slideshow for you…

You can also get the straight feed of course:

A little bit of tinkering with the creation of the description element means we can bring the original byline and match score in to the description too:

Inspecting the API query results by eye, you might notice that a lot of the bylines have the form “John Doe at the Oojamaflip Stadium”:

Hmmm…

It’s easy enough to exploit this structural pattern to grab the stadium name using a regular expression or two:
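
Outside of Pipes, the same pattern match might look something like this in Python (the byline string is just the made-up example from above):

import re

byline = "John Doe at the Oojamaflip Stadium"

# Grab everything after "at" (optionally "at the") as the stadium name.
match = re.search(r"\bat (?:the )?(.+)$", byline)
if match:
    print(match.group(1))   # -> "Oojamaflip Stadium"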

I then did a little experiment running the name of the stadia, and the name of the stadia plus “football ground, UK”, through the Yahoo Location Extractor block to try to plot the stories on map locations corresponding to the football ground locations, but the results weren’t that good…

…so I tweeted:

And got a couple of responses…

The XQuery/DBpedia with SPARQL – Stadium locations link looked pretty interesting, so I tweaked the example query on that page to return a list of English football stadia and their locations:

PREFIX p: <http://dbpedia.org/property/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{?ground skos:subject <http://dbpedia.org/resource/Category:Football_venues_in_England>.
?ground geo:long ?long.
?ground geo:lat ?lat.
?ground rdfs:label ?groundname.
FILTER (lang(?groundname) ='en').
}

and created a pipe to call DBpedia with that query (dbpedia example – English football stadium location lookup pipe):
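
If you wanted to make the same call outside of Yahoo Pipes, a rough Python sketch against the DBpedia SPARQL endpoint might look like the following (the endpoint URL and JSON results format are assumptions on my part, and DBpedia’s vocabulary has changed over the years, so treat it as a sketch):

import json
import urllib.parse
import urllib.request

query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{?ground skos:subject <http://dbpedia.org/resource/Category:Football_venues_in_England>.
?ground geo:long ?long.
?ground geo:lat ?lat.
?ground rdfs:label ?groundname.
FILTER (lang(?groundname) = 'en').
}
"""

# Ask the public DBpedia SPARQL endpoint for JSON results (endpoint assumed).
url = "http://dbpedia.org/sparql?format=json&query=" + urllib.parse.quote(query)
with urllib.request.urlopen(url) as response:
    results = json.loads(response.read().decode("utf-8"))

for binding in results["results"]["bindings"]:
    print(binding["groundname"]["value"], binding["lat"]["value"], binding["long"]["value"])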

Because I don’t know how to write SPARQL, I wasn’t sure how to tweak the query to just return the record for a given stadium name (feel free to comment telling me how ;-) – so instead I used a pipe Filter block to filter the results. (This combination of search and filter can be a very powerful one when you don’t know how to phrase a particular query, or when a query language doesn’t support a search limit you want…)

It was now a simple matter to add this pipe in to geocode the locations of the appropriate stadium for each match report:

So let’s recap – we call the Guardian content API for match reports since “last saturday” and construct a nice RSS feed from it, with description text that includes the byline and match score, as well as the match report. Then we pull out the name of the stadium each match was played at (relying on the convention, which seems to hold much of the time, that the byline records the stadium) and pass it through another pipe that asks DBpedia for a list of UK football stadium locations, and then filters out the one we want.

Tweak the location data to a form Yahoo pipes likes (which means it will create a nice geoRSS or KML feed for us) and what do we get? Map based match reports:

As I’ve shown in this blog many times before, it’s easy enough to grab a KML feed from the More options pipe output and view the results elsewhere:

(Click on a marker on the google map and it will pop up the match report.)

So what do we learn from this? Hmmm – that I need to learn to speak SPARQL, maybe?!

PS @kitwallace has come up trumps with a tweak to the SPARQL query that will do the query by stadium name in one:
FILTER (lang(?groundname) ='en' && regex(?groundname,'Old Trafford')). Ta, muchly :-)

Filtering Guardian Data Blog/Google Spreadsheet Data With Yahoo! Pipes

In Using Many Eyes Wikified to Visualise Guardian Data Store Data on Google Docs I showed how to pull data from Google spreadsheets (uploaded there by the Guardian as part of their Data Store initiative) and visualise it using Many Eyes Wikified.

Unfortunately, one of the downsides of using Many Eyes Wikified is that you can’t filter the imported data or select subsets of rows from it (nor can you choose to just ignore particular columns in the visualisation editor – which would be a really handy thing to be able to do, and probably quite easy to implement?)

So for example, when looking at the RAE 2008 Data, it might be quite useful to be able to just visualise the data for a single institution, such as the Open University. But how can we do this?

One way would be to find a way of searching the data in the spreadsheet, and then only extracting the rows that contained the search term – such as “Open University” – in the institution name column. But I don’t know if that’s possible in Google Spreadsheets (though it is possible in Zoho DB, which supports all manner of queries in SQL-like dialects; but that’s for another day… ;-).

An alternative way is to grab the whole of the spreadsheet and then just filter out all the rows that don’t contain the filter term in the desired column… which is an idea that came into my mind last night on my way home, and which appears to work quite well, as this Filtering UK RAE 2008 CSV Data from Google Docs pipe shows:
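
(For anyone who prefers code to pipework, here’s a rough Python equivalent of the grab-it-all-then-filter idea; the CSV URL is a placeholder for the published Data Store spreadsheet, and the column holding the institution name is an assumption:)

import csv
import urllib.request

CSV_URL = "http://spreadsheets.google.com/pub?key=SPREADSHEETKEY&output=csv"   # placeholder URL
FILTER_TERM = "Open University"

with urllib.request.urlopen(CSV_URL) as response:
    rows = list(csv.reader(response.read().decode("utf-8").splitlines()))

header, data = rows[0], rows[1:]
# Keep only the rows whose institution name column (assumed here) contains the filter term.
matches = [row for row in data if FILTER_TERM in row[1]]
for row in matches:
    print(row)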

If your pipe imported a CSV file, as this one does (in fact, if the items being pushed out of the pipe have a y:row attribute set giving a row/count number for each item) then you can get a CSV file out of the pipe too:

Which in this case means we can filter through just the RAE 2008 data for a particular institution, grab the CSV URL for that data, and pull it into Many Eyes wikified in the same way as described before:

Here it is:

And once we have a wiki data page for it, we can visualise it – I’ve popped some examples up here: OU RAE 2008 performance.

For example, how about a bubble diagram view showing our 4* performance across the various units of assessment:

(The number is the percentage of submissions in that unit of assessment achieving the specified grade.)

Or how about this one – an interactive bar chart showing the percentages of 3* and 4* submissions in each unit of assessment:

If you look back at the pipe interface, you’ll see you can use the same pipe to pull out the data for any of the HEIs:

The pipe output CSV URI also makes this clear:
http://pipes.yahoo.com/pipes/pipe.run?_id=6d9579da965cf16e883e363eb26e9e63&_render=csv&hei=Leicester

In the next post on this topic, I’ll show you how to create a rather more interesting RSS feed from this sort of pipe, including some Google chart URIs created dynamically within the pipe…

But for now, it’s time to go foraging for dinner…

Simple Embeddable Twitter Map Mashup

Earlier today, I was pondering the Digital Planet Listeners’ map on open2.net and the #digitalplanet Twitter feed we have running alongside it:

and I started to wonder whether there was a simple way of generating an embeddable map showing the location of people tweeting given a Twitter feed.

I couldn’t find one offhand, so here’s a simple pipe that will do the job: Simple Tweetmap pipe.

Here’s how it works: start off by grabbing a feed from Twitter, such as a Twitter search feed.

Using a Twitter feed URL as an input to the pipe, grab the feed and then find the Twitter username of each individual from the user’s Twitter URL. So for example, map http://twitter.com/psychemedia onto psychemedia.

We now call on another pipe that calls the Twitter API to get personal details for each user who has a Tweet in the feed.

Here’s how that embedded pipe works: Twitter location pipe (it should really be called “Twitter User Details” pipe).

First, construct a URI that asks the Twitter API for the user details associated with a particular Twitter username (e.g. using the construction http://twitter.com/users/show/USERNAME.json), then pull the data back as a JSON feed. Finally, just make sure only a single record is returned (there should only be one anyway).
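
In code terms, the lookup the embedded pipe performs is roughly the following (a Python sketch; the unauthenticated users/show endpoint dates from when this post was written and no longer works, so it’s illustrative only):

import json
import urllib.request

# Map a Twitter profile URL onto a username...
profile_url = "http://twitter.com/psychemedia"
username = profile_url.rstrip("/").split("/")[-1]   # -> "psychemedia"

# ...then ask the (old, now defunct) users/show endpoint for that user's details.
api_url = "http://twitter.com/users/show/" + username + ".json"
with urllib.request.urlopen(api_url) as response:
    user = json.loads(response.read().decode("utf-8"))

print(user.get("location"))   # free-text location string, ready for geocoding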

So the embedded pipe passes back an annotation to the original feed with user details. One of the user details is the user’s location – so let’s geocode it:

Sending the output of the location finder to item.y:location allows the pipe to produce a well formed KML and geoRSS output, that can be displayed in a map, such as the Yahoo! Pipes output preview map:

We can grab the KML URL from the More Options output and display the feed in an embeddable Google map in the normal way (simply enter the KML URI in the Google maps search box and hit Search):

If you want to embed the map in your own page, just grab the embed code…

To summarise the hack, here’s a quick review of the whole pipe:

So now if you want to plot where people who have tagged their tweets in a particular way are tweeting from, you can do :-)

Next step is to persuade the open2 team to start archiving appropriately tagged tweets and displaying them on a cluster map view over time :-) We could maybe even link in a timeline using something like TimeMap, the MIT Simile timeline and Google Maps integration library…?

Creating Your Own Results Charts for Surveys Created with Google Forms

Just before Christmas, I volunteered to put together a questionnaire for a course on ICT (T209) that would informally and anonymously collect information from students about their mobile phone usage; the idea being that we could run similar surveys in the follow-on course (T215 Communication and information technologies) in future years and give a students’-eye view of trends in behaviour around network connectivity and mobile devices (the courses have large populations, so the results are potentially statistically significant).

The tool I suggested was Google Forms, partly because it’s likely to have some sort of longevity (for the number of respondents we expect, we would have to pay to make use of something like SurveyMonkey), partly because I couldn’t find a nice locally hosted survey tool (I did ask…).

Creating forms is easy, as this mini-tutorial suggests: Introduction to Google Forms.

As you would expect, different question types are possible:

One really nice feature of Google Forms – experimental at the time of writing – is an automatically generated summary of the results which, where possible, provides chart-based summary views for each question in the original form.

However, to view the summary charts (at the moment at least) you need to sign in with a Google account… which was one step too far for the course team who wanted the survey…

So the workaround I came up with was to create my own charts to display the results, using the data collected from the form, which is stored in a spreadsheet document.

– raw data:

– self-created charts:

(The results differ because the screenshots were captured at different times. The mix of chart styles is to provide groundwork for potential exercises regarding effective presentation and appropriate chart style selection for different sorts of data.)

So – how do you create your own results? I found the trick here:

I have a form on a spreadsheet and have a column that is populated from some checkboxes in the form. This means I have a “, ” separated list of answers for each cell in this column.

COUNTIF(column,value) seems to only count rows where the only checkbox selected was value, not where other values are also in the list in the cell.

What function(s) can I use to properly count values as produced by the checkbox option from a form in spreadsheets?

Answer: (from user ahab)
Try this method:
B2: apples,grapes,pears [results in cell B2, etc.]
B3: oranges,apples,grapes,pears
B4: apples,pears

To find the number of times “grapes” were among the selected fruits:
=ARRAYFORMULA(COUNT(IFERROR(FIND("grapes";B2:B))))
As you may be using this formula on another sheet than the sheet the submitted data is on, don’t forget to prepend the range B2:B with a sheet name, like this: 'Sheet1'!B2:B

[The formula works as follows:]
=ARRAYFORMULA(COUNT(IFERROR(FIND("grapes";B2:B))))
ARRAYFORMULA: this function allows array expressions and functions (which are cell-only oriented) to be applied to ranges; in this case it enables FIND to be applied to each row in the range B2:B. [That is, from cell B2 to the end of column B.]
COUNT: this function counts the elements of the range it is presented with – a bit as one would expect it to do ;)
IFERROR: this function evaluates its first parameter; if the first parameter evaluates OK, this value is returned; if the first parameter returns an error, an (optional) second parameter to IFERROR is evaluated. In this case there is no second parameter, so on error nothing is returned.
FIND("grapes";B2:B): this function is the first and only parameter to IFERROR. FIND looks for "grapes" in each row of the range B2:B, but is only able to do so because of the ARRAYFORMULA wrap. Without it, FIND would only be able to search the first element in the range B2:B, i.e. in B2 only. When FIND doesn’t find "grapes" it produces an error.
So the range – allowed by the ARRAYFORMULA wrap – FIND finally produces would be something like:
8
16
#NA
Because of the IFERROR function wrap the #NA error is filtered out, producing the range
8
16
The COUNT function counts these elements: 2

The approach I took to counting the results was as follows – itemise the results options by copy and pasting the results options for each question from the survey form, and then count the number of the occurrences using the formula described above, with the filter term pulled in from the corresponding label:

Just by the by, there were a couple of gotchas – for example, for questions with numerical answers, “0” matched empty cells as well as actual “0” answers – so I renamed “0” as “Zero”.
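
For what it’s worth, here’s the counting step spelled out as a Python sketch; unlike the FIND-based formula it does an exact match on each comma separated value rather than a substring search:

# Count how many comma separated answer cells contain a given option.
cells = ["apples,grapes,pears", "oranges,apples,grapes,pears", "apples,pears"]

def count_option(cells, option):
    return sum(1 for cell in cells if option in [value.strip() for value in cell.split(",")])

print(count_option(cells, "grapes"))   # -> 2
print(count_option(cells, "pears"))    # -> 3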

Data Scraping Wikipedia with Google Spreadsheets

Prompted in part by a presentation I have to give tomorrow as an OU eLearning community session (I hope some folks turn up – the 90 minute session on Mashing Up the PLE – RSS edition is the only reason I’m going in…), and in part by Scott Leslie’s compelling programme for a similar duration Mashing Up your own PLE session (scene setting here: Hunting the Wily “PLE”), I started having a tinker with using Google spreadsheets for data table screenscraping.

So here’s a quick summary of (part of) what I found I could do.

The Google spreadsheet function =importHTML("","table",N) will scrape a table from an HTML web page into a Google spreadsheet. The URL of the target web page, and the target table element, both need to be in double quotes. The number N identifies the N’th table in the page (counting starts at 0) as the target table for data scraping.

So for example, have a look at the following Wikipedia page – List of largest United Kingdom settlements by population (found using a search on Wikipedia for uk city population – NOTE: URLs (web addresses) and actual data tables may have changed since this post was written, BUT you should be able to find something similar…):

Grab the URL, fire up a new Google spreadsheet, and start to enter the formula “=importHTML” into one of the cells:

Autocompletion works a treat, so finish off the expression:

=ImportHtml("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",1)

And as if by magic, a data table appears:
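
(As an aside, and not part of the original recipe: if you find yourself outside Google Spreadsheets, the same sort of one-line table scrape can be sketched in Python with pandas, though the table index may not line up exactly with the importHTML one:)

import pandas as pd

# read_html returns a list of DataFrames, one per HTML table found in the page.
url = "http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population"
tables = pd.read_html(url)
print(tables[1].head())   # index chosen by inspection, much as with importHTML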

All well and good – if you want to create a chart or two, why not try the Google charting tools?

Google chart

Where things get really interesting, though, is when you start letting the data flow around…

So for example, if you publish the spreadsheet you can liberate the document in a variety of formats:

As well as publishing the spreadsheet as an HTML page that anyone can see (and that is pulling data from the Wikipedia page, remember), you can also get access to an RSS feed of the data – and a host of other data formats:

See the “More publishing options” link? Lurvely :-)

Let’s have a bit of CSV goodness:

Why CSV? Here’s why:

Lurvely… :-)

(NOTE – Google spreadsheets’ CSV generator can be a bit crap at times and may require some fudging (and possibly a loss of data) in the pipe – here’s an example: When a Hack Goes Wrong… Google Spreadsheets and Yahoo Pipes.)

Unfortunately, the *’s in the element names mess things up a bit, so let’s rename them (don’t forget to dump the original header row of the feed; alternatively, tweak the CSV URL so it starts at row 2). We might as well create a proper RSS feed too, by making sure we at least have a title and description element in there:

Make the description a little more palatable using a regular expression to rewrite the description element, and work some magic with the location extractor block (see how it finds the lat/long co-ordinates, and adds them to each item?;-):

DEPRECATED…. The following image is the OLD WAY of doing this and is not to be recommended…

…DEPRECATED

Geocoding in Yahoo Pipes is done more reliably through the following trick: replace the standalone Location Builder block with a Loop block, into which you should insert a Location Builder block:

yahoo pipe loop

The location builder will look to a specified element for the content we wish to geocode:

yahoo pipe location builder

The Location Builder block should be configured to output the geocoded result to the y:location element. NOTE: the geocoder often assumes US town/city names. If you have a list of town names that you know come from a given country, you may wish to annotate them with a country identifier before you try to geocode them. A regular expression block can do this:

regex uk

This block says – in the title element, grab a copy of everything – .* – into a variable – (.*) – and then replace the contents of the title element with its original value – $1 – as well as “, UK” – giving $1, UK.
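
In Python terms, the substitution that Regex block performs is roughly:

import re

title = "Cambridge"                              # example title value
title = re.sub(r"(.*)", r"\1, UK", title, count=1)
print(title)                                     # -> "Cambridge, UK"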

Note that this regular expression block would need to be wired in BEFORE the geocoding Loop block. That is, we want the geocoder to act on a title element containing “Cambridge, UK” for example, rather than just “Cambridge”.

Lurvely…

And to top it all off:

And for the encore? Grab the KML feed out of the pipe:

…and shove it in a Google map:

So to recap, we have scraped some data from a Wikipedia page into a Google spreadsheet using the =importHTML formula, published a handful of rows from the table as CSV, consumed the CSV in a Yahoo pipe and created a geocoded KML feed from it, and then displayed it in a Google map.

Kewel :-)

PS If you “own” the web page that a table appears on, there is actually quite a lot you can do to either visualise it, or make it ‘interactive’, with very little effort – see Progressive Enhancement – Some Examples and HTML Tables and the Data Web for more details…

PPS for a version of this post in German, see: http://plerzelwupp.pl.funpic.de/wikitabellen_in_googlemaps/. (Please post a linkback if you’ve translated this post into any other languages :-)

PPPS this is neat – geocoding in Google spreadsheets itself: Geocoding by Google Spreadsheets.

PPPS Once you have scraped the data into a Google spreadsheet, it’s possible to treat it as a database using the QUERY spreadsheet function. For more on the QUERY function, see Using Google Spreadsheets Like a Database – The QUERY Formula and Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets.