Archive for August 2010
Who’s Using Mendeley in Your Institution?
In a couple of days, I’ll be at RepoFringe 2010, where the emphasis this year will be on “OPEN: Open Data; Open Access; Open Learning; Open Knowledge; Open Content; etc…” I’m writing this post (a scheduled post, written over the weekend) in advance of putting my presentation together – so I’m not sure what it’ll be on yet! – but to get myself into the swing of things I started looking at what some of the repository bloggers have been thinking about lately, with a view to maybe doing a quick hack inspired by one or more of their posts…
…and it didn’t take long to find an itch to scratch… In a couple of recent posts looking at the extent to which personal document and metadata collections using apps like Mendeley might be seen as a figure:ground complement to a repository, (Comparing Social Sharing of Bibliographic Information with Institutional Repositories, More on Mendeley and Repositories), Les Carr started to explore “the extent of Mendeley’s penetration into a University. What is visible is the public profiles that Mendeley users have created. Although the Mendeley API doesn’t allow searching for users, I have been able to identify 53 public profiles from the University of Cambridge through Google (and a lot of manual verification!)” [my emphasis].
Hmmm… Sounds like that was a bit of a chore… can we finesse an API for that, I wonder?
To see how I put this Pipe together, let’s see what Google gives us (I’m limiting the search to mendeley.com because that’s where I want to find the profiles):
Note that there are several useful things we can spot simply from inspection of the Google search results:
- user profile information on Mendeley is located on URLs of the form www.mendeley.com/profiles/, so we can refine the site: search limit to take that into account (i.e. by using the limit site:www.mendeley.com/profiles/);
- the institution name, if appropriately declared, appears in the page title, which gives the headline search result in Google results listings; so we can use search limits of the form intitle:”cambridge university”, or the more general intitle:cambridge;
- sometimes (not shown in the image above), our search term appears in the title, but it’s the wrong one… So for example, if we have researchers in “Cambridge Massachusetts”, we may want to exclude results with Massachusetts in the title by negating an intitle limit: -intitle:massachusetts
Putting those techniques together, and to test things out, we should be able to search for members of our institution using something like: site:www.mendeley.com/profiles/ intitle:”cambridge” -intitle:massachusetts
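Outside of Pipes, the same query string could be assembled in a few lines of Python. This is only a sketch of the idea: the function name and its parameters are made up for illustration, and it simply joins the site:, intitle: and -intitle: limits described above.

```python
def build_profile_query(include_terms, exclude_terms=(), site="www.mendeley.com/profiles/"):
    """Assemble a Google query that limits results to profile pages by title.

    The function name and arguments are illustrative, not part of any API.
    """
    parts = ["site:" + site]
    for term in include_terms:
        # Quote multi-word terms so they are matched as a phrase in the title
        value = '"%s"' % term if " " in term else term
        parts.append("intitle:" + value)
    for term in exclude_terms:
        value = '"%s"' % term if " " in term else term
        parts.append("-intitle:" + value)
    return " ".join(parts)


query = build_profile_query(["cambridge"], ["massachusetts"])
print(query)
```

Keeping the inclusions and exclusions as separate lists makes it easy to add further negative limits as new false positives turn up.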
What else can we learn just by looking at the search results?
- if somebody’s surname matches the institution name, that may be returned as a result (e.g. Darren Cambridge). If we inspect the title, we see it has a regular structure: Name – Institution. Having got the results from Google, if we strip the name out of the title to leave just the affiliation, and then filter the results again to check that the search term appears in the affiliation, we can remove these false positives. (I have used this “double dip” search-then-filter approach in other contexts. For example, Paragraph Level Search Results on WordPress Using Digress.it and Yahoo Pipes.)
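The “double dip” search-then-filter step might look something like the following in Python. It is a rough sketch that assumes each result carries a title of the form Name – Institution; the sample results are invented, and the function names are my own.

```python
import re


def affiliation(title):
    """Return the part of a 'Name – Institution' title after the first spaced dash."""
    # Split once on a dash (en dash, em dash or hyphen) that has whitespace either side,
    # so hyphenated names are left intact
    parts = re.split(r"\s+[\u2013\u2014-]\s+", title, maxsplit=1)
    return parts[1] if len(parts) == 2 else ""


def filter_by_affiliation(results, term):
    """Keep only results whose affiliation (not the person's name) mentions the term."""
    term = term.lower()
    return [r for r in results if term in affiliation(r["title"]).lower()]


# Hypothetical search results, for illustration only
results = [
    {"title": "Jane Example – University of Cambridge"},
    {"title": "Darren Cambridge – Example Institute"},
]
kept = filter_by_affiliation(results, "cambridge")
print([r["title"] for r in kept])
```

The second result matches the original search because of the surname, but drops out once only the affiliation part of the title is tested.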
We’re now in a position to build a Yahoo pipe to create some sort of API to provide a Mendeley status search. A good way of getting Google search results into the Pipes environment is to use Google’s AJAX search API. The Google AJAX search API returns either four or eight results at a time, along with an indication of how many other “pages” of results there are, and the index of the first result on each page. (So for example, on the first page, the index of the first result is 0. For pages with four results, the index of the first result on the second page is 4, 8 on the third, and so on.) The first results page is complete – we actually get the search results listed. For the other results pages, the API provides only a list of the pages available and the index of the first result on each one. To call results from the later pages, given the index of the first result, we use the additional URL argument &start=index.
The first part of the Pipe constructs the AJAX API URL. The user inputs are a bit of a fudge (and the result of a bit of trial and error!) to try to support as clean a results set as possible by virtue of how we phrase the search query…
http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=large
&q=site:www.mendeley.com/profiles/+intitle:cambridge+-intitle:massachusetts+-postdoc
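For anyone who would rather build that URL in code than in a pipe, here is a minimal Python sketch. The function name is hypothetical; the endpoint and the v, rsz, q and start arguments are the ones described in this post. The code only constructs the URL and does not call the service.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://ajax.googleapis.com/ajax/services/search/web"


def build_search_url(query, start=None):
    """Build an AJAX search API URL for a query, optionally for a later results page."""
    params = [("v", "1.0"), ("rsz", "large"), ("q", query)]
    if start is not None:
        # Index of the first result on the page we want
        params.append(("start", str(start)))
    # Leave ':' and '/' unescaped so the URL reads like the hand-built one
    return API_ENDPOINT + "?" + urlencode(params, safe=":/")


url = build_search_url(
    "site:www.mendeley.com/profiles/ intitle:cambridge -intitle:massachusetts -postdoc"
)
print(url)
```

Spaces in the query are encoded as + signs, which is the same form as the URL shown above.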
Here’s where we construct the URL, and then fetch the data:
Just by the by, we can use another (unsaved) pipe to act as a previewer for AJAX search results:
If we want our pipe to display the results from all the pages, we need to grab the list of responseData.cursor.pages and then generate the “more results” page for each one. So, grab the list of page and first result index data:
and then create a URL for each of these, before grabbing the results from each results page:
Note that we are using the same query string that we used in the original search. (Also note that we only seem to get at most 64 responses; maybe the page list for pages later than the first page provide indices for more results? That is, maybe each results page only lists at most indices for 8 pages of results?)
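The paging step can be sketched in Python along these lines. The responseData.cursor.pages path and the start argument are as described above; the "start" and "label" keys inside each page entry, and the sample response itself, are my assumptions about the shape of the data rather than a documented example.

```python
from urllib.parse import urlencode


def page_urls(base_url, response):
    """Return one URL per results page listed in the response cursor."""
    pages = response["responseData"]["cursor"]["pages"]
    # Re-use the original query URL and add the index of each page's first result
    return [base_url + "&" + urlencode({"start": page["start"]}) for page in pages]


# Hypothetical, trimmed-down response for illustration; only the cursor is filled in
sample_response = {
    "responseData": {
        "results": [],
        "cursor": {
            "pages": [
                {"start": "0", "label": 1},
                {"start": "8", "label": 2},
                {"start": "16", "label": 3},
            ]
        },
    }
}

base = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&rsz=large&q=intitle:cambridge"
for page_url in page_urls(base, sample_response):
    print(page_url)
```

Each generated URL would then be fetched in turn and the results lists joined together, which is what the loop in the pipe does.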
Having got the search results, we rename the results attributes to generate valid RSS elements (title, description, link) and do a spot of post filtering.
Remember the case where the search result appeared because the institution name was actually someone’s surname? The Regular Expression block strips out the Mendeley user’s name and allows us to filter the results on just the affiliation, to remove those false positives (the filter lets through results where the institution search term appears in the title’s affiliation part):
The resulting pipe allows us to search for Mendeley users by institution:
(Having built the pipe, I think that an even more robust approach might be to tokenise the search terms required in the title and then add them as separate intitle: limits. So for example intitle:cambridge intitle:university would find pages where both Cambridge University and University of Cambridge appear in the page title.)
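As a quick illustration of that tokenising idea, here is a small Python sketch (the function name is made up) that turns a phrase into one intitle: limit per word:

```python
def tokenised_intitle(phrase):
    """Turn a phrase into one intitle: limit per word, so word order no longer matters."""
    return " ".join("intitle:" + word.lower() for word in phrase.split())


limits = tokenised_intitle("Cambridge University")
print(limits)
```

Because each word becomes its own limit, the same query matches titles whichever way round the institution name is written.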
So that’s the pipe…
In many ways, it implements some sort of “stalker pattern” based on profile information that is released via title elements on personal profile pages. I’ve demonstrated a similar approach previously in A Couple of Twitter Search Tricks, which shows (courtesy of an update I added after a tweet from @daveyp) how to do a similar sort of search to find folk twittering with a university allegiance. In fact, here’s a pipe to do something that approximates to just that – Twitter profile search (via Google and Yahoo Pipes):
A quick scout round other social networks shows that this is a trick we can use widely:
- site:uk.linkedin.com/in intitle:smith (try using different country codes for the subdomain to search different countries)
- site:facebook.com intitle:smith
- site:myspace.com “milton keynes” intitle:”25- Female” (the original demo, hence “the stalker pattern” epithet!)
Unfortunately, it’s not obvious how to search for anything other than name on Slideshare, or Scribd (that is, there is no obviously easy way of searching for members of an institution on Slideshare). This in turn suggests to me that if you are developing a site with a social element, and you want people to be able to use things like Google search to finesse additional, structured search functionality over the site (as in the Mendeley user profile search pipe), you should design title elements with all due consideration…
PS in his original post, Les Carr went on: “Incredibly, only TWO of those 53 researchers have any existing deposits in Cambridge’s institutional repository.” So maybe the next step would be to build some pipework to run Mendeley discovered users against corresponding institutional repositories?;-)
Using Graphviz to Explore the Internal Link Structure of a WordPress Blog
In The Structure of OUseful.Info, I showed how it was possible to extract an autopingback graph from a WordPress blog (that is, the graph that shows which of the posts in a particular WordPress blog link to other posts in that blog), illustrating the post with a visualisation of linkage within OUseful.info generated using Gephi.
What I didn’t do was post any examples of the views that we can generate in Graphviz – so here are a couple generated without additional flourishes from a simple statement of links between posts.
Firstly, we see a series of posts relating to WriteToReply, and commentable documents:
In the following example, we see a series of self-contained posts on Library Analytics:
Note that from the original library-analytics-part-1 post, we can see how two strands developed around this topic (remember, arrows typically go from a more recent post to an earlier one; that is, the links typically go from a new post to one that already exists…)
Here’s another set of posts on a topic – this time privacy and Facebook:
(The bidirectional linkage arose from me editing the body of a pre-existing post with a link to a later one.)
One thing I haven’t explored yet is the groupings that arise from an analysis of the tags and categories I used to annotate each post. But what the above shows is that even in the absence of tags and categories, link structure may also be used to aggregate posts on a particular topic, and allow clusters of blog posts, or partitions containing link related posts, to be easily identified – and extracted – from the blog…
…and my supposition is that this sort of structure might be used to facilitate value adding navigation structures…
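One way of pulling those clusters out programmatically is to treat the links as an undirected graph and collect its connected components. The following Python sketch assumes the links are available as rows of "from","to" post slugs in a CSV file; the slugs in the sample are invented and the function name is my own.

```python
import csv
import io
from collections import defaultdict


def link_clusters(edges):
    """Group posts into clusters of link-related posts (ignoring link direction)."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen, clusters = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Walk outwards from this post, collecting everything reachable from it
        cluster, todo = set(), [start]
        while todo:
            post = todo.pop()
            if post in cluster:
                continue
            cluster.add(post)
            todo.extend(neighbours[post] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical edge list in a "from","to" CSV layout; the slugs are made up
sample = io.StringIO('"post-c","post-b"\n"post-b","post-a"\n"post-y","post-x"\n')
edges = [tuple(row) for row in csv.reader(sample)]
clusters = link_clusters(edges)
for cluster in clusters:
    print(cluster)
```

Each cluster is a candidate “series” of posts on a topic, which is the sort of grouping a navigation widget could be built around.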
The Structure of OUseful.Info
A blog is just so many blog posts, right? Wrong… it also has the potential to be full of structure. One of the things I try to do in OUseful.info is post links not only to related third party sites, but also back to previous blog posts on OUseful.info to provide additional information, context or explanations that add value to the current post. Through the magic of trackbacks/pingbacks, the WordPress platform notices when I link to one OUseful.info post from another, and adds a trackback/pingback style comment to the linked-to post referring back to the post that included the link (got that!?;-)
So if I add a link whilst writing post B to post A, a pingback style comment will be added to post A saying that post B mentioned it.
If we take an export dump of a WordPress blog, we can search through it to identify each post, and each trackback/pingback comment:
We can then create a file that defines each blog post as a network node, and each pingback as an edge connecting two nodes (so if blog post A links to post B, we draw an edge from node A to B).
Here’s a cobbled together Python script to do just that (as a gist):
from xml.dom import minidom
# based on http://code.activestate.com/recipes/551792-convert-wordpress-export-file-to-multiple-html-fil/
infile = 'wpexport.xml'
dotfile = 'internalstructure.dot'
csvfile = 'internalstructure.csv'
def text_of(parent, tag):
    # text of the first matching child element, or '' if it is missing or empty
    els = parent.getElementsByTagName(tag)
    if els and els[0].firstChild:
        return els[0].firstChild.data
    return ''
dom = minidom.parse(infile)
f = open(dotfile, 'w')
f2 = open(csvfile, 'w')
f.write('digraph blogstruct{\n')
for node in dom.getElementsByTagName('item'):
    post = dict()
    post['title'] = text_of(node, 'title')
    post['date'] = text_of(node, 'pubDate')
    post['link'] = text_of(node, 'link')
    for comment in node.getElementsByTagName('wp:comment'):
        commentInfo = dict()
        commentInfo['type'] = text_of(comment, 'wp:comment_type')
        commentInfo['url'] = text_of(comment, 'wp:comment_author_url')
        commentInfo['date'] = text_of(comment, 'wp:comment_date')
        # only keep pingbacks that come from other posts on this blog
        if commentInfo['type'] == 'pingback' and commentInfo['url'].find('http://blog.ouseful.info') != -1:
            # node IDs are the last path segment (the slug) of each post URL
            cID = commentInfo['url'].strip('/').rpartition('/')
            rID = post['link'].strip('/').rpartition('/')
            # the edge runs from the linking post to the post it links to
            f.write('"' + cID[2] + '"->"' + rID[2] + '"\n')
            f2.write('"' + cID[2] + '","' + rID[2] + '"\n')
f.write('}')
f.close()
f2.close()
Note that the WordPress export file seemed to be incomplete (the Python parser didn’t like it…) – I had to add the Atom namespace definition: xmlns:atom=”http://www.w3.org/2005/Atom/”
In the above snippet, I generate two sorts of output file – a dot file for use with Graphviz, and a CSV file that can be loaded in to Gephi. The code is also customised for only showing pingbacks with the blog.ouseful.info domain – to use the code on your own blog you’d have to tweak that bit…
Anyway, here’s a glimpse of the structure of the internal pingback links from OUseful.info visualised using Gephi and a Force Atlas layout:
You can see that the blog as a whole contains a fair amount of structure. For sure, there are some posts that are only linked to by one other post and appear to “float” with respect to the rest of the blog posts (unlinked posts are not identified as nodes by my trackback graph script); but there are also long chains of posts that suggest I have developed an idea over multiple posts…
When I get a chance, I’ll have a go at using some of Gephi’s network analysis tools on this graph, but for now – back to the Bank Holiday:-)
See also: Visualising CoAuthors in Open Repository Online Papers, Part 1 and Visualising CoAuthors in Open Repository Online Papers, Part 2, as well as Emergent Structure in the Digital Worlds Uncourse Blog Experiment and Uncovering a Little More Digital Worlds Structure
Slideshare Stats – Number of Views of Your Recent Slideshows
Yesterday morning, I wanted to grab hold of a summary of the number of views my uploaded presentations on Slideshare have had. A quick scan of the Slideshare API suggests that a bit of a handshake is required, at least in generating an MD5’d hash of a key with a Unix timestamp. I have a pipe that does something similar somewhere (err, or at least part of it… here maybe).
I didn’t have the 10 minutes or so such a pipework hack should take (i.e. half an hour, just in case, plus up to half an hour to blog any solution I came up with;-), so I had a quick look at the YQL community tables to see if anyone had developed a wrapper for calling at least part of the Slideshare API, and it seems someone has:
So here’s a pipe that generates a list of a user’s 20 most recent Slideshare uploads, along with how many times they have been downloaded:
And here’s how the output looks:
Note to self: make some time to see what other YQL community tables are available…
In For a Penny, In For a Pound… My Promotion “Case for Support”
Just before going away on holiday, I popped up a questionnaire asking for a little help working out what sort of impact – if any – I had on folk that could weave in to my promotion case for support… Thanks to all who took the time out to reply (it was very humbling:-)
Anyway, for what it’s worth, here’s a draft of the Case for Support, which I need to submit tomorrow. Whilst I haven’t been able to add direct quotes from the questionnaire responses – the word limit is set at 1500 words – your responses did inform what I wrote: some of the words are very heavily loaded and more densely packed, on occasion summarising whole responses…
Tony Hirst – Case for promotion to Senior Lecturer
My case for promotion is based around excellence in teaching and scholarship, with a strong theme of digital scholarship and community engagement.
Teaching & contributions to the teaching system
I have chaired three courses (production and presentation), and authored on four others, pushing the elearning agenda through technology and design innovation with a view to reuse.
In 2000, I developed two units for T396 delivered via a novel electronic study guide, providing a unified browser-based interface to online, offline and CD-ROM content, and a mobile website for course alerts. This work identified issues relating to authoring content specifically for browser based delivery on desktop and mobile devices that have informed my work ever since.
A major feature of my approach to the production of teaching materials relates to supporting reuse in other contexts. Whilst writing online material for the T184 robotics course, I commissioned several interactive browser-based activities that have been reused on courses such as TXR174, as well as for outreach. Using T184 software, I developed a range of activities for schools and OU regional Aim Higher/Widening Participation initiatives. These were delivered at over 50 events by the OU Robotics Outreach Group (which I co-founded 1 and whose members produced over 20 formal publications during the period 2000-2007, as well as press coverage). In turn, the activities informed the course material design for the TXR174 residential school robotics activity as a series of worksheets, some of which were consequently reused verbatim on T885. Interactives I developed for TU120 have also been used in wider reaching Library training activities.
As chair of T184 and contributor to TU120, I used Google Analytics for discovering how students used online course and Library materials. This led, in part, to the OU Library website adopting and using Google Analytics for tracking Key Performance Indicators.
In both T184 and TU120, I lobbied for the use of embedded third party content from Youtube and flickr within course materials, working with the OU Rights Department to clarify issues around the use of such content, making it easier for other course teams to draw on similar resources in the future.
For T151 Digital Worlds, I created a flexible and adaptable curriculum of the sort that IET are now exploring using a structure that also informed the development of T123. An innovative custom search engine capable of searching over all and only the public web-based resources linked to from the T151 course materials led to a similar service being adopted in TT381; T151 also used an interactive mindmap (now being explored by LTS) to provide at-a-glance views over the whole course on one screen.
1 As a member of the OU-ROG I helped organise a Blue Peter “Design a Robot” Competition (>20,000 entries); a national junk modeling event (sponsored by 20th Century Fox, funding I secured); the launch of an OU badged hobbyist robotics range sold via high street retailers; convened three conference workshops on artificial intelligence and robotics at UK based international conferences, as well as RoboFesta-UK, a series of five annual meetings attended by between 50 and 100 members of the UK robotics education community each year; I was PI for the EPSRC funded Creative Robotics Research Network (rated tending to outstanding).
Scholarly work
Throughout my career I have explored new methods of digital scholarship and ways of using technology to transform research, dissemination and knowledge construction, developing an international reputation as an advocate of emerging web technologies through community engagement.
Google Suggest (results based in part on frequency of searches around particular search terms) shows my name is strongly associated with The Open University. The line chart shows page views (excluding RSS subscription views) on the OUseful.info blog.
The heart of my scholarly activity is the OUseful.info blog, started in 2005. Since July 2007, it has grown to attract >2000 regular subscribers, c. 1000 views per day, contains over 500 posts (the most notable attracting over 20,000 views each), and has received over 1500 comments; over the last 12 months, there have been c. 20,000 clicks through to external sites, including over 4,000 to a single site (Adobe Flash Privacy settings). Nominated for the 2008 Edublog awards, it is regularly listed in the top 30 UK technology blogs (wikio) and currently has a Technorati rating of 426. Research conducted for Online Services ranked it as the 10th most influential site for ‘distance learning’ (above the BBC) and 2nd as a hub for connections around this term 2.
2 See http://nogoodreason.typepad.co.uk/no_good_reason/2009/06/connections-versus-outputs.html
I have an active Twitter presence (>2400 curated followers, >5000 click-throughs on shared links per month (bit.ly), >100 unique retweeters (klout)).
With an archive of presentations on Slideshare dating back to 2006, my top three presentations have drawn over 20,000 views between them and the twenty presentations posted so far this year have attracted >9000 views.
I am established as a prominent member of the global edu-blogger community, receiving a large number of online citations and credits, and many speaker invitations. As a prominent OU blogger, my work is used as a model for the development of digital scholarship within the University.
Reflecting evolving notions of digital scholarship, my reputation spans several disciplines, as evidenced by a public call I put out for feedback on the impact and influence I have had on others 3.
3 The web based form attracted 26 submissions (24 legitimate) and provides strong anecdotal and personally communicated evidence for the claims that follow. I am happy to provide access to the responses on request.
Since publishing the first MPs’ Travel Expenses Map visualisation in 2009, I have developed a strong reputation in the data journalism area at an international level: my blog posts are used to demonstrate good practice by several industry websites (e.g. journalism.co.uk), my work is being shared in several different UK universities and referenced widely in others’ conference presentations (e.g. regularly by Simon Rogers, Guardian Datastore Editor); I have received several invitations to present at journalism events.
Building on the open source WriteToReply document discussion platform I co-founded in 2009 (as mentioned in the national press (BBC, Guardian)), I helped win JISC Rapid Innovation funding and further exploitation funding for JISCPress. JISCPress has been used to publish JISC Strategy documents and reports, and is currently being discussed as a potential tool for publishing commentable internal OU documents. WriteToReply has been used by several government departments (including DCMS, The Cabinet Office, ONS) to republish consultation and guidance documents in commentable form. WriteToReply has contributed to the development and adoption of “commentable documents” as a consultation strategy type within UK government and, via JISCPress, benefited from an accessibility review commissioned by the Department of Innovation Business and Skills.
In 2007, I brought together an informal team to develop an OU Facebook application, devising and leading the development of the OU Course Profiles application (> 6,000 users soon after it launched), following it up with a peer support application: My OU Story. To date, these remain the OU’s only Facebook applications, although more are now planned with VC support.
My approach towards rapid prototyping has resulted in numerous invitations to run practical workshops, as well as being referenced in several JISC funding calls. The “technology recipes” I publish have been widely adopted and reused by individuals within institutional contexts in the UK and internationally, both for service delivery (e.g. the use of RSS feeds and Yahoo Pipes for content syndication and information processing) as well as teaching (e.g. relating to the use of social technologies in both postgraduate and undergraduate courses).
I have advocated the use of interactive web technologies using lightweight approaches in support of OU/BBC broadcasts: in my role as OU academic supporting the BBC World Service’s Digital Planet, I set precedents in the use of embedded, user generated content from third party services, such as YouTube and flickr within programme support pages, and commissioned the development of an interactive map for listeners that allowed them to show where they were listening to the programme (to date, over 1100 listeners have done so).
Examples of content embedding on open2.net.
Following my Arcadia Fellowship with the Cambridge University Library (“the UL”), several of the recommendations I made fed directly into the UL’s latest round of strategic planning. Outcomes from my Fellowship also fed in to two future service delivery workshops I ran for the OU Library.
My approach towards “openness” is based on a deep belief in the idea of community engagement, and the role of the academic in supporting communities around them. My social networking activities provide an element of extended support to a wide ranging community, not dissimilar to the support provided as part of a PhD supervision process.
My willingness to share ideas means others are free to develop them. The iTitle and uTitle social media caption tools developed by JISC Regional Support Centre’s Martin Hawksey, and that have been used to annotate several conference video archives with backchannel commentary, are a direct result of ideas posted to OUseful.info.
Word count: 1500
If you can track something back to what you said, and if I have misrepresented it, please let me know. If you think there are any glaring omissions, please also let me know;-)
PS interesting.. Impact Research Fellow, DMU: "This post is a unique opportunity to analyse the impact of a group of key social media projects in relation to business innovation and the growing field of transliteracy research. It is ideally suited to a scholar wishing to examine the importance of impact in relation to a substantial example of social media practice" [via @ambrouk/@suethomas]
[UPDATE: the attempt at promotion turned out to be ouseless]
Doodlings Around the Data Driven Journalism Round Table Event Hashtag Community
…got that?;-) Or in other words, this is a post looking at some visualisations of the #ddj hashtag community…
A couple of days ago, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre. I’ll try to post some notes on it, err, sometime; but for now, here’s a quick collection of some of the various things I’ve tinkered with around hashtag communities, using #ddj as an example, as a “note to self” that I really should pull these together somehow, or at least automate some of the bits of workflow; I also need to move away from Twitter’s Basic Auth (which gets switched off this week, I think?) to oAuth*…
*At least Twitter is offering a single access token which “is ideal for applications migrating to OAuth with single-user use cases”. Rather than having to request key and secret values in an oAuth handshake, you can just grab them from the Twitter Application Control Panel. Which means I should be able to just plug them into a handful of Tweepy commands:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth)
So, first up is the hashtag community view showing how an individual is connected to the set of people using a particular hashtag (and which at the moment only works for as long as Twitter search turns up users around the hashtag…)
Having got a Twapperkeeper API key, I guess I really should patch this to allow a user to explore their hashtag community based on a Twapperkeeper archive (like the #ddj archive)…
One thing the hashtag community page doesn’t do is explore any of the structure within the network… For that, I currently have a multi-step process:
1) get a list of people using a hashtag (recipe: Who’s Tweeting our hashtag?) using this Yahoo Pipe and output the data as CSV using a URL with the pattern (replace ddj with your desired hashtag):
http://pipes.yahoo.com/pipes/pipe.run?_id=1ec044d91656b23762d90b64b9212c2c&_render=csv&q=%23ddj
2) take the column of twitter homepage URLs of people tweeting the hashtag and replace http://twitter.com/ with nothing to give the list of Twitter usernames using the hashtag.
3) run a script that finds the twitter numerical IDs of users given their Twitter usernames listed in a simple text file:
import tweepy
auth = tweepy.BasicAuthHandler("????", "????")
api = tweepy.API(auth)
f = open('hashtaggers.txt')
f2 = open('hashtaggersIDs.txt', 'w')
f3 = open('hashtaggersIDnName.txt', 'w')
for uid in f:
    # drop the trailing newline (and skip blank lines) before looking the user up
    uid = uid.strip()
    if not uid:
        continue
    user = api.get_user(uid)
    print(uid)
    f2.write(str(user.id) + '\n')
    f3.write(uid + ',' + str(user.id) + '\n')
f.close()
f2.close()
f3.close()
Note: this a) is Python; b) uses tweepy; c) needs changing to use oAuth.
4) run another script (gist here – note this code is far from optimal or even well behaved; it needs refactoring, and also tweaking so it plays nice with the Twitter API) that gets lists of the friends and the followers of each hashtagger, from their Twitter id and writes these to CSV files in a variety of ways. In particular, for each list (friends and followers), generate three files where the edges represent: i) links between an individual and other hashtaggers (“inner” edges within the community); ii) links between a hashtagger and non-hashtaggers (“outside” edges from the community); iii) links between a hashtagger and both hashtaggers and non-hashtaggers;
5) an edit of the friends/followers CSV files to put them into an appropriate format for viewing in tools such as Gephi or Graphviz. For Gephi, edges can be defined using comma separated pairs of nodes (e.g. userID, followerID) with a bit of syntactic sugar; we can also use the list of Twitter usernames/user IDs to provide nice labels for the community’s “inner” nodes:
nodedef> name VARCHAR,label VARCHAR
23309699,tedeschini
23751864,datastore
13691922,nicolaskb
…
17474531,themediaisdying
2224401,munkyfonkey
edgedef> user VARCHAR,friend VARCHAR
23309699,17250069
23309699,91311875
…
2224401,878051
2224401,972651
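The hand edit in step 5 could be automated with something like the following Python sketch. The function name and the sample IDs and usernames are hypothetical; the nodedef and edgedef header lines are copied from the file shown above.

```python
def to_gdf(labels, edges):
    """Render node labels and edges as GDF text using the column names shown above."""
    lines = ["nodedef> name VARCHAR,label VARCHAR"]
    for user_id, username in labels:
        lines.append("%s,%s" % (user_id, username))
    lines.append("edgedef> user VARCHAR,friend VARCHAR")
    for user_id, friend_id in edges:
        lines.append("%s,%s" % (user_id, friend_id))
    return "\n".join(lines) + "\n"


# Hypothetical IDs and usernames, for illustration only
labels = [(101, "alice_example"), (102, "bob_example")]
edges = [(101, 102), (102, 101)]
gdf_text = to_gdf(labels, edges)
print(gdf_text)
```

Writing gdf_text out to a file with a .gdf extension should give something in the same shape as the hand-edited file.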
Having got a gdf formatted file, we can load it straight in to Gephi:
In 3d view we can get something like this:
Node size is proportional to number of hashtag users following a hashtagger. Colour (heat) is proportional to number of hashtaggers followed by a hashtagger. Arrows go from a hashtagger to the people they are following. So a large node size means lots of other hashtaggers follow that person. A hot/red colour shows that person is following lots of the other hashtaggers. Gephi allows you to run various statistics over the graph to allow you to analyse the network properties of the community, but I’m not going to cover that just now! (Search this blog using “gephi” for some examples in other contexts.)
Use of different layouts, colour/text size etc etc can be used to modify this view in Gephi, which can also generate high quality PDF renderings of 2D versions of the graph:
(If you want to produce your own visualisation in Gephi, I popped the gdf file for the network here.)
If we export the gexf representation of the graph, we can use Alexis Jacomy’s Flex gexfWalker component to provide an interactive explorer for the graph:
Clicking on a node allows you to explore who a particular hashtagger is following:
Remember, arrows go from a hashtagger to the person they are following. Note that the above visualisation allows you to see reciprocal links. The colourings are specified via the gexf file, which itself had its properties set in the Preview Settings dialogue in Gephi:
As well as looking at the internal structure of the hashtag community, we can look at all the friends and/or followers of the hashtaggers. The graph for this is rather large (70k nodes and 90k edges), so after a lazyweb request to @gephi I found I had to increase the memory allocation for the Gephi app (the app stalled on loading the graph when it had run out of memory…).
If we load in the graph of “outer friends” (that is the people the hashtaggers follow who are not hashtaggers) and filter the graph to only show nodes who have more than 5 or so incoming edges we can see which Twitter users are followed by large numbers of the hashtaggers, but who have not been using the hashtag themselves. Because the friends/followers lists return Twitter numerical IDs, we have to do a look up on Twitter to find out the actual Twitter usernames. This is something I need to automate, maybe using the Twitter lookup API call that lets authenticated users look up the details of up to 100 Twitter users at a time given their numerical IDs. (If anyone wants this data from my snapshot of 23/8/10, drop me a line….)
Okay, that’s more than enough for now… As I’ve shared the gdf and gexf files for the #ddj internal hashtaggers network, if anyone more graphically talented than I am would like to see what sort of views they can come up with, either using Gephi or any other tool that accepts those data formats, I’d love to see them:-)
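The “more than 5 incoming edges” filter, and the batching needed for a 100-at-a-time lookup, can both be sketched in a few lines of Python. The function names and the numeric IDs below are invented for illustration, and no Twitter API call is made: the batches are just lists of IDs ready to be passed to whatever lookup call is used.

```python
from collections import Counter


def popular_outer_friends(edges, hashtaggers, min_followers=5):
    """Find non-hashtaggers followed by more than min_followers of the hashtaggers."""
    hashtaggers = set(hashtaggers)
    # Each edge is (user, friend), meaning user follows friend; de-duplicate first
    counts = Counter(friend for user, friend in set(edges)
                     if user in hashtaggers and friend not in hashtaggers)
    return [friend for friend, n in counts.most_common() if n > min_followers]


def batches(ids, size=100):
    """Split a list of IDs into chunks of at most `size`, ready for a bulk lookup."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Hypothetical numeric IDs: hashtaggers 1..8, of whom seven follow 900 and two follow 901
hashtaggers = list(range(1, 9))
edges = [(u, 900) for u in range(1, 8)] + [(1, 901), (2, 901)]
popular = popular_outer_friends(edges, hashtaggers)
print(popular)
print([len(chunk) for chunk in batches(list(range(250)))])
```

Only the IDs that survive the filter need looking up, which keeps the number of lookup calls small even when the full friends graph is large.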
PS It also strikes me that having got the list of hashtaggers, I need to package up this with a set of tools that would let you:
- create a Twitter list around a set of hashtaggers (and maybe then use that to generate a custom search engine over the hashtaggers’ linked to homepages);
- find other hashtags being used by the hashtaggers (that is, hashtags they may be using in arbitrary tweets).
(See also Topic and Event based Twittering – Who’s in Your Community?)
My Slides from the Data Driven Journalism Round Table (ddj)
Yesterday, I was fortunate enough to attend a Data Driven Journalism round table (sort of!) event organised by the European Journalism Centre.
Here are the slides I used in my talk, such as they are… I really do need to annotate them with links, but in the meantime, if you want to track any of the examples down the best way is to probably just search this blog ;-)
(Readers might also be interested in my slides from News:Rewired (although they make even less sense without notes!))
Although most of the slides will be familiar to longtime readers of this blog, there is one new thing in there: the first sketch of a network diagram showing how some of my favourite online apps can work together based on the online file formats they either publish or consume (the idea being once you can get a file into the network somewhere, you can route it to other places/apps in the network…)
The graph showing how a handful of web apps connect together was generated using Graphviz, with the graph defined as follows:
GoogleSpreadsheet -> CSV;
GoogleSpreadsheet -> "<GoogleGadgets>";
GoogleSpreadsheet -> "[GoogleVizDataAPI]";
"[GoogleVizDataAPI]"->JSON;
CSV -> GoogleSpreadsheet;
YahooPipes -> CSV;
YahooPipes -> JSON;
CSV -> YahooPipes;
JSON -> YahooPipes;
XML -> YahooPipes;
"[YQL]" -> JSON;
"[YQL]" -> XML;
CSV->"[YQL]";
XML->"[YQL]";
CSV->"<ManyEyesWikified>";
YahooPipes -> KML;
KML->"<GoogleEarth>";
KML->"<GoogleMaps>";
"<GoogleMaps>"->KML;
RDFTripleStore->"[SPARQL]";
"[SPARQL]"->RDF;
"[SPARQL]"->XML;
"[SPARQL]"->CSV;
"[SPARQL]"->JSON;
JSON-> "<JQueryCharts_etc>";
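The listing above gives only the edge statements; a complete Graphviz file wraps them in a digraph block and needs plain double quotes around any name containing brackets. As a hedged sketch of how such a file could be generated from a list of (source, target) pairs, something like the following would do. The function names and the graph name "apps" are my own choices, not part of the original definition, and only a few of the edges are reproduced.

```python
import re


def dot_id(name):
    """Quote a node name unless it is a plain alphanumeric identifier."""
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        return name
    return '"' + name.replace('"', '\\"') + '"'


def to_dot(edges, graph_name="apps"):
    """Wrap (source, target) pairs in a directed graph definition."""
    lines = [f"digraph {graph_name} {{"]
    lines += [f"  {dot_id(a)} -> {dot_id(b)};" for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


# A few of the edges from the listing above.
edges = [
    ("GoogleSpreadsheet", "CSV"),
    ("CSV", "[YQL]"),
    ("[YQL]", "JSON"),
    ("JSON", "<JQueryCharts_etc>"),
]
print(to_dot(edges))
```

Building the text from a list makes it easy to add or drop a route between apps and regenerate the diagram.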
I had intended to build this up “live” in a text editor using the Graphviz Mac client to display the growing network, but in the end I just showed a static image.
At the time, I’d also forgotten that there is an experimental Graphviz chart generator made available via the Google Charts API, so here’s the graph generated via a URL using the Google Charts API:
Here’s the chart playground view:
PS if the topics covered by the #ddj event appealed to you, you might also be interested in the P2PU Open Journalism on the Open Web course, the “syllabus” of which is being arranged at the moment (and which includes at least one week on data journalism) and which will run over 6 weeks, err, sometime; and the Web of Data Data Journalism Meetup in Berlin on September 1st.
Idle Thoughts – A Few More Approaches to Making CSV Files Queryable
How much more useful would CSV data be if it was queryable?
In a previous post (Using CSV Docs As a Database) I described a recipe for importing a CSV file into a Google spreadsheet so that the data it contained could be queried using the Google visualisation API. Whilst there isn’t a lot of technical knowledge required to republish the data in this way, there is still the overhead of requiring the user to log in to a Google account, create a spreadsheet, import the CSV document and then discover the spreadsheet URL. And bearing in mind the rule of thumb that two clicks is at least one click too many, this route to making documents republishable is likely to be seen as too complicated for many people.
So here are a few partial ideas about other ways in which we might be able to (re)publish CSV documents so that they become queryable. (That is, lightweight ways of providing a query interface to a CSV doc.)
Firstly, I wonder whether or not it would be possible to tweak the Google Chart tools data source implementation example so that a user could just upload or link an already online CSV document as an external data source? The example provides a tutorial for how to use a CSV document as an external datasource by wrapping it with a (provided) Java library that implements the Google data source API; so I wonder: could this example be tweaked so that any CSV files uploaded to or placed in a specified directory/folder could be selected as the external datasource, meaning that a council officer could expose a queryable interface to a CSV document simply by pasting a copy of the CSV document into a particular folder? Or maybe modifying the example to become a service such that if it was passed a link to an online CSV doc, it would allow that document to be treated as the external datasource and provide the query interface to it?
Secondly, it seems to me that YQL also offers a query interface to arbitrary online CSV documents? The post Analyzing World Cup Data with YQL on the Yahoo Developer blog gives an example of how to use YQL to write queries over a CSV document (although rather perversely they use a Google spreadsheet as the CSV source!) but we can presumably do a similar thing with data listed on data.gov.uk for example. Trying out the:
select * from csv where url='http://example.com/whatever.csv'
approach on several URLs threw an error (“Unable to parse data using default charset utf-8”), though apparently it is possible to force the select … from csv handler to accept a specified charset. There are two problems that then arise, for me at least: i) I’m not sure how to specify the charset; ii) I don’t know how to detect the charset of the files that are apparently not being recognised by YQL as utf-8. It does make me think, though, that guidance about setting charsets (maybe as the server specified MIME-TYPE for CSV files? Or am I talking b****ks? If not, maybe this Google apps script URLFetch function might help?!) may be required as part of the gotcha guidelines for publishing online open data?
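On the charset detection question, one crude approach (a sketch only, and nothing to do with how YQL itself works) is to try decoding the raw bytes of the file against a short list of likely encodings and take the first that succeeds. The function name and the candidate list are assumptions; cp1252 is included simply because it is a common encoding for spreadsheet exports.

```python
def guess_charset(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate charset that decodes raw without error."""
    for charset in candidates:
        try:
            raw.decode(charset)
        except UnicodeDecodeError:
            continue  # not valid in this charset, try the next one
        return charset
    return None  # none of the candidates fitted


# Hypothetical CSV fragment containing a pound sign, saved two ways.
text = "Fine,\u00a35\n"
print(guess_charset(text.encode("utf-8")))
print(guess_charset(text.encode("cp1252")))
```

The order of the candidates matters: a permissive single-byte encoding will accept almost any byte sequence, so stricter encodings such as utf-8 need to be tried first. A successful decode only shows the bytes are valid in that charset, not that it is the one the publisher used.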
Anyway, here’s an example of a CSV file indexed on data.gov.uk that I could query – Bournemouth Libraries:
That is:
select * from csv where url='http://www.bournemouth.gov.uk/library/data/libraries.csv' and col1 like '%BH8%'
(A list of the other query filters supported within the YQL SELECT … WHERE statement can be found here: Filtering Query Results (WHERE).)
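To make it clearer what that query is doing, here is a rough local equivalent using Python's csv module: keep the rows where a given column contains a substring. The sample data is invented rather than taken from the Bournemouth file, and treating col1 as the column at index 1 (that is, counting columns from zero) is an assumption on my part. The match here is a plain case-sensitive substring test, which may not behave exactly as the YQL like operator does.

```python
import csv
import io


def rows_where_col_contains(csv_text, col_index, needle):
    """Return the rows whose column at col_index contains needle."""
    reader = csv.reader(io.StringIO(csv_text))
    return [
        row for row in reader
        if len(row) > col_index and needle in row[col_index]
    ]


# Hypothetical data: library name, then address including postcode.
sample = (
    "Library A,1 Example Road BH8 0AA\n"
    "Library B,2 Sample Street BH1 0BB\n"
)
matches = rows_where_col_contains(sample, 1, "BH8")
print(matches)
```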
Although I forget how at the moment, I seem to remember that it is possible to create parameterised short URL queries over YQL, so for example I could presumably now create a short URL query along the lines of http://example.com/yql/bournmeouthLibraries?postcode=???? that would take in the first half of a postcode, for example, and then query the Bournemouth libraries CSV document as a database for libraries whose addresses specify that postcode area?
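I haven't recovered the short URL recipe, but the parameterised part can at least be sketched by building the full query URL from a postcode fragment. The endpoint address and the format parameter below are assumptions about the public YQL service rather than something checked for this post, and the function name is made up; the code only constructs the URL and does not call anything.

```python
from urllib.parse import urlencode

# Assumption: the public YQL endpoint; not verified for this post.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"
CSV_URL = "http://www.bournemouth.gov.uk/library/data/libraries.csv"


def yql_postcode_query_url(postcode_area):
    """Build a YQL query URL filtering the libraries CSV by postcode area."""
    # Only allow letters and digits so the value cannot break out of
    # the quoted string in the query.
    if not postcode_area.isalnum():
        raise ValueError("postcode area must be letters and digits only")
    query = (
        f"select * from csv where url='{CSV_URL}' "
        f"and col1 like '%{postcode_area}%'"
    )
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": "json"})


query_url = yql_postcode_query_url("BH8")
print(query_url)
```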
Thirdly, I almost feel obliged to demonstrate a Yahoo Pipes implementation… so without any real explanation, how about the following?
(I suspect that for large CSV files, the only solution that would work would be the Google Viz API external datasource example. I know Yahoo Pipes borks on large HTML files, and I’ve had YQL time out trying to query a large XML file before now…)
[Note that this is a scheduled post and that I am on holiday at the moment - away from the net and without a computer to hand. Which is why I haven't tried out any of the above, and don't intend to for at least a week or two...;-)]
PS see also this bit of @codepo8 magic http://isithackday.com/csv-to-webservice/ (reminded of this via @OSandCMS)
Crowd Sourcing a Promotion Case…
So racked with embarrassment at doing this (’tis what happens when you don’t publish formally, don’t get academic citations in the literature, and don’t have a “proper” academic impact factor;-) I’m going to take the next 10 days off in a place with no internet connection…. but anyway, here goes: an attempt at crowd-sourcing parts of my promotion case….
Approaches to Soliciting Opinions for Institutional Responses to Formal Consultations
One of the things we didn’t put into the original JISCPress bid – though in hindsight we might have – was a use case for commentable documents in the context of government consultations soliciting formal responses from higher education institutions (for example, Universities UK: Review of External Examining Arrangements in the UK).
From a chat with Alison Nash in the OU’s recently reorganised Strategy Unit (I think?), it seems that candidate consultations are fielded by a member of that unit who then emails likely suspects (identified quite how, I’m not sure?) with either a link to, or copy of, the consultation document; (these are typically Word or PDF documents). As with many of the consultations we have looked at in the context of Write To Reply, the consultations typically have a set of questions associated with them that are distributed throughout the consultation document as a whole. Comments and responses to questions are then returned by email (I didn’t ask whether this is typically in the body of an email message, in a Word document, or as comments or highlighted changes on a copy of the original consultation), collated (again, I’m not sure how? One way would be to use a spreadsheet, with rows for respondent and columns for each question (or vice versa)), and used to frame the institutional response. (I’m not sure if a draft of the institutional response was then circulated to the original commenters for final comment…?) The question that was then asked was: would a WriteToReply style approach be appropriate for managing returns of comments and answers to consultation questions in a rather more organised way than is currently the case?
(If anyone from the OU, or other HEIs who engage in producing formal institutional responses to consultations would like to provide further detail about the workflow for soliciting internal comments, producing draft and final versions of institutional responses, and then tracking the impact of comments made in the response, please post a comment to this post…)
Here are some thoughts/matters arising relating to how the WriteToReply/JISCPress/digress.it approach might apply:
- comments may need to be private; this could be achieved by hosting WordPress within the firewall, limiting who can view comments to members of the institution, or not making comments public (e.g. by moderating them, meaning that only the blog owner could see them). Limiting who can make comments can be achieved by requiring users to log in to the blog, and only providing certain users with log in accounts.
- it may not be appropriate to allow commenting on all paragraphs, instead requiring users to only comment on actual questions. This might be achieved by disabling comments on all pages except a single summary page that contains one paragraph per question, maybe with links back to the actual posts that contain the question in context.
- if comments are solicited throughout the document, a dashboard tool such as Netvibes can be used to aggregate comments from different sections of the document; tools like Yahoo Pipes can also be used to aggregate comments from separate areas of the document and display them in a single view. Views over comments by individual commenters are also available and may be collated together on dashboard pages (for example, with separate pages aggregating comments from different sorts of commenter – e.g. allowing views over responses by Faculty, for example).
- once a formal response has been produced, it may be appropriate to post it on the consultation site to allow commenters to see how their comments were or weren’t integrated into the official response (maybe leaving it open to them to submit a personal response to the consultation if they feel their views were not appropriately reflected, if at all). (The more I think about the process of these document based consultations, the more I feel a feedback loop is required that allows folk to see what sort of impact, if any, their comments may have had. I also briefly touched on this in On the Different Roles Documents and Comments May Take in a Commentable Document.) The consultation document site then becomes an important part of institutional memory, archiving as it does the original consultation, individual comments from members of the institution, and the institution’s formal response. It might also be the case that a draft of the institutional response is placed on the same site and comments on it solicited. (The site would then be hosting documents in two modes – the original consultation mode document, and then a draft mode document; again, this distinction appears in the Different Roles blog post.)
In many cases, it might be that the paragraph level commenting approach is not appropriate – unless comments are limited to just the consultation questions themselves, each as a separate commentable item. Where it is appropriate to isolate consultation questions from the surrounding text, a simple form may provide the best way of capturing comments.
In the OU, where I believe we are about to start rolling out Google Apps for Education to at least some of our students over the next month or two, it might be appropriate to look at using a Google form as a platform for capturing comments. As well as satisfying the immediate goal (capture comments in a centralised way), this approach would also provide a legitimate and low risk use case for exploring how we might make use of the Google Apps environment as part of internal business processes.
The simplest case, then, would be for the internal staff member responsible for gathering comments to create a Google form. I don’t know if internal staff members have yet been issued with login details for how to access Google Apps on the open.ac.uk domain, but in the interim they can either create a personal Google account (or I could let them have an account on one of my Google apps domains!). Creating a form can be done either from the main docs menu, or within a Google spreadsheet (the posted form results are collated within a spreadsheet).
For most consultations based around a set of specific questions, the format of the form would look something like this:
That is, a copied and pasted version of each consultation question (with minor tweaks so the question makes sense in a standalone questionnaire) as a separate form item, with a Paragraph text element for the response.
If additional commentary is required, the section head (which includes a description component) can be used to display it:
It might also be worth capturing “any other comments” in a final paragraph text comment at the end of the questionnaire.
Although the form, once published, would be open to anyone on the apps domain, (if they knew the URL), a further “security” measure would be to prompt the user for a consultation “pass phrase” emailed to them as part of the request for comments (“please enter this keyphrase when you complete the form so we can put your responses into the class of ‘high priority’ responses”;-) This might even be a required element.
Alternatively, a keyphrase element could be used to sort the responses in the results spreadsheet, or as suggested above in the context of digress.it, used to sort responses for example by Faculty. (Alternatively, an optional unique key code could be generated for each invited response to identify their responses. Or we could request an OU identifier, name, email address etc to track who made what comment, though these approaches are gameable and don’t necessarily imply that the person with a given identifier is the person who submitted the form…) If users are logged in to the Google Apps environment, it may be that their identity is recorded anyway…? Hmm….
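By way of illustration, generating a unique key code per invitee could be as simple as the following sketch, which uses Python's secrets module. The function name, the code length and the example addresses are all hypothetical, and as noted above a code like this only shows that the respondent had the code, not who they actually are.

```python
import secrets


def issue_key_codes(invitees, nbytes=4):
    """Map each invitee to a distinct random hex code (2 * nbytes chars)."""
    codes = {}
    used = set()
    for invitee in invitees:
        code = secrets.token_hex(nbytes)
        # Regenerate in the unlikely event of a clash with an earlier code.
        while code in used:
            code = secrets.token_hex(nbytes)
        used.add(code)
        codes[invitee] = code
    return codes


# Hypothetical invitation list.
codes = issue_key_codes(["a.n.other@example.com", "j.bloggs@example.com"])
for invitee, code in codes.items():
    print(invitee, code)
```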
For just collecting responses, pretty much anyone could just set up the form and then email the link to the form to the potential commenters. With the availability of Google Apps script, and a little bit of developer time, it would also be possible to provide alerts to the internal consultation organiser whenever a form submission is made, provide automated collation of responses by question and pop these into a Google wordprocessor doc (I think…?!) and even manage a circulation list – so for example, a list of respondents could be created in a spreadsheet, used to mail out invitations for them to complete the form, and then track their response. In the event of them not responding within a certain period, an automated reminder could be sent out. (I’m guessing it would take about a day to build and test such a workflow, which once created would be reusable.)
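The collation and reminder logic is simple enough to sketch independently of Google Apps script. The following is plain Python rather than Apps script, working on made-up data in a made-up shape (one dict per form submission), so it illustrates the bookkeeping only and not how the spreadsheet or mail services would actually be called.

```python
def collate_by_question(responses):
    """Group non-empty answers under each question, keeping who said what."""
    collated = {}
    for response in responses:
        for question, answer in response["answers"].items():
            if answer.strip():
                collated.setdefault(question, []).append(
                    (response["respondent"], answer)
                )
    return collated


def needs_reminder(invited, responses):
    """List invitees, in invitation order, who have not yet responded."""
    responded = {response["respondent"] for response in responses}
    return [person for person in invited if person not in responded]


# Hypothetical invitation list and form submissions.
invited = ["ann", "bob", "cat"]
responses = [
    {"respondent": "ann", "answers": {"Q1": "Agree", "Q2": ""}},
    {"respondent": "cat", "answers": {"Q1": "Disagree", "Q2": "See Q1"}},
]
print(collate_by_question(responses))
print(needs_reminder(invited, responses))
```

Collating by question rather than by respondent gives whoever drafts the institutional response all the comments on one question side by side.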
Another advantage of using the Google Apps approach would be that the response spreadsheet (or an automagically maintained Google wordprocessor doc version of it) could be shared to other members of the team providing the formal institutional response as an online shared document appearing in each individual’s Google docs “inbox”.
PS it seems that within a Google Apps for Edu environment, it may now also be possible for users to edit their form responses if they want to revise their answers…
PPS it’s also worth noting here a couple of practical considerations about how to write a consultation document bearing in mind that someone might put together a form to collate the responses. Firstly, the question should make sense as a standalone item (i.e. out of context) or very clearly identify what it is referring to rather than just “the above”. Secondly, if the questions are collated together in a single appendix, it makes it easier to just check off that each question has been included in the form. (It’s also handy as a one page item for someone who is putting together the response.) Links to the original context also help; (in a sense, this sort of Appendix is like a “List of Tables” or “List of Figures” that acts as a contents page for locating questions within the document). Reading over the questions in an Appendix will also make it obvious whether or not the question was written in such a way that it implicitly refers to content surrounding it in the original embedded context (“see the above” again…). Note that I’m not saying questions shouldn’t be embedded, just that when they are taken out of context, they should still make sense and read well. In the example I give above about external examiners, the questions had to be tweaked so that they made sense as standalone items.

































