
Authoring Dynamic Documents in IPython / Jupyter Notebooks?

One of the reasons I started writing the Wrangling F1 Data With R book was to see how it felt writing combined text, code and code output materials in the RStudio/RMarkdown context. For those of you that haven’t tried it, RMarkdown lets you insert executable code elements inside a markdown document, either as code blocks or inline. The knitr library can then execute the code and display the code output (which includes tables and charts) and pandoc transforms the output to a desired output document format (such as HTML, or PDF, for example). And all this at the click of a single button.

In IPython (now Jupyter) notebooks, I think we can start to achieve a similar effect using a combination of extensions. For example:

  • python-markdown allows you to embed (and execute) python code inline within a markdown cell by enclosing it in double braces (for example, I could say “{{ print(‘hello world’) }}”);
  • hide_input_all is an extension that will hide code cells in a document and just display their executed output; it would be easy enough to tweak this extension to allow a user to select which cells to show and hide, capturing that cell information as cell metadata;
  • Readonly allows you to “lock” a cell so that it cannot be edited; using a notebook server that implements this extension means you can start to protect against accidental changes being made to a cell within a particular workflow; in a journalistic context, assigning a quote to a python variable, locking that code cell, and then referencing that quote/variable via python-markdown might be one way of working, for example.
  • Printview-button will call nbconvert to generate an HTML version of the current notebook – however, I suspect this does not respect the extension based customisations that operate on cell metadata. To do that, I guess we need to generate our output via an nbconvert custom template? (The Download As... notebook option doesn’t seem to save the current HTML view of a notebook either?)

So – my reading is: tools are there to support the editing side (inline code, marking cells to be hidden etc) of dynamic document generation, but not necessarily the rendering to hard copy side, which needs to be done via nbconvert extensions?
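For what it’s worth, later versions of nbconvert grew a TagRemovePreprocessor that can do this sort of metadata-driven hiding without a fully custom template. A minimal sketch, assuming the cells to be hidden have been given a hide_input tag (the tag name and the report.ipynb filename are just placeholders):

import nbformat
from traitlets.config import Config
from nbconvert import HTMLExporter

# Strip the input (but keep the output) of any code cell tagged "hide_input"
c = Config()
c.TagRemovePreprocessor.remove_input_tags = ("hide_input",)
c.TagRemovePreprocessor.enabled = True
c.HTMLExporter.preprocessors = ["nbconvert.preprocessors.TagRemovePreprocessor"]

# "report.ipynb" is a placeholder filename for the notebook being rendered
nb = nbformat.read("report.ipynb", as_version=4)
body, _resources = HTMLExporter(config=c).from_notebook_node(nb)

with open("report.html", "w") as f:
    f.write(body)

That still leaves the question of how the notebook extensions would write the tag/metadata in the first place, of course.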

Related: Seven Ways of Running IPython Notebooks

Paying for Dropbox and Other Useful Bits… (The Cost of Doing Business…)

A couple of years ago or so, Dropbox ran a promotion for academic users granting 15GB of space. Yesterday, I got an email:

As part of your school’s participation in Space Race, you received 15 GB of additional Dropbox space. The Space Race promotional period expires on March 4, 2015, at which point your Dropbox limit will automatically return to 5 GB.

As a friendly reminder, you’re currently using 14.6 GB of Dropbox space. If you’re over your 5 GB limit after March 4, you’ll no longer be able to save new photos, videos, and docs to Dropbox.

Need more space? Dropbox Pro gives you 1 TB of space to keep everything safe, plus advanced sharing controls, remote wipe for lost devices, and priority support. Upgrade before March 4 and we’ll give you 30% off your first year.

My initial thought was to tweet:

but then I thought again… The discounted price on a monthly payment plan is £5.59/month, which on PayPal converted this month to $8.71. I use Dropbox all the time, and it forms part of my workflow for using Leanpub. As it’s the start of the month, I received a small royalty payment for the Wrangling F1 Data With R book. The Dropbox fee is about the amount I’m getting per book sold, so it seems churlish not to subscribe to Dropbox – it is part of the cost of doing business, as it were.

The Dropbox subscription gets me 1TB, so this also got me thinking:

  • space is not now an issue, so I can move the majority of my files to Dropbox, not just a selection of folders;
  • space is not now an issue, so I can put all my github clones into Dropbox;
  • space is not now an issue, so though it probably goes against terms of service, I guess I could set up top-level “family member” folders and we could all share the one subscription account, just selectively synching our own folders?

In essence, I can pretty much move to Dropbox (save for those files I don’t want to share/expose to US servers etc etc; just in passing, one thing Dropbox doesn’t seem to want to let me do is change the account email to another email address that I have another Dropbox account associated with. So I have a bit of an issue with juggling accounts…)

When I started my Wrangling F1 Data With R experiment, the intention was always to make use of any royalties to cover the costs associated with that activity. Leanpub pays out if you are owed more than $40 collected in the run up to 45 days ahead of a payment date (so the Feb 1st payout was any monies collected up to mid-December and not refunded since). If I reckon on selling 10 books a month, that gives me about $75 at current running. Selling 5 a month (so one a week) means it could be hit or miss whether I make the minimum amount to receive a payment for that month. (I could of course put the price up. Leanpub lets you set a minimum price but allows purchasers to pay what they want. I think $20 is the highest amount paid for a copy I’ve had to date, which generated a royalty of $17.50 (whoever that was – thank you :-)) You can also give free or discounted promo coupons away.) As part of the project is to explore ways of identifying and communicating motorsport stories, I’ve spent royalties so far on:

  • a subscription to GP+ (not least because I aspire to getting a chart in there!;-);
  • a subscription to the Autosport online content, in part to gain access to forix, which I’d forgotten is rubbish;
  • a small donation to sidepodcast, because it’s been my favourite F1 podcast for a long time.

Any books I buy in future relating to sports stats or motorsport will be covered henceforth from this pot. Any tickets I buy for motorsport events, and programmes at such events, will also be covered from this pot. Unfortunately, the price of an F1 ticket/weekend is just too much. A Sky F1 Channel subscription or day passes are also ruled out because I can’t for the life of me work out how much it’ll cost or how to subscribe; but I suspect it’ll be more than the £10 or so I’d be willing to pay per race (where race means all sessions in a race weekend). If my F1 iOS app subscription needs updating that’ll also count. Domain name registration (for example, I recently bought f1datajunkie.com) is about £15/$25 a year from my current provider. (Hmm, that seems a bit steep?) I subscribe to Racecar Engineering (£45/$70 or so per year), the cost of which will get added to the mix. A “big ticket” item I’m saving for (my royalties aren’t that much) on the wants list is a radio scanner to listen in to driver comms at race events (I assume it’d work?). I’d like to be able to make a small regular donation to help keep the ergast site running, but can’t see how to… I need to bear in mind tax payments, but also consider the above as legitimate costs of a self-employed business experiment.

I also figure that as an online publishing venture, any royalties should also go to supporting other digital tools I make use of as part of it. Some time ago, I bought in to the pinboard.in social bookmarking service, I used to have a flickr pro subscription (hmm, I possibly still do? Is there any point…?!) and I spend $13 a year with WordPress.com on domain mapping. In the past I have also gone ad-free ($30 per year). I am considering moving to another host such as Squarespace ($8 per month), because WordPress is too constraining, but am wary of what the migration will involve and how much will break. Whilst self-hosting appeals, I don’t want the grief of doing my own admin if things go pear shaped.

I’m a heavy user of RStudio, and have posted a couple of Shiny apps. I can probably get by on the shinyapps.io free plan for a bit (10 apps) – just – but the step up to the basic plan at $39 a month is too steep.

I used to use Scraperwiki a lot, but have moved away from running any persistent scrapers for some time now. morph.io (which is essentially Scraperwiki classic) is currently free – though looks like a subscription will appear at some point – so I may try to get back into scraping in the background using that service. The Scraperwiki commercial plan is $9/month for 10 scrapers, $29 per month for 100. I have tended in the past to run very small scrapers, which means the number of scrapers can explode quickly, but $29/month is too much.

I also make use of github on a free/open plan, and while I don’t currently have any need for private repos, the entry level micro-plan ($7/month) offers 5. I guess I could use a (private?) github rather than Dropbox for feeding Leanpub, so this might make sense. Of course, I could just treat such a subscription as a regular donation.

It would be quite nice to have access to IPython notebooks online. The easiest solution to this is probably something like wakari.io, which comes in at $25/month, which again is a little bit steep for me at the moment.

In my head, I figure £5/$8/month is about one book per month, £10/$15 is two, £15/$20 is three, £25/$40 is 5. I figure I use these services and I’m making a small amount of pin money from things associated with that use. To help guarantee continuity in provision and maintenance of these services, I can use the first step of a bucket brigade style credit apportionment mechanism to redistribute some of the financial benefits these services have helped me realise.

Ideally, what I’d like to do is spend royalties from 1 book per service per month, perhaps even via sponsored links… (Hmm, there’s a thought – “support coupons” with minimum prices set at the level to cover the costs of running a particular service for one month, with batches of 12 coupons published per service per year… Transparent pricing, hypothecated to specific costs!)

Of course, I could also start looking at running my own services in the cloud, but the additional time cost of getting up and running, as well as hassle of administration, and the stress related to the fear of coping in the face of attack or things properly breaking, means I prefer managed online services where I use them.

Ephemeral Citations – When Presentations You Have Cited Vanish from the Public Web

A couple of months ago, I came across an interesting slide deck reviewing some of the initiatives that Narrative Science have been involved with, including the generation of natural language interpretations of school education grade reports (I think: some natural language take on an individual’s academic scores, at least?). With MOOC fever in part focussing on the development of automated marking and feedback reports, this represents one example of how we might take numerical reports and dashboard displays and turn them into human readable text with some sort of narrative. (Narrative Science do a related thing for reports on schools themselves – How To Edit 52,000 Stories at Once.)

Whenever I come across a slide deck that I think may be in danger of being taken down (for example, because it’s buried down a downloads path on a corporate workshop promoter’s website and has CONFIDENTIAL written all over it) I try to grab a copy of it, but this presentation looked “safe” because it had been on Slideshare for some time.

Since I discovered the presentation, I’ve been recommending it to various folk, particularly slides 20-22(?) that refer to the educational example. Trying to find the slidedeck today, a websearch failed to turn it up, so I had to go sniffing around to see if I had mentioned a link to the original presentation anywhere. Here’s what I found:

no narrative science slideshow

The Wayback machine had grabbed bits and pieces of text, but not the actual slides…

wayback narrative science

Not only did I not download the presentation, I don’t seem to have grabbed any screenshots of the slides I was particularly interested in… bah:-(

For what it’s worth, here’s the commentary:

Introduction to Narrative Science — Presentation Transcript

We Transform Data Into Stories and Insight… In Seconds
Automatically, Without Human Intervention and at a Significant Scale
To Help Companies: Create New Products, Improve Decision-Making, Optimize Customer Interactions
Customer Types: Media and Publishing, Data Companies, Business Reporting
How Does It Work? The Data → The Facts (Stats) → The Angles (Tests) → The Structure (Calls) → The Narrative (Language) → Completed Text. Our technology platform, Quill™, is a powerful integration of artificial intelligence and data analytics that automatically transforms data into stories.
The following slides are examples of our work based upon a simple premise: structured data in, narrative out. These examples span several domains, including Sports Journalism, Financial Reporting, Real Estate, Business Intelligence, Education, and Marketing Services.
Sports Journalism: Big Ten Network – Data In
Sports Journalism: Big Ten Network – Narrative
Financial Journalism: Forbes – Data In
Financial Journalism: Forbes – Narrative
Short Sale Reporting: Data Explorers – JSON Input
Short Sale Reporting: Data Explorers – Overview. North America Consumer Services Short Interest Update: There has been a sharp decline in short interest in Marriott International (MAR) in the face of an 11% increase in the company’s stock price. Short holdings have declined nearly 14% over the past month to 4.9% of shares outstanding. In the last month, holdings of institutional investors who lend have remained relatively unchanged at just below 17% of the company’s shares. Investors have built up their short positions in Carnival (CCL) by 54.3% over the past month to 3.1% of shares outstanding. The share price has gained 8.3% over the past week to $31.93. Holdings of institutional investors who lend are also up slightly over the past month to just above 23% of the common shares in issue by the company. Institutional investors who make their shares available to borrow have reduced their holdings in Weight Watchers International (WTW) by more than 26% to just above 10% of total shares outstanding over the past month. Short sellers have also cut back their positions slightly to just under 6% of the market cap. The price of shares in the company has been on the rise for seven consecutive days and is now at $81.50.
Sector Reporting: Data Explorers – JSON Input
Sector Reporting: Data Explorers – Overview. Thursday, October 6, 2011 12:00 PM: HEALTHCARE MIDDAY COMMENTARY: The Healthcare (XLV) sector underperformed the market in early trading on Thursday. Healthcare stocks trailed the market by 0.4%. So far, the Dow rose 0.2%, the NASDAQ saw growth of 0.8%, and the S&P500 was up 0.4%. Here are a few Healthcare stocks that bucked the sector’s downward trend. MRK (Merck & Co Inc.) erased early losses and rose 0.6% to $31.26. The company recently announced its chairman is stepping down. MRK stock traded in the range of $31.21 – $31.56. MRK’s volume was 86.1% lower than usual with 2.5 million shares trading hands. Today’s gains still leave the stock about 11.1% lower than its price three months ago. LUX (Luxottica Group) struggled in early trading but showed resilience later in the day. Shares rose 3.8% to $26.92. LUX traded in the range of $26.48 – $26.99. Luxottica Group’s early share volume was 34,155. Today’s gains still leave the stock 21.8% below its 52-week high of $34.43. The stock remains about 16.3% lower than its price three months ago. Shares of UHS (Universal Health Services Inc.) are trading at $32.89, up 81 cents (2.5%) from the previous close of $32.08. UHS traded in the range of $32.06 – $33.01…
Real Estate: Hanley Wood – Data In
Real Estate: Hanley Wood – Narrative
BI: Leading Fast Food Chain – Data In
BI: Leading Fast Food Chain – Store Level Report. January Promotion Falling Behind Region: The launch of the bagels and cream cheese promotion began this month. While your initial sales at the beginning of the promotion were on track with both your ad co-op and the region, your sales this week dropped from last week’s 142 units down to 128 units. Your morning guest count remained even across this period. Taking better advantage of this promotion should help to increase guest count and overall revenue by bringing in new customers. The new item with the greatest growth opportunity this week was the Coffee Cake Muffin. Increasing your sales by just one unit per thousand transactions to match sales in the region would add another $156 to your monthly profit. That amounts to about $1872 over the course of one year.
Education: Standardized Testing – Data In
Education: Standardized Testing – Study Recommendations
Marketing Services & Digital Media: Data In
Marketing Services & Digital Media: Narrative


PS Slideshare appears to have a new(?) feature – Saved Files – that keeps a copy of files you have downloaded. Or does it? If I save a file and someone deletes it, will the empty shell only remain in my “Saved Files” list?

A Question About Klout…

I’ve no idea how Klout works out its scores, but I’m guessing that there is an element of PageRank style algorithmic bootstrapping going on, in which a person’s Klout score is influenced by the Klout score of folk who interact with that person.

So for example, if we look at @briankelly, we see how he influences other influential (or not) folk on Klout:

One thing I’ve noticed about my Klout score is that it tends to be lower than that of most of the folk I have an OU/edtech style relationship with; and no, I don’t obsess about it… I just occasionally refer to it when Klout is in the news, as it was today with an announced tie-up with Bing: Bing and Klout Partner to Strengthen Social Search and Online Influence. In this case, if my Bing search results are going to be influenced by Klout, I want to understand what effect that might have on the search results I’m presented with, and how my content/contributions might be being weighted in other people’s search results.

So here’s a look at the Klout scores of the folk I’ve influenced on Klout:

Hmm… seems like many of them are sensible and are completely ignoring Klout. So I’m wondering: is my Klout score depressed relative to other ed-tech folk who are on Klout because I’m not interacting with folk who are playing the Klout game? Which is to say: if you are generating ranking scores based at least in part on the statistics of a particular network, it can be handy to know what network those stats are being measured on. If Klout stats are dominated by components based on network statistics calculated from membership of the Klout network, that is very different to the sorts of scores you might get if the same stats were calculated over the whole of the Twitter network graph…

Sort of, but not quite, related: a few articles on sampling error and sample bias – Is Your Survey Data Lying to You? and The Most Dangerous Profession: A Note on Nonsampling Error.

PS Hmmm.. I wonder how my Technorati ranking is doing today…;-)

In Passing, Quick Comments On An OER Powered Digital Scholarship Resource Site

digitalscholarship.ac.uk is a new OER powered resource site intended to support the use of digital tools in academic study. Resources are both tagged and organised into a series of learning topics:

(I’m not sure how the tags relate to the learning topics? That’s one for my to do list…;-)

Some really quick observations that struck me about the pages used to describe resources that are linked to from the site:

  1. I couldn’t immediately work out where the link to the resource actually was (it’s the dark blue backgrounded area top left, the sort of colour that for me gets pushed in to the background of a page and completely ignored, or the “open link in new window” link); I expected the value of the “Type of Media” attribute (eg “PDF download” in the above case) to be the link to the resource, rather than just a metadata statement.
  2. What on earth is going on with the crumb trail… [ 1 ] [ 2 ] etc to me represents page numbers of a kind (eg first page of results, second page of results) not depth in some weird hierarchy, as seems to be the case in this site?
  3. The comment captcha asks you “What is the code in the image?” Erm, code???? Do I have to decipher the captcha characters somehow (some Captchas offer a simple sum for you to calculate, for example)? Erm… What do I do? What do I do??!?!!?
  4. I got a bit confused at first that the page was just using redirects rather than direct links to resources – the “Visit and rate this resource” link is a redirect that loads the target resource into a frame topped by a comment bar. (I didn’t spot, at first, that the ‘open in new window’ link points directly to the resource. And, erm, why would I open in a new window? New tab, maybe, though typically I choose to do that via a right-click contextual menu action if I don’t want a page to load in the current tab?)

The “Visit and Rate” link adds a topbar view over the actual resource page:

The “Add a comment” call to action (again, deep blue background which pushed it away from my attention) opens a comment page in a new tab… I think I’d have preferred this to open within the context of the top bar, so that I could refer directly to the resource within the context of the same tab whilst actually making a comment?

One final comment – and something I’ll try to get round to at some point: how do the resources look as a graph…?;-) It would be great if all the resources were available as data via an OPML feed, with one item per resource and also metadata describing the tags and Learning Topics identified, and then we could map how tags related to Learning Topics and maybe try to learn something from that. As a data file listing the resources doesn’t seem to be available, a scrape is probably called for… Here’s a recipe I’ll maybe try out at some point:

– scrape the tagcloud page for: tags and individual tag page URL; learning topic page URLs
– for each tag page and learning topic page, grab the URLs to resource pages, deduping along the way; this should grab us links to all resource pages, including pathological ones that have a tag but no topic, or a topic but no tag;
– for each resource page, scrape everything ;-)
– and finally: play with the data… maybe do a few graphs, maybe generate a custom search engine over the resources (or add it to my UK HEI Library Community CSE [about] etc etc)
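If I do get round to it, a first pass might look something like the following sketch, using requests and BeautifulSoup (the tagcloud URL, the “tag”/“topic”/“resource” URL patterns and the link selection are all guesses that would need checking against the actual site markup):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "http://www.digitalscholarship.ac.uk/"
TAGCLOUD_URL = urljoin(BASE, "tags")  # placeholder path - check where the tagcloud page actually lives

def get_soup(url):
    return BeautifulSoup(requests.get(url).text, "html.parser")

# Step 1: scrape the tagcloud page for tag page and learning topic page URLs
soup = get_soup(TAGCLOUD_URL)
listing_urls = {urljoin(BASE, a["href"]) for a in soup.find_all("a", href=True)
                if "tag" in a["href"] or "topic" in a["href"]}

# Step 2: for each tag/topic page, grab resource page URLs, deduping via a set
resource_urls = set()
for url in listing_urls:
    page = get_soup(url)
    resource_urls.update(urljoin(BASE, a["href"]) for a in page.find_all("a", href=True)
                         if "resource" in a["href"])

# Step 3: for each resource page, scrape "everything" - just the page title here as a placeholder
for url in sorted(resource_urls):
    title = get_soup(url).find("h1")
    print(url, title.get_text(strip=True) if title else "")

The resource page scrape itself would then pull out the tags, Learning Topics and the rest of the metadata for the graph/OPML experiments.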

PS Martin Weller has blogged a short note about the process used to identify OERs included in the site here: Two OER sites for researchers.

JISC Project Blog Metrics – Making Use of WordPress Stats. Plus, An Aside…

Brian has a post out on Beyond Blogging as an Open Practice, What About Associated Open Usage Data?, and proposes that “when adopting open practices, one should be willing to provide open accesses to usage data associated with the practices” (his emphasis).

What usage stats are relevant though? If you’re on a hosted WordPress blog, it’s easy enough to pull out in a machine readable way the stats that WordPress collects about your blog and makes available to you (albeit at the cost of revealing a blog specific API key in the URL. Which means that if this key provides access to anything other than stats, particularly if it provides write access to any part of your blog, it’s probably not something you’d really want to share in public… [Getting your WordPress.com Stats API Key])

That said, you can still hand craft your own calls to the WordPress stats API, and extract your own usage data as data, using the WordPress Stats API.

So for example, one form of URL will pull in a summary of November’s views data; another will pull in a list of referrers.
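Just to make the pattern concrete, here’s a rough Python sketch of the sort of calls involved, using the same csv.php pattern as the spreadsheet formula further down (the api_key and blog_uri values are placeholders, and the views/days/summarize parameters are my reading of the Stats API docs rather than anything lifted from the original URLs):

import requests

STATS_URL = "http://stats.wordpress.com/csv.php"
auth = {
    "api_key": "YOUR_API_KEY",                   # placeholder - your personal WordPress.com API key
    "blog_uri": "http://example.wordpress.com",  # placeholder - the blog you want stats for
}

# A summary of the month's page views; the summarize flag asks for a single
# summary row rather than day-by-day figures (my reading of the API docs)
views = requests.get(STATS_URL, params=dict(auth, table="views", days=30,
                                            end="2011-11-30", summarize=""))
print(views.text)

# The grouped referrers table - the same table name as used in the CONCATENATE formula below
referrers = requests.get(STATS_URL, params=dict(auth, table="referrers_grouped",
                                                end="2011-11-30"))
print(referrers.text)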

For what it’s worth, I’ve started cobbling together a spreadsheet that can pull in live data, or custom ranged reports, from WordPress: WordPress Stats into Google Spreadsheets (make your own personal copy of the spreadsheet if you want to give it a try). This may or may not become a work in progress… at the moment, it doesn’t even support the full range of URL parameters/report configurations (for the time being at least, that is left “as an exercise for the reader”;-)

The approach I took is very simplistic, simply based around crafting URLs that grab specified sets of CSV formatted data, and pop them into a spreadsheet using the =importData() formula (I’m sure Martin could come up with something far more elegant;-); that said, it does provide an example of how to get started with a bit of programmatic URL hacking… and if you want to get started with handcrafting your own URLs, it provides a few examples there too….:-)

The pattern I used was to define a parameter spreadsheet, and then CONCATENATE parameter values to create the URLs; for example:

=importdata(CONCATENATE("http://stats.wordpress.com/csv.php?", "api_key=", Config!B2, "&blog_uri=", Config!B3, "&end=", TEXT(Config!B6,"YYYY-MM-DD"), "&table=referrers_grouped"))

One trick to note is that I defined the end parameter setting in the configuration sheet as a date type, displayed in a particular format. When we grab this data value out of the configuration sheet we’re actually grabbing a date typed record, so we need to use the TEXT() formula to put it into the format that the WordPress API requires (arguments of the form 2011-11-30).

If you want to use the spreadsheet to publish your own data, I guess one way would be to keep the privacy settings private, but publish the sheets you are happy for people to see. Just make sure you don’t reveal your API key;-) [If you know of a good link/resource describing best practice around publishing public sheets from spreadsheets that also contain, and draw on, private data, such as API keys, please post a link in the comments below;-)]

[A note on the stats: the WordPress stats made available via the API seem to relate to page views/visits to the website. Looking at my own stats, views from RSS feeds seem to be reported separately, and (I think) this data is not available via the WordPress stats API? If, as I do, you run your blog RSS feed through a service like Feedburner, to get a fuller picture of how widely the content on a blog is consumed, you’d need to report both the WordPress stats and the Feedburner stats, for example. Which leads to the next question, I guess: how can we (indeed, can we at all?) pull feed stats out of Feedburner?]

At this point, I need to come back to the question related above: what usage stats are relevant, particularly in the case of a JISC project blog? To my mind, a JISC project blog can support a variety of functions:

– it serves as a diary for the project team allowing them to record micro-milestones and solutions to problems; if developers are allowed to post to the blog, this might include posts at the level of granularity of a Stack Overflow Q and A, compared to the 500 word end-of-project post that tries to summarise how a complete system works;
– it can provide a feed that others can subscribe to to keep up to date with the project without having to hassle the project team for updates;
– it can provide context for the work by linking out to related resources, an approach that also might alert other projects who watch for trackbacks and pingbacks to the project;
– it provides an opportunity to go fishing in a couple of ways: firstly, by acting as a resource others can link to (with the triple payoff that it contextualises the project further, it may suggest related work the project team are unaware of by means of trackbacks/pingbacks into the project blog, and it may turn up useful commentary around the project); secondly, by providing a place where other interested parties might engage in discussion, commentary or feedback around elements of the project, via blog comments.

Even if a blog only ever gets three views per post, they may be really valuable views. For me what’s important is how the blog can be used to document interesting things that might have been turned up in the course of doing the project that wouldn’t ordinarily get documented. Problems, gotchas, clever solutions, the sudden discovery of really useful related resources. The blog also provides an ongoing link-basis for the project, something that can bring it to life in a networked context (a context that may have a far longer life, and scope, than just the life or scope of the actual project).

For many projects that don’t go past a pilot, it may well be that the real value of the project is the blogged documentation of things turned up during the process, rather than any of the formal outputs… Maybe..?!;-)

PS in passing, Google Webmaster tools now lets you track search stats around articles Google associates you with as an author: Clicks and impressions for authors. It’s been some time since I looked at Google Webmaster tools, but as Ouseful.info is registered there, I thought I’d check my broken links…and realised just how many pages get logged by Google as containing broken links when a single post erroneously contains a relative link… (i.e. when the <a href=’ doesn’t start with http://)

PPS Related to the above is a nice example of why I think being able to read and write URLs is an important skill, something Jon Udell also picks up on in Forgotten knowledge. In the above case, I needed to unpick the WordPress Stats API documentation a little to work out how to put the URLs together (something that a knowledge of how to read and write URLs helped me with). In Jon Udell’s case, the example was of how a conference organiser was able to send a customised URL to the conference hotel that embedded the relevant booking dates.

But I wonder, in an age where folk use Google+search term (e.g. typing Facebook into Google) rather than URLs (eg typing facebook.com into a browser location bar), a behaviour that can surely only be compounded by the fusion of location and search bars in browsers such as Google Chrome, is “URL literacy” becoming even more of a niche skill, rather than becoming more widespread? Is there some corollary here to the world of phones and addressbooks? I don’t need to remember phone numbers any more (I don’t even necessarily recognise them) because my contacts lists masks the number with the name of the person it corresponds to. How many kids are going to lose out on a basic education in map reading because there’s no longer a need to learn route planning or map-based navigation – GPS, SatNav and online journey planners now do that for us… And does this distancing from base skills and low level technologies extend further? Into the kitchen, maybe? Who needs ingredients when you have ready meals (and yes, frozen croissants and gourmet meals from the farm shop do count as ready meals;-), for example? Who needs to actually use a cookery book (or really engage with a lecture) when you can watch a TV chef, (or TED Talks)..?

Who Do The Science Literati Listen to on Twitter?

I really shouldn’t have got distracted by this today, but I did; via Owen Stephens: seen altmetric – tracking social media & other mentions of academic papers (by @stew)?

Monthly Altmetric data downloads of tweets containing mentions of published articles are available for download from Buzzdata, so I grabbed the September dataset, pulled out the names of folk sending the tweets, and how many paper mentioning tweets they had sent from the Unix command line:

cut -d ',' -f 3 twitterDataset0911_v1.csv | sort | uniq -c | sort -k1 -rn > tmp.txt

Read this list into a script, pulled out the folk who had sent 10 or more paper mentioning updates, grabbed their Twitter friends lists and plotted a graph using Gephi to see how they connected (nodes are coloured according to a loose grouping and sized according to eigenvector centrality):

Connections between altmetric Sept. tweeps w/ 10+ updates

My handle for this view is that it shows who’s influential in the social media (Twitter) domain of discourse relating to the scientific topic areas covered by the Altmetric tweet collection I downloaded. To be included in the graph, you need to have posted 10 or more tweets referring to one or more scientific papers in the collection period.

We can get a different sort of view over trusted accounts in the scientific domain by graphing the network of all the friends of (that is, people followed by) the people who sent 10 or more paper referencing tweets in September, as collected by altmetric, edges going from altmetric tweeps to all their friends. This is a big graph, so if we limit it to show folk followed by 100 or more of the folk who sent paper mentioning tweets and display those accounts, this is what we get:

Folk followed by 100 or more of the altmetricSept >9 updates folk

My reading of this one is that it shows folk who are widely trusted by folk who post regular updates about scientific papers in particular subject areas.

Hmmm… now, I wonder: what else might I be able to do with the Altmetric data???

PS Okay – after some blank “looks”, here’s the method for the first graph:

1) get the September list of tweets from Buzzdata that contain a link to a scientific paper (as determined by Altmetric filters);
2) extract the Twitter screen names of the people who sent those tweets.
3) count how many different tweets were sent by each screen name.
4) extract the list of screen-names that sent 10 or more of the tweets that Altmetric collected. This list is a list of people who sent 10 or more tweets containing links to academic papers. Let’s call it ‘the September10+ list’.
5) for each person on the September10+ list, grab the list of people they follow on Twitter.
6) plot the graph of social connections between people on the September10+ list.
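In code terms, steps 4–6 might look something like the following sketch, picking up from the tmp.txt counts file generated by the command line earlier (the get_friends() helper is a placeholder for whatever Twitter API client you have to hand):

import csv

# Placeholder helper: wrap whatever Twitter client you use so it returns a list of screen names
def get_friends(screen_name):
    raise NotImplementedError("call the Twitter friends API for screen_name here")

# Step 4: parse the `uniq -c` output (count, screen_name) and keep the folk
# who sent 10 or more paper-mentioning tweets - the "September10+ list"
september10plus = []
with open("tmp.txt") as f:
    for line in f:
        count, screen_name = line.split(None, 1)
        if int(count) >= 10:
            september10plus.append(screen_name.strip())

# Steps 5-6: grab each person's friends and write out the within-list connections
# as a Source,Target edge list that Gephi will import directly
with open("september10plus_edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])
    for person in september10plus:
        for friend in get_friends(person):
            if friend in september10plus:
                writer.writerow([person, friend])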

Okay? Got that?

Here’s how the second graphic was generated.

a) take the September10+ list and for each person on it, get the list of all their friends on Twitter. (This is the same as step 5 above).
b) Build a graph as follows: for each person on the September10+ list, add a link from them to each person they follow on Twitter. This is a big graph. (The graph in 6 above only shows links between people on the September10+ list.)
c) I was a little disingenuous in the description in the body of this post… I now filter the graph to only show nodes with a degree of 100 or more. For folk who are on the September10+ list, this means that the number of September10+ list members who follow them, plus the total number of people they themselves follow, is equal to or greater than 100. For folk not on the September10+ list, this means that they are being followed by at least 100 or so people on the September10+ list (I guess there could be folk followed by more than 100 people on the September10+ list who don’t appear in the graph if, for example, they were followed by folk in the original graph who predominantly had a degree of less than 100, and the filter was applied after those folk had been dropped?).
d) to plot the word cloud graphic above, I visualise the filtered graph and then hide the nodes whose in-degree is 0 (that is, they aren’t followed by anyone else in the graph).
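And in networkx terms (rather than Gephi), steps (a)–(d) amount to something like this sketch, reusing the september10plus list and the placeholder get_friends() helper from the sketch above:

import networkx as nx

# (a)-(b): build the full follower -> friend graph, with an edge from each
# September10+ list member to every person they follow
G = nx.DiGraph()
for person in september10plus:
    for friend in get_friends(person):
        G.add_edge(person, friend)

# (c): keep only nodes whose total degree is 100 or more, applied in a single pass
H = G.subgraph([n for n, d in G.degree() if d >= 100]).copy()

# (d): for the plotted view, also drop nodes that nobody remaining in the graph follows (in-degree 0)
H.remove_nodes_from([n for n, d in H.in_degree() if d == 0])

print(H.number_of_nodes(), "nodes and", H.number_of_edges(), "edges left after filtering")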

Got that? Simples… ;-)