OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Click-Scheduled Forum Posts


Course writing always seems to take me forever for a wide variety of reasons.

The first is the learning. You’re presumably familiar with the old saw about the teacher being one page ahead…? That’s the teacher as expert learner modelling the learning process, the teacher “teaching on” something they’ve just learned; the teacher-as-learner experiencing just where the stumbling blocks are, or noticing afresh the really big idea… That’s me, learning on the one hand, and on the other trying to justify with footnotes and references the things I’ve learned by experience and that just feel right!

The second thing is the process of course production, the tooling used to support it, and the things we could try out. I’ve spent a lot of time thinking about virtual machines and containers recently, because I think they could be good for the OU and good for the School of Data. It doesn’t surprise me that Coursera courses are using virtual machines, and that Futurelearn isn’t. I think there’s interesting stuff – and not a few business opportunities – in looking at ways of supporting the organisational and end-user creation of VMs, particularly in education, where you’re regularly faced with having to find tech solutions – and support – for cohorts of anything between 15 and 1,500 (or, with the MOOCs, 15,000). I’ve also spent a lot of time pondering IPython Notebooks – and need to spend more time doing so: literate programming, conversations with data, end user application development, task based computing, and a comparison with the attractiveness of spreadsheets are all in the mix.

The third thing is time spent keeping a learning diary of what’s been going on in the course production process. I haven’t done this, this time, and I regret it (“no time to blog” because of “deadlines”; the course goes to students in 15J (October 2015 – sic, i.e. next year), so the pressure is on to make the deadline (I won’t) for a full first draft handover by tomorrow). So f*****g that for a game of soldiers, I am taking an hour out and writing up a thought…

In particular, this one…

I’m struggling (again) with ways of trying to encourage sharing and discussion amongst the students. A default way of doing this is to have a call out (a “call to action”) from online teaching materials into a forum or forum thread. You know the sort of thing: read this, play with that, share your findings in the forum. Only hopefully a bit more engaging than that.

The problem is, if you are going to link out to a specific thread from the course materials, you need to seed the forums. Which means that if you have a lot of callouts, the forums can start to get cluttered with stub posts, overloading a nascent forum with content that is irrelevant at that point in time.


One way around this is to schedule posts to appear in the forums at around the time you expect students to reach them. This can make hard linking difficult, unless you can publish a post, get the link, unpublish it and schedule it, and then hope that when it does get re-released the link is the same. (If the URL is minted against a post ID, this should work?) A downside of this approach is that if a student clicks on a forum call out link and the post hasn’t yet been republished according to its scheduled date, the link will be broken.

Reflecting on the way wikis work, where you can create a link to a wiki page that doesn’t exist yet, and that page is created when it’s clicked on for the first time, I started to wonder about a similar mechanism for links to forum based social activities. So, for example, I create a forum post with a scheduled publication date (the date on which the post will be published if it hasn’t been already) and check the clickpub box. I’m presented with a URL for the post that is guaranteed to be the URL the post is given when it does get published.

In my course materials, I paste the link.

If no-one ever clicks the link to that forum post in the course materials, the post is published in the forum on the scheduled date. The post should contain a description of the activity and a reference back to the activity in the course materials, as well as acting as a stub for a discussion around the activity, or for sharing of social objects associated with the activity. In this mode, the post acts as a call-to-action from the forum to the course materials, supporting the pacing of the course.

Some students, however, like to get ahead. So if they click on the link before the scheduled date, they need to see the post somehow.

The first way to achieve this is to use the link in the course materials a bit like a new wiki page link: if a student clicks on a link to a post before the post is scheduled to be published, the click sends a hurry-up, clickpub message that fires the publication of the post. This actually signals two things: to the course team, that someone is that far ahead in working through the course materials; and to the rest of the cohort, that somebody has got that far.

(Note that we need to defend against link checkers (human or machine) that might be operating in the VLE accidentally triggering a clickpub event!)

Problems may arise in the case of the student who tries to do the whole 30 week course in the first 10 days after it is opened up. (Unless such students are anti-social and don’t post to such forums, in part because they know it’s unlikely that anyone else will be as far on and keen to discuss the topic. That said, even posting with no hope of reply is often beneficial in the way it forces a little bit of reflective thinking at least.)

To guard against posts being published too early this way, we could try a more refined strategy in which a social activity thread is only viewable by students who click on the link to it before the scheduled release date, but is then released openly to the forum at its scheduled time.

We could balance this further with the proviso that if more than x% of the cohort have accessed the thread, its scheduled release date is brought forward to that time. In this way, we can start to use the social activity posts as one way of trying to keep the cohort together, for example in cases where the majority of the cohort is working through the course faster than expected.
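
By way of illustration, here’s a minimal, self-contained sketch in Python of the sort of decision logic I have in mind; the class, the method names and the x% threshold are made up for the purposes of the example, not any real forum or VLE API.

    from datetime import datetime

    EARLY_RELEASE_FRACTION = 0.25   # the "x%" threshold; purely illustrative


    class ClickPubPost(object):
        """A scheduled forum post that can also be released early by student clicks."""

        def __init__(self, scheduled_date, cohort_size):
            self.scheduled_date = scheduled_date
            self.cohort_size = cohort_size
            self.published = False          # visible to the whole forum yet?
            self.early_viewers = set()      # students who clicked before release

        def handle_click(self, student_id, is_link_checker=False, now=None):
            """Resolve a course materials link to this post."""
            now = now or datetime.utcnow()
            if self.published:
                return "show post"
            if is_link_checker:
                # link checkers (human or machine) must not trigger publication
                return "hide post"
            self.early_viewers.add(student_id)
            seen = len(self.early_viewers) / float(self.cohort_size)
            if now >= self.scheduled_date or seen >= EARLY_RELEASE_FRACTION:
                # scheduled date reached, or enough of the cohort has got here
                self.published = True
                return "show post"
            return "show post (early access; not yet released to the forum)"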

[Unfiled patents: 3,487; 3,488; 3,489 ...;-) - Unless of course these sorts of mechanism already exist? If so, please let me know where via the comments below:-)]

Written by Tony Hirst

June 3, 2014 at 9:57 am

Posted in OU2.0


Recreational Data: Data Golf


I’m still hopeful of working up the idea of recreational data as a popular pastime activity with a regular column somewhere and a stocking filler book each Christmas (?!;-), but haven’t had much time to commit to working up some great examples lately:-(

However, here’s a neat idea – data golf – as described in a post by Bogumił Kamiński (RGolf) that I found via RBloggers:

There are many code golf sites, even some support R. However, most of them are algorithm oriented. A true RGolf competition should involve transforming a source data frame to some target format data frame.

So the challenge today will be to write a shortest code in R that performs a required data transformation

An example is then given of a data reshaping/transformation problem based on a real data task (wrangling survey data, converting it from a long to a wide format in the smallest amount of R code).

Of course, R need not be the only language that can be used to play this game. For the course I’m currently writing, I think I’ll pitch data golf as a Python/pandas activity in the section on data shaping. OpenRefine also supports a certain number of reshaping transformations, so that’s another possible data golf course(?). As are spreadsheets. And so on…
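
For what it’s worth, here’s a minimal sketch of that sort of long-to-wide reshaping in pandas; the survey-ish data and the column names are invented purely for illustration.

    import pandas as pd

    # A toy "long format" survey table: one row per (respondent, question) pair
    long_df = pd.DataFrame({
        "respondent": [1, 1, 2, 2],
        "question":   ["q1", "q2", "q1", "q2"],
        "answer":     ["yes", "no", "no", "yes"],
    })

    # Reshape to "wide format": one row per respondent, one column per question
    wide_df = long_df.pivot(index="respondent", columns="question", values="answer")
    print(wide_df)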

Hmmm… thinks… pivot table golf?

Also related: string parsing/transformation or partial string extraction using regular expressions; for example, Regex Tuesday, or how about Regex Crossword?

Written by Tony Hirst

May 23, 2014 at 10:27 am

Posted in Data, Rstats, School_Of_Data


Give the kids the power tools…


How do we trade off giving students ridiculously powerful applications to play with – guarding against the “dangers” of doing so, and accepting the requirement to spend time teaching the tool rather than the concepts – against just teaching them the concepts and hoping they’ll be able to see how to wield those concepts as they are manifest in and by the power tools?

“I … cannot disagree strongly enough with statements about the dangers of putting powerful tools in the hands of novices. Computer algebra, statistics, and graphics systems provide plenty of rope for novices to hang themselves and may even help to inhibit the learning of essential skills needed by researchers. The obvious problems caused by this situation do not justify blunting our tools, however. They require better education in the imaginative and disciplined use of these tools. And they call for more attention to the way powerful and sophisticated tools are presented to novice users.”
Leland Wilkinson, The Grammar of Graphics, Springer-Verlag, 1999, ISBN 0-387-98774-6, p15-16.

+1, as they say; +1.

Written by Tony Hirst

May 19, 2014 at 11:42 pm

Posted in Anything you want

Confused Again About VM Ecology… I Blame Not Blogging


Via a cc’d tweet from Martin Hawksey, this lovely post from Tom Smith/@everythingabili (who has the best ever twitter bio strapline) on How I Learn (And What I’m Learning).

I like to think that I used to write blog posts that had the same sort of sense as that post…

…but for the last few months at least, I don’t think I have.

“Working” for once – starting production on an OU course (TM351, due out October 2015 (sic; I’m gonna be late on the first draft of the 7 weeks of the course I’m responsible for: it’s due in a little over a fortnight…)), and also helping out on an investigative project the School of Data is partnering on – has meant that most of the learnings and failings that I used to blog about have been turned inward to emails (which are even more poorly structured in terms of note-taking than this blog is), if at all.

Which is a shame and makes me not happy.

Reading through completed academic papers, making useful (I hope) use of them in the course draft, has been quite fun in part – and made me regret at times not writing up work of my own in a self-contained, peer reviewed way over the last decade or so; getting stuff “into the record”, properly citable, and coherent enough to actually be cited. But as you pick away at the papers, you know they’re a revisionist telling, a straightforward narrative of how the pieces fit together in which nothing went wrong along the way. (You also know that the closer you get to trying to replicate a paper, the harder it is to find the missing pieces – the process, trick, equation or insight – that actually make it work. Remember school maths, or physics, and the textbook that goes from one equation to the next with a “hence”, but there’s no way in hell you can figure out how to make that step, and you know you’ll be stuck when that bit comes up in the exam…?! That. Or worse: when you bang your head against a wall trying to get something to work, contort your mental models to make it work, sort of, and then see the errata list correcting that thing. That, too.)

On the other hand, this blog is not coherent, shapes no whole, but is full of hence steps. Disjointed ones, admittedly. But they keep track of all the bits I tried and failed at and worked around, and they keep on getting me out of holes… Like the emails won’t. Lost. Wasted effort, because the learning fumblings that are OUseful learning fumblings are lost and locked up in email hell.

It makes me very not happy.

So that, by way of intro, to this: a quick catchup follow-up to Cursory Thoughts on Virtual Machines in Distance Education Courses and Doodling With IPython Notebooks for Education, a partial remembering of the various shades of hell associated with them and trying to share them.

Here’s what I think I now want to do (whether or not it’s the right thing I’m not sure).

  • generate a script that will build a VM. We’ve opted for VirtualBox as the VM container. The VM will need to contain: pandas; IPython notebook (the course team want it to run Python 3.3. I’ve lost track of how many hours I’ve spent trying and failing to get Python libraries I think we need to run under Python 3.3; wasted effort; I should have settled on Python 2.7 and then revisited 3.3 a few months hence; the 2.7-to-3.3 tweaks to any code we write for the course should be manageable in migration terms. Pratting around on libraries that I’m guessing will get patched as native distributions move to 3.3 by default, but that don’t work yet, is wasted effort. October. 2015. First presentation.); PostgreSQL (perhaps with some extensions); mongodb; ipythonblocks; blockdiag; I came across shellinabox today and wonder if we should run that; OpenRefine (CT against this – I think it’s good for developing mental models); python-nvd3; folium; a ggplot port to Python (CT take – too much new stuff; my take – we should go as high up the stack as we can in terms of the grammar of the calling functions); I think we should run R and RStudio too, to make for a really useful VM, making the point that the language doesn’t matter and we use whatever’s appropriate to get things done, but I don’t think anyone else does. if. Which computer language is that from then? for. Or that one? How about in? print? Cars are for driving. Mine’s blue. I have no idea how it works. Can I drive your car? The red one. With the left-hand drive.
  • access the services running on the headless VM via a browser on the host. I think we should talk to the databases using Python, but I may be in the minority. We may need something more graphical to talk to PostgreSQL. If we do, I would argue it should be a browser based client – if it’s an app, we’re moving functionality provision outside of the VM.
  • use the script to build machines with the same package config; CT seem to prefer a single build on a 32 bit machine. I think we should support 64 bit as well. And deployment on at least one cloud service – I’d go for Amazon, but that’s mainly because it’s the only one I’ve tried. If we could support more, even better.
  • as far as maintenance goes, I wrote the vagrant script to update libraries whenever the provisioner is run (which is quite a lot at the mo, as I keep finding new things to add to the box!;-) This may or may not be sensible for student use. If there is a bug in a library, an update could help. If there is a security patch to the kernel, we should be updating as good citizens. The current plan is to ship a built box (which I think would have to go onto a USB stick – we can’t rely on folk having computers with a DVD drive any more, and a 1.5GB download seems to be proving unreliable without a proper download manager). As it is, students will have to download VirtualBox and vagrant, and install those themselves. (Unless we can put installers for them on a USB stick too.) If we do ship a built box, we need to think of some way of kickstarting the services and perhaps rebooting the machine (and then kickstarting the services). There is a separate question of whether we should also be able to update config scripts during presentation. This would presumably have to be done on the host. One way might be to put the config scripts on a git server and then use git to keep the config scripts on the students’ host machines up to date, but that would probably also require them to install a git command-line tool, even if we automated the rest. Did I say this all has to run cross-platform? Students may be on Windows (likely?), Mac or Linux. I think the course should be doable, if painfully, via a tablet, which means the VM needs the cloud hosted option. If we could also find a way to help students configure their whatever-platform host so that they could access services from the VM running on it via their tablet, so much the better.
  • files need to be shared between the VM and the host. This raises an interesting issue for a cloud hosted VM. We either need to find a way to synch files between the desktop and the cloud VM, persist state on the cloud host so that the VM can synch to it, or pop Dropbox into the cloud VM (though there would then be a synch delay, as there would be with a desktop synch). I favour persisting on the cloud, though there is then the question of the student who is working on a home machine one day and a cloud machine the next.
  • Starting and stopping services: students need to be able to start and stop services running on the VM without having to ssh in. One click easy. A dashboard with buttons that show whether a service is running or not; click a button to toggle the run state of the service. No idea how to do this.
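
As a straw man, something along these lines might do as a starting point: a script run on the host that checks whether anything is listening on each forwarded port and, if not, shells out to vagrant’s ssh wrapper to try a restart. The service names and port numbers are illustrative guesses, not the actual course configuration.

    import socket
    import subprocess

    # Illustrative guesses at service names and forwarded ports - not the course config
    SERVICES = {"ipython-notebook": 8888, "postgresql": 5432, "mongodb": 27017}

    def port_open(port, host="127.0.0.1", timeout=1.0):
        """Return True if something is listening on host:port."""
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except (socket.timeout, socket.error):
            return False

    def restart(service):
        """Restart a service inside the VM via vagrant's ssh wrapper (run from the Vagrantfile directory)."""
        subprocess.call(["vagrant", "ssh", "-c", "sudo service %s restart" % service])

    if __name__ == "__main__":
        for service, port in sorted(SERVICES.items()):
            if port_open(port):
                print("%s (port %s): running" % (service, port))
            else:
                print("%s (port %s): not running - trying a restart" % (service, port))
                restart(service)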

Here’s the approach I’ve taken:

  • reusing DataminerUK’s infinite-interns model as a starting point, I’m using vagrant to build and provision a VM using puppet. At the moment I have duplicate setups for two different Linux base boxes (precise64 and raring64; the plan is to move to the latest Ubuntu LTS). I really should have a single setup, with the different machine setups called by name from a single Vagrantfile. I think.
  • The puppet provisioner builds the box from a minimal base and starts the services running. It’s aggressive on updates. The precise64 box is running Python 2.7 and the raring64 box 3.3. Getting pip3 running in the raring box was a pain, and I don’t know how to tell puppet to use the pip3 I eventually installed when it does its updates. At the moment I fudge it with:
    exec { "pip3-update":
        command => "/usr/local/bin/pip3 install --upgrade pip"
    }

    but it breaks because I’m not convinced that is always the right path (I’d like to hedge on /usr/bin:/usr/local/bin), or that pip3 is actually installed when I try to exec it… I think what I really want to do is something like
    package {
        [
            JUST UPGRADE YOURSELF, PLEASE
        ]: ensure   => latest,
           provider => 'pip3';
    }

    with an additional dependency check (=>) that pip3 has been installed first, and a dependency from all the other pip3 installs that pip3 has been upgraded first.
  • The IPython notebook is started by a config shell script called from puppet. I think I’m also using a config script to set up a user and test tables in Postgres (though I plan to move to the puppet extension as soon as I can get round to reading the docs/finding out how to do it).
  • There are times when a server needs restarting. At the moment I have to run vagrant provision to do that – or even vagrant halt; vagrant up – which means it also runs the updates. It’d be nice to just be able to run the bits that restart the services (the DBMSs, IPython notebook etc.) without doing any of the package updates, installs, checks etc.
  • We need a tool to check whether services are running on the expected ports, to help with debugging without requiring the user to SSH into the VM; I’ve also fixed on default ports. We really need to fall back to a free port if a default port is already in use, and then somehow tell the IPython notebook scripts which port each service is running on (a minimal sketch of picking a free port follows this list). With vagrant letting you run a VM from within a particular directory, it would also be useful to be able to know which VMs are being run, and from where, wherever vagrant on the host started them.
  • I don’t use a configurator for the postgres db (it needs seeding with some example tables) but should do – on my to do list is to look at https://github.com/puppetlabs/puppetlabs-postgresql . Similarly for mongo db – and perhaps https://github.com/puppetlabs/puppetlabs-mongodb
  • To make use of python-nvd3, the suggested route is to use bower. I got the npm package manager to work, but have failed to find a way of installing any actual packages [issue].
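
On the port clash point, here’s a minimal sketch (nothing to do with vagrant’s own port handling) of how a launcher script on the host might fall back to a free port when a default one is already taken, so that the chosen port can then be passed on to whatever needs to know about it:

    import socket

    def choose_port(preferred):
        """Return `preferred` if it is free, otherwise an OS-assigned free port."""
        for candidate in (preferred, 0):        # port 0 asks the OS for any free port
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                sock.bind(("127.0.0.1", candidate))
                return sock.getsockname()[1]
            except socket.error:
                continue
            finally:
                sock.close()

    print(choose_port(8888))   # 8888 if it is free, otherwise some other free port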

Handling issues to date – aside from things like port clashes and all manner of f**k ups because I distributed a README with an error in it and folk tried to follow it rather than the patches posted elsewhere – has been impeded by not having a good way of logging and sharing errors. OU network issues have also contributed to the fun. I always avoid the OU staff network, but nothing seems to work on it. I suspect this is a proxy issue, but I’m unwilling to invest any time exploring it or working out how to config the VM to cope (no-one else has offered to go down this rat hole). Poxy proxies could well be an issue for some students, but I’m not sure what the best way forward is. Cloud hosted VMs?!

I also had a problem on OU eduroam – the mongodb install wants to fetch a signing key from a keyserver before it will install, but OU eduroam (the network I always use) seems to block the keyserver. Tracking that down wasted a good hour.

Here are some other things I’ve heard about:

- https://github.com/psychemedia/notebookcloud This is cloned from https://github.com/carlsmith, who appears to have taken his repo – and the app – down? It provided a dashboard for firing up notebook servers on the Amazon cloud. If I hadn’t been ‘working’ I’d have blogged screenshots and the workflow. As it is, all I have are vague memories of how it worked and what it did, and the ideas that sprang off having an artefact to talk around. [Hmm... the app seems to have come back up - maybe I caught it at a bad time... https://notebookcloud.appspot.com/login ]

- provisioning things: chef, vagrant, puppet, docker.

What should I be using for what?

I thought about different VMs for different services, but that adds too much VM weight and requires networking between the VMs, which could lead to “support issues”. Would docker help here? A base VM built from vagrant and puppet, then docker to put additional machines on top.

What I want is for students to be able to:

- install a minimum number of third party apps on their desktop (currently VirtualBox and vagrant)
- click one button and get their VM. My feeling is we should have a largely prebuilt box on a USB stick they can use as a base box, then do a top-up build and provision. I suspect CT would like one click somewhere to fire up a machine, get services running, and open a tab to the IPython notebook in their browser; maybe a status button somewhere, a single button to restart any services that have stopped, and options to suspend or shut down machines. In terms of student workflow, I think suspending and resuming machines (if services can resume appropriately) would be the neatest workflow. Note: the course runs over 9 months…
- be able to access files on the host that are used in the VM. If they are using multiple VMs (eg on cloud and desktop), find a sensible way of synching notebooks and data/database state across those machines; which could be tricky, at least as far as database state goes.
- if a student is not using postgresql or mongo – and they won’t for the first 8 weeks of the course – it could be useful to run the VM without those services running (perhaps aside from a quick self-test in week 1, so we can check out any issues as early as possible and give ourselves a week or two to come up with any patches before those apps are used in anger). So maybe a control panel to fire up the VM and the services you want to run. Yes mongo, no postgresql. No DB at all. And so on. Would docker help here? Decide on the host somehow which services to run, fire up the VM, then load in and start up the required services. Next session, change which services are running in the VM?

All in all, I reckon I’m between 20 and 40% there (further along in terms of time?) but not sure how much further I can push it to the desired level of robustness without learning how to do this stuff a bit more properly… I’m also not really sure what’s practically and reliably possible and what’s best for what. We have to maximise the robustness of stuff ‘just working’ and minimise support issues because students are at a distance and we don’t know what platform they’re on. I think if I started from scratch and rewrote the scripts now they’d be a lot clearer, but I know that’d take half a day and the functional return – for now – I think would be minimal.

That said, I’ve done a fair amount of learning, though large chunks of it have been lost to email and not blogging. Which is a shame. Must do better. Must take public notes.

Written by Tony Hirst

May 15, 2014 at 11:37 pm

Posted in Anything you want


Innovation’s End

In that oft-referred-to work on innovation, The Innovator’s Dilemma, Clayton Christensen suggested that old guard companies struggle to innovate internally because of the value networks they have built up around their current business models. Upstart companies compete around the edges, providing cheaper but lower quality alternative offerings that allow the old guard to retreat to the higher value, higher quality products. As the upstarts improve their offerings, they grow market share and start to compete for the higher value customers. The upstarts also develop their own value networks, which may be better adapted to an emerging new economy than the old guard’s network.

I don’t know if this model is still in favour, or whether it has been debunked by a more recent business author with an alternative story to sell, but in its original form it was a compelling tale, easily co-opted and reused, as I have done here. I imagine over the years, the model has evolved and become more refined, perhaps offering ever more upmarket consultancy opportunities to Christensen and his acolytes.

The theory was also one of the things I grasped at this evening to try to help get my head round why the great opportunities for creative play around the technologies being developed by companies such as Google, Amazon and Yahoo five or so years ago don’t seem to be there any more. (See for example this post mourning the loss of a playful web.)

The following screenshots – taken from Data Scraping Wikipedia with Google Spreadsheets – show how the original version of Google spreadsheets used to allow you to generate different file output formats, with their own URL, from a particular sheet in a Google spreadsheet:

[Screenshots: publish as web page / publish formats / publish options / more publish options]

In the new Google spreadsheets, this is what you’re now offered from the Publish to Web options:

[Screenshot: the new Google spreadsheets share/publish options]

[A glimmer of hope - there's still a route to CSV URLs in the new Google spreadsheets. But the big question is - will the Google Query language still work with the new Google spreadsheets?]
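
For the record, the sort of route I mean: if a sheet still has a CSV export URL, a couple of lines of pandas pulls it straight into a dataframe. The URL pattern below is the one that seems to work for new-style spreadsheets at the time of writing, with a made-up key; no guarantees it will keep working.

    import pandas as pd

    # Made-up spreadsheet key; the export URL pattern may well change under us
    key = "SPREADSHEET_KEY_GOES_HERE"
    url = "https://docs.google.com/spreadsheets/d/{0}/export?format=csv&gid=0".format(key)

    df = pd.read_csv(url)
    print(df.head())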

embed changes everything

(For some reason, WordPress won’t let me put angle brackets round that phrase. #ffs)

That’s what I said in this video put together for a presentation to a bunch of publishers visiting the OU Library at an event I couldn’t be at in person (back when I used to be invited to give presentations at events…)

I saw embed as a way that the publishers could retain control over content whilst still allowing people to access the content, and make it accessible, in ways that the publishers wouldn’t have thought of.

Where content could be syndicated but remain under control of the publisher, the idea was that new value networks could spring up around legacy content, and the publishers could then find a way to take a cut. (Publishers don’t see it that way of course – they want it all. However big the pie, they want all of it. If someone else finds a way to make the pie bigger, that’s not interesting. My pie. All mine. My value network, not yours, even if yours feeds mine. Because it’s mine. All mine.)

I used to build things around Amazon’s API, and Yahoo’s APIs, and Google APIs, and Twitter’s API. As those companies innovated, they built bare bones services that they let others play with. Against the established value network order of SOAP and enterprise service models, the RESTful upstarts played with their toys. And the upstarts let us play with their toys. And we did, because they were easy to play with.

But they’re not anymore. The upstarts started to build up their services, improve them, entrench them. And now they’re not something you can play with. The toys became enterprise warez and now you need professional tools to play with them. I used to hack around URLs and play with the result using a few lines of Javascript. Now I need credentials and heavyweight libraries, programming frameworks and tooling.

Christensen saw how the old guard, with their entrenched value networks couldn’t compete. The upstarts had value networks with playful edges and low hanging technological fruit we could pick up and play with. The old guard entrenched upwards, the upstarts upped their technology too, their value networks started to get real monetary value baked in, grown up services, ffs stop playing with our edges and bending our branches looking for low hanging fruit, because there isn’t any more. Go away and play somewhere else.

Go away and play somewhere else.

Go somewhere else.

Lock (y)our content in, Google, lock it in. Go play with yourself. Your social network sucked and your search is getting ropey. You want to lock up content, well so does every other content generating site, which means you’re all gonna be faced with the problem of ranking content that all intranets face. And their searches universally suck.

The innovator’s dilemma presented incumbents with the problem of how to generate new products and business models that might threaten their current ones. The upstarts started scruffy and let people play alongside, let people innovate along with them. The upstarts moved upwards and locked out the innovation networks around them. Innovations end. Innovation’s end. Innovation send. Away.

< embed > changes everything. Only this time it’s gone the wrong way. I saw embed as a way for us to get their closed content. Now Google’s gone the other way – open data has become an embedded package.

“God help us.” Withnail.

PS Google – why did my, sorry, your Chrome browser ask for my contacts today? Why? #ffs, why?

Written by Tony Hirst

May 2, 2014 at 11:47 pm

Posted in Anything you want


Personal Recollections of the “Data Journalism” Phrase

@digiphile’s been doing some digging around the current popular usage of the phrase “data journalism” – here are my recollections…

My personal recollection of the current vogue is that “data driven journalism” was the phrase that dominated the discussions/community I was witness to around early 2009, though for some reason my blog doesn’t give any evidence for that (must take better contemporaneous notes of first noticings of evocative phrases;-). My route in was via “mashups”, mashup barcamps, and the like, where folk were experimenting with building services on newly emerging (and reverse engineered) APIs; things like crime mapping and CraigsList maps were in the air – putting stuff on maps was very popular I seem to recall! Yahoo were one of the big API providers at the time.

I noted the launch of the Guardian datablog and datastore in my personal blog/notebook here – http://blog.ouseful.info/2009/03/10/using-many-eyes-wikified-to-visualise-guardian-data-store-data-on-google-docs/ – though for some reason don’t appear to have linked to a launch post. With the arrival of the datastore it looked like there were to be “trusted” sources of data we could start to play with in a convenient way, accessed through Google docs APIs:-) Some notes on the trust thing here: http://blog.ouseful.info/2009/06/08/the-guardian-openplatform-datastore-just-a-toy-or-a-trusted-resource/

NESTA did an event on News Innovation London in July 2009, a review of which by @kevglobal mentions “discussions about data-driven journalism” (sic on the hyphen). I seem to recall that Journalism.co.uk (@JTownend) were also posting quite a few noticings around the use of data in the news at the time.

At some point, I did a lunchtime at the Guardian for their developers – there was a lot about Yahoo Pipes, I seem to remember! (I also remember pitching the Guardian Platform API to developers in the OU as a way of possibly getting fresh news content into courses. No-one got it…) I recall haranguing Simon Rogers on a regular basis about their lack of data normalisation (which I think in part led to the production of the Rosetta Stone spreadsheet) and their lack of use (at the time) of fusion tables. Twitter archives may turn something up there. Maybe Simon could go digging in the Twitter archives…?;-)

There was a session on related matters at the first(?) news:rewired event in early 2010 but I don’t recall the exact title of the session (I was in a session with Francis Irving/@frabcus from the then nascent Scraperwiki) http://blog.ouseful.info/2010/01/14/my-presentation-for-newsrewired-doing-the-data-mash/ Looking at the content of that presentation, it’s heavily dominated by notions of data flow; the data driven journalism (hence #ddj) phrase, seemed to fit this well.

Later that year, in the summer, there was a roundtable event on “data driven journalism” hosted by the EJC (European Journalism Centre) – I recall meeting Mirko Lorenz there (who maybe had a background in business data, and has since helped launch datawrapper.de) and Jonathan Gray – who then went on to help edit the Data Journalism Handbook – among others.

http://blog.ouseful.info/2010/08/25/my-slides-from-the-data-driven-journalism-round-table-ddj/

For me the focus at the time was very much on using technology to help flow data into useable content, (eg in a similar but perhaps slightly weaker sense than the more polished content generation services that Narrative Science/Automated Insights have since come to work on, or other data driven visualisations or what I guess we might term local information services; more about data driven applications with a weak local news/specific theme or issue general news relevance, perhaps). I don’t remember where the sense of the journalist was in all this – maybe as someone who would be able to take the flowed data, or use tools that were being developed to get the stories out of data with tech support?

My “data driven journalism” phrase notebook timeline

http://blog.ouseful.info/?s=%22data%20driven%20journalism%22&order=asc

My “data journalist” phrase notebook timeline

http://blog.ouseful.info/?s=%22data%20journalist%22&order=asc

My first blogged use of the data journalism phrase – in quotes, as it happens, so it must have been a relatively new sounding phrase to me – was here: http://blog.ouseful.info/2009/05/20/making-it-a-little-easier-to-use-google-spreadsheets-as-a-database-hopefully/ (h/t @paulbradshaw)

Seems like my first use of the “data journalist” phrase was in picking up on a job ad – so certainly the phrase was common to me by then.

http://blog.ouseful.info/2010/12/04/what-is-a-data-journalist/

As a practice and a commonplace, things still seemed to be developing in 2011 enough for me to comment on a situation where the Guardian and Telegraph teams were co-opetitively bootstrapping each other’s ideas: http://blog.ouseful.info/2011/09/13/data-journalists-engaging-in-co-innovation/

I guess the deeper history of CAR (computer-assisted reporting), database journalism and precision journalism may throw up trace references, though maybe not ones representing the situations that led to the phrase gaining traction in “popular” usage?

Certainly, I’m now wondering what the relative rise in popularity of “data journalist” versus “data journalism” was. For me, “data driven journalism” was a phrase I was familiar with way before the other two, though I do recall a sense of unease about its applicability to news stories that were perhaps “driven” by data more in the sense of being motivated or inspired by it, or whose origins lay in a data set, rather than “driven” in the live, active sense of someone using an interface powered by flowing data.

Written by Tony Hirst

April 29, 2014 at 9:36 am

Posted in Anything you want


Writing Diagrams – Boxes and Arrows

If you’ve ever had to draw “blocks and arrows” diagrams, you’ll know how irritating it can be if you spend hours laying out the diagram using a presentation editor or drawing tool, only to find you need to edit the drawing, add another box, and lay the whole thing out again.

Surely there must be a better way?

Let’s just think about what a box and arrow diagram is intended to show: when describing a process, the connections typically represent a flow from one thing to another; furthermore, the layout is often rectilinear, laid out along straight lines, the boxes tidily spaced and their edges lined up with each other. In a diagram such as a mindmap, different ideas or concepts are related to each other by drawing a line between them and the layout may be more fluid, with like or related concepts grouped together in space, or by the additional use of colour themes, for example.

The primary information contained in the diagram is in the text elements and the connections between them. The positioning on the page often reflects the structure of these connections. When we lay out a diagram, we unconsciously favour layouts that minimise the number of crossed lines (to keep the diagram looking “clean”), and group connected items close together (unless some other information requires us to separate them – for example, we might be using a timeline as the basis for the horizontal x-axis and placing boxes in areas of the canvas that are associated with a particular month).

The online Google Drawing document type is typical of drawing tools included in many office applications. As well as being able to draw boxes and connect registration points on each box by lines or arrows, a range of layout tools provides support for aligning and spacing boxes.


Tools such as popplet provide a friendlier environment for generating similar sorts of diagram:


Whilst drawing tools such as these allow you to craft your diagram by hand, building it up as you go along, actually putting previously collected information into blocks on the canvas, let alone connecting the blocks together and laying them out nicely, may be quite an involved and error prone affair.

In these circumstances, it may make more sense to take a raw representation of the block contents and a simple representation of connections between appropriate blocks and just write the relationships down, letting a drawing tool do the hard work of drawing the blocks, connecting them together and laying them out, at least in draft layout form. To provide for a final layer of customisation, it might also be useful to be able to take a vector/SVG representation of the automatically sketched layout into a drawing package where it can be tidied up by hand and the application of a human designer’s eye.

There are several online tools available that you can use to sketch box and arrow diagrams from simple text descriptions.

Text2mindmap

Text2Mindmap allows you to construct tree based mindmaps from a simple outline style description of the mindmap.


The layout has a radial basis. Designs can be saved and images downloaded as JPG or PDF files.

Diagrammr

Diagrammr allows you to draw simple graph based network structures in which text labelled block elements can be connected to other blocks by labelled edges.


Designs are given a persistent URL, but anyone with access to the URL can edit the diagram.

JS Sequence Diagrams

JS Sequence Diagrams is a javascript library for generating sequence diagrams from a textual description of them.


Diagrams can be saved as SVG files. The source code is available; it depends on Jison (for parsing the diagram descriptions) and on Raphaël as the graphics library.

blockdiag

blockdiag (application) uses a language similar to the DOT language for describing a range of block diagram types (block diagrams, sequence diagrams, activity diagrams, logical network diagrams).


Diagrams can be saved as SVG files and associated with a URL that contains all the information needed to recreate the diagram. As such, large diagrams are not supported if they make the URL too long. Source code is available.

Several diagram types are available using blockdiag, including graphviz diagrams constructed using the DOT language.

GraphvizFiddle

GraphvizFiddle is a fiddle-style application that lets you enter Graphviz DOT language (http://www.graphviz.org/content/dot-language) descriptions and preview the result.


Files can be generated in SVG format, or as textual definitions (for example, in the DOT layout language).

Summary

Generating boxes and arrows style diagrams can be a pain at times because the semantics of the diagram – how one item is related to another – are represented in a graphical rather than a data based form. By writing down the relations and then automatically generating visual representations of them, we retain access to the data representation whilst leaving the hard work of generating at least the initial draft of the layout to a machine.

Several tools are available to support the creation of such literally described box and arrow diagrams using a variety of description languages and generating a range of output image formats (SVG probably being the most useful if you need to edit the sketch diagram for yourself to tweak the layout for its final presentation). Code for some of the tools (JS Sequence diagrams, blockdiag) is available.

Arguably the most powerful tools allow you to “write” diagrams using the Graphviz DOT layout language. Whilst there is a certain overhead associated with learning this language, it does save time in the long run if you regularly need to create network style diagrams. Graphviz also supports a range of layout algorithms – see the Graphviz gallery for examples.
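
By way of a flavour of what “writing” a diagram can look like, here’s a minimal sketch that generates a DOT description of a simple flow from a list of relations written down as data, and saves it to a file that Graphviz can then lay out for you (for example with dot -Tsvg flow.dot -o flow.svg). The node labels are arbitrary examples.

    # Relations written down as data: (source, target) pairs
    edges = [
        ("collect data", "clean data"),
        ("clean data", "reshape data"),
        ("reshape data", "visualise"),
    ]

    lines = ["digraph flow {", "  rankdir=LR;", "  node [shape=box];"]
    for src, dst in edges:
        lines.append('  "{0}" -> "{1}";'.format(src, dst))
    lines.append("}")

    with open("flow.dot", "w") as f:
        f.write("\n".join(lines))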

PS If you want to write your own diagramming application, the JointJS library looks like a handy library to have on hand… The Venn.js library also looks quite pretty – if you have to generate Venn diagrams, that is!

Written by Tony Hirst

April 28, 2014 at 1:57 pm

Posted in Infoskills
