One of the many things I vaguely remember studying from my school maths days is the various geometric transformations – rotations, translations and reflections – as applied particularly to 2D shapes. To a certain extent, knowledge of these operations helps me use the limited Insert Shape options in Powerpoint, as I pick shapes and arrows from the palette available and then rotate and reflect them to get the orientation I require.
But of more pressing concern to me on a daily basis is the need to engage in data transformations, whether summary statistic transformations (finding the median or mean values within several groups of the same dataset, for example, or calculating percentage differences from within-group means across group members for multiple groups) or shape transformations (reshaping a dataset from a wide to a long format, for example, melting a subset of columns or recasting a molten dataset into a wider format). (If that means nothing to you, I’m not surprised. But if you’ve ever worked with a dataset and copied and pasted data from multiple columns into multiple rows to get it to look right/into the shape you want, you’ve suffered by not knowing how to reshape your dataset!)
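To make that a little more concrete, here’s a minimal sketch of the sorts of transformation I mean, using Python’s pandas library on a made-up table of test scores (the column names and values are purely illustrative assumptions on my part, not any particular dataset):

```python
import pandas as pd

# A small, made-up wide-format dataset: one row per student,
# one column per test score.
df = pd.DataFrame({
    "student": ["alice", "bob", "carol", "dan"],
    "group":   ["A", "A", "B", "B"],
    "test1":   [62, 71, 55, 80],
    "test2":   [68, 75, 60, 78],
})

# Summary statistic transformation: mean and median test1 score per group.
summary = df.groupby("group")["test1"].agg(["mean", "median"])
print(summary)

# Percentage difference of each student's test1 score from their group mean.
group_mean = df.groupby("group")["test1"].transform("mean")
df["test1_pct_diff"] = 100 * (df["test1"] - group_mean) / group_mean
print(df)

# Shape transformation: melt the wide table into a long one
# (one row per student per test)...
long_df = df.melt(id_vars=["student", "group"],
                  value_vars=["test1", "test2"],
                  var_name="test", value_name="score")
print(long_df)

# ...and cast (pivot) the molten table back into a wide format.
wide_df = long_df.pivot_table(index=["student", "group"],
                              columns="test", values="score").reset_index()
print(wide_df)
```

The melt/pivot pairing at the end is the wide-to-long-and-back-again reshaping I mean, and it is exactly the sort of thing people end up doing by hand with copy and paste if they don’t know the operation has a name.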
Even though I tinker with data most days, I tend to avoid all but the simplest statistics. I know enough to know I don’t understand most statistical arcana, but I suspect there are folk who do know how to do that stuff properly. But what I do know from my own tinkering is that before I can run even the simplest stats, I often have to do a lot of work getting original datasets into a state where I can actually start to work with them.
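By way of illustration, the pre-work I have in mind often looks something like the rough pandas sketch below (the messy columns, stray currency symbols and “n/a” values are all invented for the example):

```python
import pandas as pd

# An invented example of the sort of mess that turns up in raw data files:
# stray whitespace, currency symbols, thousands separators, blank cells.
raw = pd.DataFrame({
    "area":   [" Milton Keynes", "Luton ", "Bedford", ""],
    "budget": ["£1,200", "£950", "n/a", "£1,050"],
})

# Tidy up whitespace and drop rows with no area name.
raw["area"] = raw["area"].str.strip()
raw = raw[raw["area"] != ""]

# Strip the currency symbol and thousands separator, treat "n/a" as missing,
# and coerce the column to a proper numeric type.
raw["budget"] = (raw["budget"]
                 .str.replace("£", "", regex=False)
                 .str.replace(",", "", regex=False)
                 .replace("n/a", None))
raw["budget"] = pd.to_numeric(raw["budget"], errors="coerce")

print(raw)
```

Nothing statistically interesting happens in that snippet at all, which is rather the point: it’s the unglamorous work that has to happen before even the simplest stats can be run.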
The same stumbling blocks presumably present themselves to the data scientists and statisticians who not only know how to drive arcane statistical tests but also understand how to interpret and caveat them. Which is where tools like OpenRefine come in…
Further down the pipeline are the policy makers and decision makers who use data to inform their policies and decisions. I don’t see why these people should need to be able to write a regexp, clean a dirty dataset, denormalise a table, write a SQL query, run a weird form of multivariate analysis, or reshape a dataset and then create a novel data visualisation from it based on a good understanding of the principles of The Grammar of Graphics; but I do think they should be able to pick up on the stories contained within the data and critique the way it is presented, as well as how the data was sourced and the operations applied to it during analysis, in addition to knowing how to sensibly make use of the data as part of the decision making or policy making process.
A recent Nesta report (July 2015) on Analytic Britain: Securing the right skills for the data-driven economy [PDF] gave a shiny “analytics this, analytics that” hype view of something or other (I got distracted by the analytics everything overtone), and was thankfully complemented by a more interesting Universities UK report (July 2015) on Making the most of data: Data skills training in English universities [PDF].
In its opening summary, the UUK report found that “[t]he data skills shortage is not simply characterised by a lack of recruits with the right technical skills, but rather by a lack of recruits with the right combination of skills”, and also claimed that “[m]any undergraduate degree programmes teach the basic technical skills needed to understand and analyse data”. Undergrads may learn basic stats, but I wonder how many of them are comfortable with the hand tools of data wrangling that you need to be familiar with if you ever want to turn real data into something you can actually work with? That said, the report does give a useful review of data skills developed across a range of university subject areas.
(Both reports championed the OU-led Urban Data School, though I have to admit I can’t find any resources associated with that project? Perhaps the OU’s Smart Cities MOOC on FutureLearn is related to it? As far as I know, OUr Learn to Code for Data Analysis MOOC isn’t?)
From my perspective, I think it’d be a start if folk learned:
- how to read simple charts;
- how to identify meaningful stories in charts;
- how to use data stories to inform decision making.
I also worry about the day-to-day practicalities of working with data in a hands on fashion and the roles associated with various data related tasks that fall along any portrayal of the data pipeline. For example, off the top of my head I think we can distinguish between things like:
- data technician roles – for example, reshaping and cleaning datasets;
- data engineering roles – managing storage, building and indexing databases, for example;
- data analyst/scientist and data storyteller roles – that is, statisticians who can work with clean and well organised datasets to pull out structures, trends and patterns from within them;
- data graphics/visualisation practitioners – who have the eye and the skills for developing visual ways of uncovering and relating the stories, trends, patterns and structures hidden in datasets, perhaps in support of the analyst, perhaps in support of the decision-making end-user;
- and data policymakers and data-driven decision makers, who can phrase questions in such a way that makes it possible to use data to inform the decision or policymaking process, even if they don’t have the skills to wrangle or analyse the data that they can then use.
I think there is also a role for data questionmasters who can phrase and implement useful and interesting queries that can be applied to datasets, which might also fall to the data technician. I also see a role for data technologists, who are perhaps strong as data technicians, but with an appreciation of the engineering, science, visualisation and decision/policy making elements, though not necessarily strong as practitioners in any of those camps.
(Data carpentry as a term is also useful, describing a role that covers many of the practical skills requirements I’d associate with a data technician, but that additionally supports the notion of “data craftsmanship”? A lot of data wrangling does come down to being a craft, I think, not least because the person working at the raw data end of the lifecycle may often develop specialist, hand crafted tools for working with the data that an analyst would not be able to justify spending the development time on.)
Here’s another carving of the data practitioner roles space, this time from Liz Lyon & Aaron Brenner (Bridging the Data Talent Gap: Positioning the iSchool as an Agent for Change, International Journal of Digital Curation, 10:1 (2015)).
The Royal Statistical Society Data Manifesto [PDF] (September 2014) argues for giving “[p]oliticians, policymakers and other professionals working in public services (such as regulators, teachers, doctors, etc.) … basic training in data handling and statistics to ensure they avoid making poor decisions which adversely affect citizens” and suggests that we need to “prepare for the data economy” by “skill[ing] up the nation”:
We need to train teachers from primary school through to university lecturers to encourage data literacy in young people from an early age. Basic data handling and quantitative skills should be an integral part of the taught curriculum across most A level subjects. … In particular, we should ensure that all students learn to handle and interpret real data using technology.
I like the sentiment of the RSS manifesto, but fear the Nesta buzzword hype chasing and the conservatism of the universities (even if the UUK report is relatively open minded).
On the one hand, we often denigrate the role of the technician, but I think technical difficulties associated with working with real data are often a real blocker; which means we either skill up ourselves, or recognise the need for skilled data technicians. On the other, I think there is a danger of hyping “analytics this” and “data science that” – even if only as part of debunking it – because it leads us away from the more substantive point that analytics this, data science that is actually about getting numbers into a form that tells stories that we can use to inform decisions and policies. And that’s more about understanding patterns and structures, as well as critiquing data collection and analysis methods, than it is about being a data technician, engineer, analyst, geek, techie or quant.
Which is to say – if we need to develop data literacy, what does that really mean for the majority?
PS Heh heh – Kin Lane captures further life at the grotty end of the data lifecycle: Being a Data Janitor and Cleaning Up Data Portability Vomit.