Despite believing we can treat anything we can represent in digital form as “data”, I’m still pretty flakey on understanding what sorts of analysis we can easily do with different sorts of data. Time series analysis is one area – the pandas Python library has all manner of handy tools for working with that sort of data that I have no idea how to drive – and text analysis is another.
So prompted by Sheila MacNeill’s post about textexture, which I guessed might be something to do with topic modeling (I should have read the about, h/t @mhawksey), here’s a quick round up of handy things the text analysts seem to be able to do pretty easily…
Taking the lazy approach, I had a quick look at the CRAN natural language processing task view to get an idea of what sort of tool support for text analysis there is in R, and a peek through the NLTK documentation to see what sort of thing we might be readily able to do in Python. Note that this take is a personal one, identifying the sorts of things that I can see I might personally have a recurring use for…
First up – extracting text from different document formats. I’ve already posted about Apache Tika, which can pull text from a wide range of documents (PDFs, Word docs, even images), and seems to be a handy, general purpose tool. (Other tools are available, but I only have so much time, and for now Tika seems to do what I need…)
Second up, concordance views. The NLTK docs describe concordance views as follows: “A concordance view shows us every occurrence of a given word, together with some context.” So for example:
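As a crude illustration of the idea, here’s a minimal pure-Python sketch of a concordancer (NLTK’s `Text.concordance` does this properly, with aligned column formatting; the function name and window size here are my own invention):

```python
def concordance(words, target, window=2):
    """Return each occurrence of target with `window` words of context either side."""
    return [
        " ".join(words[max(0, i - window):i + window + 1])
        for i, w in enumerate(words)
        if w == target
    ]

tokens = "the cat sat on the mat so the cat slept".split()
print(concordance(tokens, "cat", window=1))
# → ['the cat sat', 'the cat slept']
```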
This can be handy for skimming through multiple references to a particular item, rather than having to do a lot of clicking, scrolling or page turning.
How about if we want to compare the near co-occurrence of words or phrases in a document? One way to do this is graphically, plotting the “distance” through the text on the x-axis, and then for categorical terms on y marking out where those terms appear in the text. In NLTK, this is referred to as a lexical dispersion plot:
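The data behind such a plot is just the set of token offsets at which each term occurs; a minimal sketch of that (NLTK’s `Text.dispersion_plot` renders the equivalent with matplotlib):

```python
def dispersion(words, targets):
    """Map each target term to the list of token positions at which it appears."""
    return {t: [i for i, w in enumerate(words) if w == t] for t in targets}

tokens = "elinor said that marianne and elinor left".split()
print(dispersion(tokens, ["elinor", "marianne"]))
# → {'elinor': [0, 5], 'marianne': [3]}
```

Each list of positions can then be plotted along the x-axis against a categorical y value per term.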
I guess we could then scan across the distance axis using a windowing function to find terms that appear within a particular distance of each other? Or use co-occurrence matrices for example (eg Co-occurrence matrices of time series applied to literary works), perhaps with overlapping “time” bins? (This could work really well as a graph model – eg for 20 pages, set up page nodes 1-2, 2-3, 3-4,.., 18-19, 19-20, then an actor node for each actor, connecting actors to page nodes for page bins on which they occur; then project the bipartite graph onto just the actor nodes, connecting actors who were originally to the same page bin nodes.)
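A sketch of the page-bin idea in plain Python rather than a graph tool: overlapping two-page bins, with actors counted as co-occurring whenever they share a bin (the function and data shapes are my own invention, not a standard API; the bipartite projection onto actors is exactly this pair count):

```python
from collections import Counter
from itertools import combinations

def actor_cooccurrence(pages_by_actor, n_pages, bin_size=2):
    """Count how often each pair of actors appears in the same overlapping page bin."""
    # overlapping bins: pages 1-2, 2-3, ..., (n-1)-n
    bins = [set(range(start, start + bin_size))
            for start in range(1, n_pages - bin_size + 2)]
    pair_counts = Counter()
    for b in bins:
        present = sorted(a for a, pages in pages_by_actor.items() if b & set(pages))
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts

appearances = {"Alice": [1, 3], "Bob": [2], "Carol": [5]}
print(actor_cooccurrence(appearances, n_pages=5))
# → Counter({('Alice', 'Bob'): 2})
```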
Something that could be differently useful is spotting common sentences that appear in different documents (for example, quotations). There are surely tools out there that do this, though offhand I can’t find any..? My gut reaction would be to generate a sentence list for each document (eg using something like the handy looking textblob python library), strip quotation marks and whitespace, etc, sort each list, then run a diff on them and pull out the matched lines. (So a “reverse differ”, I think it’s called?) I’m not sure if you could easily also pull out the near misses? (If you can help me out on how to easily find matching or near matching sentences across documents via a comment or link, it’d be appreciated…:-)
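My naive recipe above might look something like this in Python: split each document into sentences, normalise away quotes, case and whitespace, then intersect the resulting sets (a crude regex sentence splitter stands in for textblob here; near-miss matching would need something fuzzier, e.g. difflib.SequenceMatcher):

```python
import re

def sentence_set(doc):
    """Sentences of doc, normalised: quotes stripped, whitespace collapsed, lowercased."""
    text = re.sub(r'[“”"]', "", doc)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return {re.sub(r"\s+", " ", s).strip().lower() for s in sentences if s.strip()}

def common_sentences(doc_a, doc_b):
    return sentence_set(doc_a) & sentence_set(doc_b)

a = '"To be, or not to be." He paused.'
b = "To be, or not to be. That is the question."
print(common_sentences(a, b))
# → {'to be, or not to be.'}
```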
The more general approach is to just measure document similarity – TF-IDF (Term Frequency – Inverse Document Frequency) and cosine similarity are key phrases here. I guess this approach could also be applied to sentences to find common ones across documents, (eg SO: Similarity between two text documents), though I guess it would require comparing quite a large number of sentences (for ~N sentences in each doc, it’d require N^2 comparisons)? I suppose you could optimise by ignoring comparisons between sentences of radically different lengths? Again, presumably there are tools that do this already?
Unlike simply counting common words that aren’t stop words in a document to find the most popular words in a doc, TF-IDF moderates the simple count (the term frequency) with the inverse document frequency. If a word is popular in every document, the term frequency is large and the document frequency is large, so the inverse document frequency (one divided by the document frequency) is small – which in turn gives a reduced TF-IDF value. If a term is popular in one document but not any other, the document frequency is small and so the relative document frequency is large, giving a large TF-IDF for the term in the rare document in which it appears. TF-IDF helps you spot words that are rare across documents or uncommonly frequent within documents.
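To make the arithmetic concrete, here’s a minimal, unsmoothed TF-IDF and cosine similarity sketch in pure Python (real code would use something like scikit-learn’s TfidfVectorizer, which also handles tokenisation, smoothing and normalisation):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One {term: tf * idf} dict per document, with idf = log(N / document frequency)."""
    tokenised = [d.lower().split() for d in docs]
    df = Counter(w for doc in tokenised for w in set(doc))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in tokenised]

def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm = lambda vec: math.sqrt(sum(x * x for x in vec.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

docs = ["the cat sat", "the dog ran", "the cat ran"]
vecs = tfidf_vectors(docs)
# "the" appears in every doc, so its idf (and hence its weight) is zero;
# documents sharing rarer words score higher
print(cosine(vecs[0], vecs[2]) > cosine(vecs[0], vecs[1]))
# → True
```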
Topic models: I thought I’d played with these quite a bit before, but if I did the doodles didn’t make it as far as the blog… The idea behind topic modeling is to generate a set of key terms – topics – that provide an indication of the topic of a particular document. (It’s a bit more sophisticated than using a count of common words that aren’t stopwords to characterise a document, which is the approach that tends to be used when generating wordclouds…) There are some pointers in the comments to A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets about topic modeling in R using the R topicmodels package; this ROpenSci post on Topic Modeling in R has code for a nice interactive topic explorer; this notebook on Topic Modeling 101 looks like a handy intro to topic modeling using the gensim Python package.
Automatic summarisation/text summary generation: again, I thought I’d dabbled with this but there’s no sign of it on this blog:-( There are several tools and recipes out there that will generate text summaries of long documents, but I guess they could be hit and miss, and I’d need to play with a few of them to see how easy they are to use and how well they seem to work/how useful they appear to be. The python sumy package looks quite interesting in this respect (example usage) and is probably where I’d start. A simple description of a basic text summariser can be found here: Text summarization with NLTK.
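The basic frequency-scoring approach described in that NLTK post can be sketched in a few lines of pure Python: score each sentence by the corpus frequency of its words and keep the top scorers, in document order (a real summariser would strip stopwords and normalise for sentence length):

```python
import re
from collections import Counter

def summarise(text, n_sentences=2):
    """Keep the n sentences whose words are most frequent across the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    score = lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)

text = "Cats are great. Cats sleep a lot. Dogs bark."
print(summarise(text, n_sentences=1))
# → Cats sleep a lot.
```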
So – what have I missed?
PS In passing, see this JISC review from 2012 on the Value and Benefits of Text Mining.
Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it.
So for example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:
#Set up a connection to a local copy of the ergast database
library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Run a query
q = 'SELECT code, grid, year, COUNT(l.lap) AS Laps
     FROM (SELECT grid, raceId, driverId FROM results) rg, lapTimes l, races r, drivers d
     WHERE rg.raceId=l.raceId AND d.driverId=l.driverId AND rg.driverId=l.driverId
       AND l.position=1 AND r.raceId=l.raceId
     GROUP BY grid, driverRef, year
     ORDER BY year'
driverlapsledfromgridposition = dbGetQuery(ergastdb, q)
In this case, the data is a table that shows, for each year, a count of laps led by each driver given their grid position in the corresponding races (null values are not reported). The data grabbed from the database is passed into a dataframe in a relatively tidy format, from which we can easily generate a visualisation.
The chart I have opted for is a text plot faceted by year:
The count of lead laps for a given driver by grid position is given as a text label, sized by count, and rotated to minimise overlap. The horizontal axis is actually a logarithmic scale, which “stretches out” the positions at the front of the grid (grid positions 1 and 2) compared to positions lower down the grid – where counts are likely to be lower anyway. To try to recapture some sense of where grid positions lie along the horizontal axis, a dashed vertical line at grid position 2.5 marks out the front row. The x-axis is further expanded to prevent labels being obscured or overflowing off the left hand side of the plotting area. The clean black and white theme finishes off the chart.
g = ggplot(driverlapsledfromgridposition)
g = g + geom_vline(xintercept = 2.5, colour='lightgrey', linetype='dashed')
g = g + geom_text(aes(x=grid, y=code, label=Laps, size=log(Laps), angle=45))
g = g + facet_wrap(~year) + xlab(NULL) + ylab(NULL) + guides(size=FALSE)
g + scale_x_log10(expand=c(0,0.3)) + theme_bw()
There are still a few problems with this graphic, however. The labels on the y-axis are in alphabetical order, and would perhaps be more informative if ordered to show championship rankings, for example.
However, to return to the main theme of this post, whilst the R language and RStudio environment are being used as a medium within which this activity has taken place, the data wrangling and analysis (in the sense of counting) is being performed by the SQL query, and the visual representation and analysis (in the sense of faceting, for example, and generating visual cues based on data properties) is being performed by routines supplied as part of the ggplot library.
So if asked whether this is an example of using R for data analysis and visualisation, what would your response be? What does it take for something to be peculiarly or particularly an R based analysis?
Perhaps it’s just because my antennae are sensitised at the moment, post posting Open Practice and My Academic Philosophy, Sort Of… Erm, Maybe… Perhaps..?!, but here are a couple more folk saying much the same thing…
From @Downes, getting on for five years ago now (The Role of the Educator), a description of how several elements of his open practice (hacking useful code, running open online courses (though he just calls them “online courses”; five years ago, remember, before “open” was the money phrase?!;-), sharing through a daily links round up and conference presentations, and thinking about stuff) have led:
to an overall approach not only to learning online but to learning generally. It’s not simply that I’ve adopted this approach; it’s that I and my colleagues have observed this approach emerging in the community generally.
It’s an approach that emphasizes open learning and learner autonomy. It’s an approach that argues that course content is merely a tool employed to stimulate and support learning — a McGuffin, as I’ve called it in various presentations, “a plot element that catches the viewers attention or drives the plot of a work of fiction” — rather than the object of learning itself. It’s an approach that promotes a pedagogy of learning by engagement and activity within an authentic learning community — a community of practitioners, where people practice the discipline, rather than merely just talk about it.
It’s an approach that emphasizes exercises involving those competencies rather than deliberate acts of memorization or rote, an approach that seeks to grow knowledge in a manner analogous to building muscles, rather than to transfer or construct knowledge through some sort of cognitive process.
It’s an approach that fosters a wider and often undefined set of competencies associated with a discipline, a recognition that knowing, say, physics, isn’t just to know the set of facts and theories related to physics, but rather to embody a wider set of values, beliefs, ways of observing and even mannerisms associated with being a physicist (it is the caricature of this wider set of competencies that makes The Big Bang Theory so funny).
Concordant with this approach has been the oft-repeated consensus that the role of the educator will change significantly. Most practitioners in the field are familiar with the admonishment that an educator will no longer be a “sage on the stage”. But that said, many others resist the characterization of an educator as merely a “guide by the side.” We continue to expect educators to play an active role in learning, but it has become more difficult to characterize exactly what that role may be.
In my own work, I have stated that the role of the teacher is to “model and demonstrate.” What I have tried to capture in this is the idea that students need prototypes on which to model their own work. Readers who have learned to program computers by copying and adapting code will know what I mean. But it’s also, I suppose, why I see the footprints of Raymond Chandler all through William Gibson’s writing. We begin by copying successful practice, and then begin to modify that practice to satisfy our own particular circumstances and needs.
In order for this to happen, the instructor must be more than just a presenter or lecturer. The instructor, in order to demonstrate practice, is required to take a more or less active role in the disciplinary or professional community itself, demonstrating by this activity successful tactics and techniques within that community, and modeling the approach, language and world view of a successful practitioner. This is something we see in medicine already, as students learn as interns working alongside doctors or nurse practitioners.
Five years ago…
At the other end of the career spectrum, grad student Sarah Crissinger had to write a “one-page teaching philosophy” as part of a recent job application (Reflections on the Job Hunt: Writing a Teaching Philosophy). Reflecting on two different approaches to teaching she had witnessed from two different yoga classes, one good, one bad, she observed of the effective teacher that:
[h]e starts every class by telling students that the session isn’t about replicating the exact pose he is doing. It’s more about how your individual body feels in the pose. In other words, he empowers students to do what they can without feeling shame about not being as flexible as their neighbor. He also solidifies the expectations of the class by saying upfront what the goals are and then he reiterates those expectations by giving modifications for each pose and talking about how your body should feel instead of how it should look.
…which in part reminded me of the cookery style promoted by James Barber, aka the urban peasant…
Sarah Crissinger also made this nice observation:
Teachers reflect on teaching even when we don’t mean to.
That is, effective teachers are also adaptive learning machines… (Reflection is part of the self-correcting feedback path.)
See also: Sheila MacNeill on How do you mainstream open education and OERs? A bit of feedback sought for #oer15, and the comments therefrom. Sheila’s approach also brings to mind The Art Of Guerrilla Research, which emphasises the “just do it” attitude of open practice…
Rebuilding a fresh version of the TM351 VM from scratch yesterday, I got an error trying to install tty.js, a node.js app that provides a “terminal desktop in the browser”.
Looking into a copy of the VM where tty.js does work, I could discover the version of node I’d previously successfully used, as well as check all the installed package versions:
### Show nodejs version and packages
> node -v
v0.10.35
> npm list -g
/usr/local/lib
└─┬ tty.js@…
  ├── …
  └── …
Using this information, I could then use nvm, a node.js version manager, installed via:
curl https://raw.githubusercontent.com/creationix/nvm/v0.23.3/install.sh | NVM_DIR=/usr/local/lib/ bash
to install, from a new shell, the version I knew worked:
nvm install 0.10.35
npm install tty.js
(I should probably pin the tty.js version in there too? npm install tty.js@… perhaps?)
The terminal can then be run as a daemon from:
/usr/local/lib/node_modules/tty.js/bin/tty.js --port 3000 --daemonize
What this got me wondering was: are there any utilities that let you capture a nodejs configuration, for example, and then recreate it on a new machine? That is, export the node version number and the versions of the installed packages, then create an installation script that will recreate that setup?
It would be handy if this approach could be extended further. For example, we can also list the packages – and their version numbers – installed on the Linux box (using dpkg, for example):

### Show packages
> dpkg -l
And we can get a list of Python packages – and their version numbers – using pip:

### Show Python packages
> pip freeze
Surely there must be some simple tools/utilities out there that support this sort of thing? Or even just cheatsheets that show you which commands to run to export the packages and versions into a file, in a format that allows you to use that file as part of an installation script on a new machine to help rebuild the original one?
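For the Python packages at least, pip supports exactly this round trip out of the box (npm shrinkwrap and dpkg --get-selections play similar roles for node and Debian packages, though I haven’t checked how cleanly those replay):

```shell
# export the installed Python packages, pinned to their exact versions...
pip freeze > requirements.txt

# ...then, on the machine being rebuilt, reinstall that exact set
pip install -r requirements.txt
```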
Having got my promotion case through the sub-Faculty level committee (with support and encouragement from senior departmental colleagues), it’s time for another complete rewrite to try to get it though the Faculty committee. Guidance suggests that it is not inappropriate – and may even be encouraged – for a candidate to include something about their academic philosophy, so here are some scribbled thoughts on mine…
One of the declared Charter objects (sic) of the Open University is "to promote the educational well-being of the community generally", as well as "the advancement and dissemination of learning and knowledge". Both as a full-time PhD student with the OU (1993-1997), and then as an academic (1999-), I have pursued a model of open practice, driven by the idea of learning in public, with the aim of communicating academic knowledge into, and as part of, wider communities of practice, modeling learning behaviour through demonstrating my own learning processes, and originating new ideas in a challengeable and open way as part of my own learning journey.
My interest in open educational resources is in part a subterfuge, driven by a desire that educators be more open in demonstrating their own learning and critical practices, including the confusion and misconceptions they grapple with along the way, rather than being seen simply as professors of some sort of inalienable academic truth.
My interest in short course development is based on the belief that for the University to contribute effectively to continued lifelong education and professional development, we need to have offerings that are at an appropriate level of granularity as well as academic level. Degrees represent only one - early - part of that journey. Learners are unlikely to take more than one undergraduate degree in their lifetime, but there is no reason why they should not continue to engage in learning throughout their life. Evidence from the first wave of MOOCs suggests that many participants in those courses were already graduates, with an appreciation of the values of learning and the skills to enable them to engage with those offerings.
The characterisation of MOOCs as xMOOCs (traditional course style offerings) or the looser, networked model of "connectivist MOOCs", cMOOCs, [H/T @r3becca in the comments;-)] represents two different educational philosophies: the former may cruelly be described as being based on a model in which the learner expects to be taught (and the instructors expect to profess), whereas the latter requires that participants engage in a more personal, yet still collaborative, learning journey, where it is up to each participant to make sense of the world in an open and public way, informed and aided, but also challenged, by other participants. That's how I work every day. I try to make sense of the world to myself, often for a purpose, in public.
Much of my own learning is the direct result of applied problem solving. I try to learn something every day, often as the result of trying to do something each day that I haven't been able to do before. The OUseful.info blog is my own learning diary and a place I can look to refer to things I have previously learned. The posts are written in a way that reinforces my own learning, as a learning resource. The posts often take longer to write than the time taken to discover or originate the thing learned, because in them I try to represent a reflection and retelling of the rationale for the learning event and the context in which it arose: a problem to be solved, my state of knowledge at the time, the means by which I came to make sense of the situation in order to proceed, and the learning nugget that resulted. The thing I can see or do now but couldn't before.

Capturing the "I couldn't do X because of Y but now I can, by doing Z" supports a similar form of discovery as the one supported by question and answer sites: the content is auto-optimised to include both naive and expert information, which aids discovery. (It has often amused me that course descriptions tend to be phrased in the terms and language you might expect to know having completed the course. Which doesn't help the novice discover it a priori, before they have learned those keywords, concepts or phrases that the course will introduce them to...)

The posts also try to model my own learning process, demonstrating the confusion, showing where I had a misapprehension or just plain got it wrong. The blog also represents a telling of my own learning journey over an extended period of time, and as such may be thought of as an uncourse: something that could perhaps be looked at post hoc as a course, but that was originated as my own personal learning journey unfolded.
Hmmm… 1500 words for the whole begging letter, so I need to cut the above down to a sentence…
It’s been some time now since I drafted most of my early unit contributions to the TM351 Data management and analysis course. Part of the point (for me) in drafting that material was to find out what sorts of thing we actually wanted to say and to help identify the sorts of abstractions we wanted to then build a narrative around. Another part of this (for me) means exploring new ways of putting powerful “academic” ideas and concepts into meaningful contexts; finding new ways to describe them; finding ways of using them in conjunction with other ideas; or finding new ways of using – or appropriating – them in general (which in turn may lead to new ways of thinking about them). These contexts are often process based, demonstrating how we can apply the ideas or put them to use (make them useful…) or use the ideas to support problem identification, problem decomposition and problem solving. At heart, I’m more of a creative technologist than a scientist or an engineer. (I aspire to being an artist…;-)
Someone who I think has a great take on conceptualising the data wrangling process – in part arising from his prolific tool building approach in the R language – is Hadley Wickham. His recent work for RStudio is built around an approach to working with data that he’s captured as follows (e.g. “dplyr” tutorial at useR 2014 , Pipelines for Data Analysis):
Following an often painful and laborious process of getting data into a state where you can actually start to work with it, you can then enter into an iterative process of transforming the data into various shapes and representations (often in the sense of re-presentations) that you can easily visualise or build models from. (In practice, you may have to keep redoing elements of the tidy step and then re-feed the increasingly cleaned data back into the sensemaking loop.)
Hadley’s take on this is that the visualisation phase can spring surprises on you but doesn’t scale very well, whilst the modeling phase scales but doesn’t surprise you.
To support the different phases of activity, Hadley has been instrumental in developing several software libraries for the R programming language that are particularly suited to the different steps. (For the modeling, there are hundreds of community developed and often very specialised R libraries for doing all manner of weird and wonderful statistics…)
In many respects, I’ve generally found the way Hadley has presented his software libraries to be deeply pragmatic – the tools he’s developed are useful and in many senses naturalistic; they help you do the things you need to do in a way that makes practical sense. The steps they encourage you to take are natural ones, and useful ones. They are the sorts of tools that implement the sorts of ideas that come to mind when you’re faced with a problem and you think: this is the sort of thing I need (to be able) to do. (I can’t comment on how well implemented they are; I suspect: pretty well…)
Just as the data wrangling process diagram helps frame the sorts of things you’re likely to do into steps that make sense in a “folk computational” way (in the sense of folk computing or folk IT (also here), a computational correlate to notions of folk physics, for example), Hadley also has a handy diagram for helping us think about the process of solving problems computationally in a more general, problem solving sense:
A cognitive think it step, identifying a problem, and starting to think about what sort of answer you want from it, as well as how you might start to approach it; a describe it step, where you describe precisely what it is you want to do (the sort of step where you might start scribbling pseudo-code, for example); and the computational do it step where the computational grunt work is encoded in a way that allows it to actually get done by machine.
I’ve been pondering my own stance towards computing lately, particularly from my own context as someone who sees computery stuff from a more technology, tool-building and tool-using perspective (that is, using computery things to help you do useful stuff), rather than framing it as a purer computer science or even “trad computing” take on operationalised logic, where the practical why is often ignored.
So I think this is how I read Hadley’s diagram…
Figuring out what the hell it is you want to do (imagining, the what for a particular why), figuring out how to do it (precisely; the programming step; the how); hacking that idea into a form that lets a machine actually do it for you (the coding step; the step where you express the idea in a weird incantation where every syllable has to be the right syllable; and from which the magic happens).
One of the nice things about Hadley’s approach to supporting practical spell casting (?!) is that the transformation or operational steps his libraries implement are often based around naturalistic verbs. They sort of do what they say on the tin. For example, in the dplyr toolkit, there are the following verbs: filter() to pick out rows matching some condition; select() to pick out columns; arrange() to reorder rows; mutate() to add new, derived columns; and summarise() to reduce groups of rows (set up with group_by()) down to summary values.
These sort of map onto elements (often similarly named) familiar to anyone who has used SQL, but in a friendlier way. (They don’t SHOUT AT YOU for a start.) It almost feels as if they have been designed as articulations of the ideas that come to mind when you are trying to describe (precisely) what it is you actually want to do to a dataset when working on a particular problem.
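The same sort of verbs have rough equivalents over in the pandas world too (the method names here are pandas’ own, mapped loosely onto dplyr’s verbs; the data is made up for illustration):

```python
import pandas as pd

laps = pd.DataFrame({"driver": ["HAM", "VET", "HAM", "ROS"],
                     "laps":   [10,     5,     20,    0]})

leaders = (laps[laps.laps > 0]                        # filter: keep rows matching a condition
             .assign(stints=lambda d: d.laps // 5)    # mutate: add a derived column
             .groupby("driver", as_index=False)       # group_by
             .agg(total=("laps", "sum"))              # summarise: reduce each group
             .sort_values("total", ascending=False))  # arrange: reorder rows

print(leaders)
```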
In a similar way, the ggvis library (the interactive chart reinvention of Hadley’s ggplot2 library) builds on the idea of Leland Wilkinson’s “The Grammar of Graphics” and provides a way of summoning charts from data in an incremental way, as well as a functionally and grammatically coherent way. The words the libraries use encourage you to articulate the steps you think you need to take to solve a problem – and then, as if by magic, they take those steps for you.
If programming is the meditative state you need to get into to cast a computery-thing spell, and coding is the language of magic, things like dplyr help us cast spells in the vernacular.
You know how it goes – you start trying to track down a forward “forthcoming” reference and you end up wending your way through all manner of things until you get back to where you started none the wiser… So here’s a snapshot of several docs I found whilst trying to source the original forward reference for the following table, found in Improving information to support decision making: standards for better quality data (November 2007, first published October 2007), with the crib that several references to it mentioned the Audit Commission…
The first thing I came across was The Use of Information in Decision Making – Literature Review for the Audit Commission (2008), prepared by Dr Mike Kennerley and Dr Steve Mason, Centre for Business Performance Cranfield School of Management, but that wasn’t it… This document does mention a set of activities associated with the data-to-decision process: Which Data, Data Collection, Data Analysis, Data Interpretation, Communication, Decision making/planning.
The data and information definitions from the table do appear in a footnote – without reference – in Nothing but the truth? A discussion paper from the Audit Commission in Nov 2009, but that’s even later… The document does, however, identify several characteristics (cited from an earlier 2007 report (Improving Information, mentioned below…), and endorsed at the time by Audit Scotland, Northern Ireland Audit Office, Wales Audit Office and CIPFA, with the strong support of the National Audit Office), that contribute to a notion of “good quality” data:
Good quality data is accurate, valid, reliable, timely, relevant and complete. Based on existing guidance and good practice, these are the dimensions reflected in the voluntary data standards produced by the Audit Commission and the other UK audit agencies
* Accuracy – data should be sufficiently accurate for the intended purposes.
* Validity – data should be recorded and used in compliance with relevant requirements.
* Reliability – data should reflect stable and consistent data collection processes across collection points and over time.
* Timeliness – data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period.
* Relevance – data captured should be relevant to the purposes for which it is used.
* Completeness – data requirements should be clearly specified based on the information needs of the body and data collection processes matched to these requirements.
The document also has some pretty pictures, such as this one of the data chain:
In the context of the data/information/knowledge definitions, the Audit Commission discussion document also references the 2008 HMG strategy document Information matters: building government’s capability in managing knowledge and information, which includes the table in full; a citation link is provided, but it 404s, and a source is given as the November 2008 version of Improving information, the one we originally started with. So the original report forward refers the table to an unspecified source, but later reports in the area refer back to that “original” without making a claim to the actual table itself?
Just in passing, whilst searching for the Improving information report, I actually found another version of it… Improving information to support decision making: standards for better quality data, Audit Commission, first published March 2007.
The table and the definitions as cited in Information Matters do not seem to appear in this earlier version of the document?
PS Other tables do appear in both versions of the report. For example, both the March 2007 and November 2007 versions of the doc contain this table (here, taken from the 2008 doc) of stakeholders:
Anyway, aside from all that, several more documents for my reading list pile…
PS see also Audit Commission – “In the Know” from February 2008.