So What Can Text Analysis Do for You?

Despite believing we can treat anything we can represent in digital form as “data”, I’m still pretty flakey on understanding what sorts of analysis we can easily do with different sorts of data. Time series analysis is one area – the pandas Python library has all manner of handy tools for working with that sort of data that I have no idea how to drive – and text analysis is another.

So prompted by Sheila MacNeill’s post about textexture, which I guessed might be something to do with topic modeling (I should have read the about, h/t @mhawksey), here’s a quick round up of handy things the text analysts seem to be able to do pretty easily…

Taking the lazy approach, I had a quick look at the CRAN natural language processing task view to get an idea of what sort of tool support for text analysis there is in R, and a peek through the NLTK documentation to see what sort of thing we might be readily able to do in Python. Note that this take is a personal one, identifying the sorts of things that I can see I might personally have a recurring use for…

First up – extracting text from different document formats. I’ve already posted about Apache Tika, which can pull text from a wide range of documents (PDFs, Word docs, images) and seems to be a handy, general purpose tool. (Other tools are available, but I only have so much time, and for now Tika seems to do what I need…)

Second up, concordance views. The NLTK docs describe concordance views as follows: “A concordance view shows us every occurrence of a given word, together with some context.” So for example:

This can be handy for skimming through multiple references to a particular item, rather than having to do a lot of clicking, scrolling or page turning.
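As a rough sketch of what a concordance view involves, here is a plain Python version that uses only the standard library rather than NLTK itself. The function name, the bracketed keyword layout and the sample text are all made up for illustration; NLTK's own output is formatted differently.

```python
import re

def concordance(text, word, width=30):
    """Return one line per occurrence of `word`, with `width` characters of context on each side."""
    lines = []
    # \b restricts the match to whole words; IGNORECASE treats "Whale" and "whale" alike
    pattern = r"\b" + re.escape(word) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start, end = match.start(), match.end()
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        # Right-align the left-hand context so the keyword lines up in a column
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Hypothetical sample text
sample = (
    "The whale surfaced at dawn. Nobody aboard had seen a whale so large, "
    "and the crew watched the Whale until it dived."
)

hits = concordance(sample, "whale", width=20)
for line in hits:
    print(line)
```

Lining the keyword up in a column is what makes the view quick to skim.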

How about if we want to compare the near co-occurrence of words or phrases in a document? One way to do this is graphically, plotting the “distance” through the text on the x-axis, and then for categorical terms on y marking out where those terms appear in the text. In NLTK, this is referred to as a lexical dispersion plot:

I guess we could then scan across the distance axis using a windowing function to find terms that appear within a particular distance of each other? Or use co-occurrence matrices for example (eg Co-occurrence matrices of time series applied to literary works), perhaps with overlapping “time” bins? (This could work really well as a graph model – eg for 20 pages, set up page nodes 1-2, 2-3, 3-4,.., 18-19, 19-20, then an actor node for each actor, connecting actors to page nodes for page bins on which they occur; then project the bipartite graph onto just the actor nodes, connecting actors who were originally connected to the same page bin nodes.)
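The page-bin graph idea can be sketched in a few lines of standard-library Python. This is only a sketch under assumptions: the function name, the dictionary input format and the example actors are hypothetical, and a graph library would be the more natural home for the projection step on anything large.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_by_page_bin(pages_by_actor, n_pages):
    """Count, for each pair of actors, how many overlapping two-page bins they share.

    pages_by_actor maps an actor name to the set of (1-based) pages they appear on.
    Bins are (1,2), (2,3), ..., (n_pages-1, n_pages).
    """
    # Bipartite step: link each page-bin node to the actors appearing in it
    actors_in_bin = defaultdict(set)
    for actor, pages in pages_by_actor.items():
        for first in range(1, n_pages):
            if first in pages or first + 1 in pages:
                actors_in_bin[(first, first + 1)].add(actor)

    # Projection step: connect actors sharing a bin, weighted by the number of shared bins
    weights = defaultdict(int)
    for actors in actors_in_bin.values():
        for a, b in combinations(sorted(actors), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical appearances in a five-page document
appearances = {
    "Alice": {1, 4},
    "Bob": {2},
    "Carol": {5},
}
edges = cooccurrence_by_page_bin(appearances, n_pages=5)
print(edges)
```

Because the bins overlap, actors on adjacent pages (Alice on page 4, Carol on page 5) still end up connected, which is the point of using overlapping bins rather than single pages.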

Something that could be differently useful is spotting common sentences that appear in different documents (for example, quotations). There are surely tools out there that do this, though offhand I can’t find any..? My gut reaction would be to generate a sentence list for each document (eg using something like the handy looking textblob python library), strip quotation marks and whitespace, etc, sort each list, then run a diff on them and pull out the matched lines. (So a “reverse differ”, I think it’s called?) I’m not sure if you could easily also pull out the near misses? (If you can help me out on how to easily find matching or near matching sentences across documents via a comment or link, it’d be appreciated…:-)
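One way the sentence matching recipe might look, using only the Python standard library: a set intersection finds the exact matches, and difflib's similarity ratio picks up the near misses. The sentence splitter here is deliberately naive (something like textblob would do a better job), and the function names, the 0.8 cutoff and the sample documents are assumptions for illustration.

```python
import difflib
import re

def sentences(text):
    """Split text into normalised sentences: double quotes stripped, whitespace collapsed, lower-cased."""
    text = re.sub(r"[\"“”]", "", text)
    # Naive split on ., ! or ? followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.sub(r"\s+", " ", p).strip().lower() for p in parts if p.strip()]

def shared_sentences(doc_a, doc_b, near=0.8):
    """Return (exact, near_misses) for the sentences of two documents."""
    a, b = sentences(doc_a), sentences(doc_b)
    exact = sorted(set(a) & set(b))
    near_misses = []
    for s in a:
        if s in exact:
            continue
        # get_close_matches ranks candidates by difflib's similarity ratio
        for match in difflib.get_close_matches(s, b, n=1, cutoff=near):
            near_misses.append((s, match))
    return exact, near_misses

# Hypothetical documents
doc_a = 'The race was won on strategy. "Tyres were the deciding factor." It rained at the start.'
doc_b = "It was a long afternoon. Tyres were the deciding factor. The race was won on the strategy."

exact, near_misses = shared_sentences(doc_a, doc_b)
print(exact)
print(near_misses)
```

The near-miss step compares every unmatched sentence in one document against every sentence in the other, so it will be slow on long documents.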

The more general approach is to just measure document similarity – TF-IDF (Term Frequency – Inverse Document Frequency) and cosine similarity are key phrases here. I guess this approach could also be applied to sentences to find common ones across documents, (eg SO: Similarity between two text documents), though I guess it would require comparing quite a large number of sentences (for ~N sentences in each doc, it’d require N^2 comparisons)? I suppose you could optimise by ignoring comparisons between sentences of radically different lengths? Again, presumably there are tools that do this already?

Unlike simply counting common words that aren’t stop words in a document to find the most popular words in a doc, TF-IDF moderates the simple count (the term frequency) with the inverse document frequency. If a word is popular in every document, the term frequency is large and the document frequency is large, so the inverse document frequency (one divided by the document frequency) is small – which in turn gives a reduced TF-IDF value. If a term is popular in one document but not any other, the document frequency is small and so the inverse document frequency is large, giving a large TF-IDF for the term in the rare document in which it appears. TF-IDF helps you spot words that are rare across documents or uncommonly frequent within documents.
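To make the arithmetic concrete, here is a minimal standard-library Python sketch of TF-IDF weighting and cosine similarity. It makes one assumption beyond the description above: it uses the common log-scaled form of the inverse document frequency, log(N / df), rather than a plain 1 / df. The function names and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus
docs = [
    "the tyres decided the race",
    "the tyres decided the championship",
    "the weather ruined the picnic",
]
vecs = tfidf_vectors(docs)
print(vecs[0])
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A word that appears in every document ("the") gets a weight of zero, a word unique to one document ("race") gets the largest weight, and the first two documents come out as more similar to each other than either is to the third.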

Topic models: I thought I’d played with these quite a bit before, but if I did the doodles didn’t make it as far as the blog… The idea behind topic modeling is to generate a set of key terms – topics – that provide an indication of the topic of a particular document. (It’s a bit more sophisticated than using a count of common words that aren’t stopwords to characterise a document, which is the approach that tends to be used when generating wordclouds…) There are some pointers in the comments to A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets about topic modeling in R using the R topicmodels package; this ROpenSci post on Topic Modeling in R has code for a nice interactive topic explorer; this notebook on Topic Modeling 101 looks like a handy intro to topic modeling using the gensim Python package.

Automatic summarisation/text summary generation: again, I thought I’d dabbled with this but there’s no sign of it on this blog:-( There are several tools and recipes out there that will generate text summaries of long documents, but I guess they could be hit and miss and I’d need to play with a few of them to see how easy they are to use and how well they seem to work/how useful they appear to be. The python sumy package looks quite interesting in this respect (example usage) and is probably where I’d start. A simple description of a basic text summariser can be found here: Text summarization with NLTK.
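For a feel of what a very basic summariser does, here is a sketch of one of the simplest extractive approaches: score each sentence by the frequency of the (non-stopword) words it contains and keep the top scorers. This is not necessarily what sumy or the linked post does; the stopword list, function name and sample text are all assumptions for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real tools ship much longer ones
STOPWORDS = {"a", "an", "and", "are", "as", "at", "by", "for", "in", "is",
             "it", "of", "on", "the", "to", "was", "with"}

def content_words(text):
    """Lower-cased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarise(text, n_sentences=2):
    """Keep the n highest-scoring sentences, reported in their original order."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(content_words(text))
    # A sentence's score is the summed document-wide frequency of its content words
    scores = [sum(freq[w] for w in content_words(s)) for s in sents]
    ranked = sorted(range(len(sents)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sents[i] for i in keep)

# Hypothetical sample text
text = (
    "Tyre wear decided the race. The weather was pleasant. "
    "Teams that managed tyre wear gained places late in the race. "
    "Lunch was served at noon."
)
summary = summarise(text)
print(summary)
```

Summing the scores favours long sentences, which is one reason this sort of recipe can be hit and miss; dividing by sentence length is a common tweak.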

So – what have I missed?

PS In passing, see this JISC review from 2012 on the Value and Benefits of Text Mining.

Tools in Tandem – SQL and ggplot. But is it Really R?

Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it.

So for example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:

#Set up a connection to a local copy of the ergast database
library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Run a query
q='SELECT code, grid, year, COUNT(l.lap) AS Laps
FROM (SELECT grid, raceId, driverId from results) rg,
lapTimes l, races r, drivers d
WHERE rg.raceId=l.raceId AND d.driverId=l.driverId
AND rg.driverId=l.driverId AND l.position=1 AND r.raceId=l.raceId
GROUP BY grid, driverRef, year
ORDER BY year'

driverlapsledfromgridposition=dbGetQuery(ergastdb,q)


In this case, the data is a table that shows for each year a count of laps led by each driver given their grid position in corresponding races (null values are not reported). The data grabbed from the database is passed into a dataframe in a relatively tidy format, from which we can easily generate a visualisation of it.

The chart I have opted for is a text plot faceted by year:

The count of lead laps for a given driver by grid position is given as a text label, sized by count, and rotated to minimise overlap. The horizontal axis is actually a logarithmic scale, which “stretches out” the positions at the front of the grid (grid positions 1 and 2) compared to positions lower down the grid – where counts are likely to be lower anyway. To try to recapture some sense of where grid positions lie along the horizontal axis, a dashed vertical line at grid position 2.5 marks out the front row. The x-axis is further expanded to guard against labels being obscured or overflowing off the left hand side of the plotting area. The clean black and white theme finishes off the chart.

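library(ggplot2)

g = ggplot(driverlapsledfromgridposition)
#Dashed line between grid positions 2 and 3 marks out the front row
g = g + geom_vline(xintercept = 2.5, colour='lightgrey', linetype='dashed')
g = g + geom_text(aes(x=grid, y=code, label=Laps, size=log(Laps), angle=45))
#guides(size='none') hides the size legend (size=FALSE is deprecated in recent ggplot2 releases)
g = g + facet_wrap(~year) + xlab(NULL) + ylab(NULL) + guides(size='none')
g + scale_x_log10(expand=c(0,0.3)) + theme_bw()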

There are still a few problems with this graphic, however. The order of labels on the y-axis is in alphabetical order, and would perhaps be more informative if ordered to show championship rankings, for example.

However, to return to the main theme of this post, whilst the R language and RStudio environment are being used as a medium within which this activity has taken place, the data wrangling and analysis (in the sense of counting) is being performed by the SQL query, and the visual representation and analysis (in the sense of faceting, for example, and generating visual cues based on data properties) is being performed by routines supplied as part of the ggplot library.

So if asked whether this is an example of using R for data analysis and visualisation, what would your response be? What does it take for something to be peculiarly or particularly an R based analysis?

For more details, see the “Laps Completed and Laps Led” draft chapter and the Wrangling F1 Data With R book.

Code as Magic, and the Vernacular of Data Wrangling Verbs

It’s been some time now since I drafted most of my early unit contributions to the TM351 Data management and analysis course. Part of the point (for me) in drafting that material was to find out what sorts of thing we actually wanted to say and help identify the sorts of abstractions we wanted to then build a narrative around. Another part of this (for me) means exploring new ways of putting powerful “academic” ideas and concepts into meaningful contexts; finding new ways to describe them; finding ways of using them in conjunction with other ideas; or finding new ways of using – or appropriating them – in general (which in turn may lead to new ways of thinking about them). These contexts are often process based, demonstrating how we can apply the ideas or put them to use (make them useful…) or use the ideas to support problem identification, problem decomposition and problem solving. At heart, I’m more of a creative technologist than a scientist or an engineer. (I aspire to being an artist…;-)

Someone who I think has a great take on conceptualising the data wrangling process – in part arising from his prolific tool building approach in the R language – is Hadley Wickham. His recent work for RStudio is built around an approach to working with data that he’s captured as follows (e.g. “dplyr” tutorial at useR 2014, Pipelines for Data Analysis):

Following an often painful and laborious process of getting data into a state where you can actually start to work with it, you can then enter into an iterative process of transforming the data into various shapes and representations (often in the sense of re-presentations) that you can easily visualise or build models from. (In practice, you may have to keep redoing elements of the tidy step and then re-feed the increasingly cleaned data back into the sensemaking loop.)

Hadley’s take on this is that the visualisation phase can spring surprises on you but doesn’t scale very well, whilst the modeling phase scales but doesn’t surprise you.

To support the different phases of activity, Hadley has been instrumental in developing several software libraries for the R programming language that are particularly suited to the different steps. (For the modeling, there are hundreds of community developed and often very specialised R libraries for doing all manner of weird and wonderful statistics…)

In many respects, I’ve generally found the way Hadley has presented his software libraries to be deeply pragmatic – the tools he’s developed are useful and in many senses naturalistic; they help you do the things you need to do in a way that makes practical sense. The steps they encourage you to take are natural ones, and useful ones. They are the sorts of tools that implement the sorts of ideas that come to mind when you’re faced with a problem and you think: this is the sort of thing I need (to be able) to do. (I can’t comment on how well implemented they are; I suspect: pretty well…)

Just as the data wrangling process diagram helps frame the sorts of things you’re likely to do into steps that make sense in a “folk computational” way (in the sense of folk computing or folk IT (also here), a computational correlate to notions of folk physics, for example), Hadley also has a handy diagram for helping us think about the process of solving problems computationally in a more general, problem solving sense:

A cognitive think it step, identifying a problem, and starting to think about what sort of answer you want from it, as well as how you might start to approach it; a describe it step, where you describe precisely what it is you want to do (the sort of step where you might start scribbling pseudo-code, for example); and the computational do it step where the computational grunt work is encoded in a way that allows it to actually get done by machine.

I’ve been pondering my own stance towards computing lately, particularly from my own context of someone who sees computery stuff from a more technology, tool building and tool using context, (that is, using computery things to help you do useful stuff), rather than framing it as a purer computer science or even “trad computing” take on operationalised logic, where the practical why is often ignored.

Figuring out what the hell it is you want to do (imagining, the what for a particular why), figuring out how to do it (precisely; the programming step; the how); hacking that idea into a form that lets a machine actually do it for you (the coding step; the step where you express the idea in a weird incantation where every syllable has to be the right syllable; and from which the magic happens).

One of the nice things about Hadley’s approach to supporting practical spell casting (?!) is that the transformation or operational steps his libraries implement are often based around naturalistic verbs. They sort of do what they say on the tin. For example, in the dplyr toolkit, there are the following verbs:

These sort of map onto elements (often similarly named) familiar to anyone who has used SQL, but in a friendlier way. (They don’t SHOUT AT YOU for a start.) It almost feels as if they have been designed as articulations of the ideas that come to mind when you are trying to describe (precisely) what it is you actually want to do to a dataset when working on a particular problem.

In a similar way, the ggvis library (the interactive chart reinvention of Hadley’s ggplot2 library) builds on the idea of Leland Wilkinson’s “The Grammar of Graphics” and provides a way of summoning charts from data in an incremental way, as well as a functionally and grammatically coherent way. The words the libraries use encourage you to articulate the steps you think you need to take to solve a problem – and then, as if by magic, they take those steps for you.

If programming is the meditative state you need to get into to cast a computery-thing spell, and coding is the language of magic, things like dplyr help us cast spells in the vernacular.

Rediscovering Formula One Race Battlemaps

A couple of days ago, I posted a recipe on the F1DataJunkie blog that described how to calculate track position from laptime data.

Using that information, as well as additional derived columns such as the identity of, and time to, the cars immediately ahead of and behind a particular selected driver, both in terms of track position and race position, I revisited a chart type I first started exploring several years ago – race battle charts.

The main idea behind the battlemaps is that they can help us search for stories amidst the runners.

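dirattr=function(attr,dir='ahead') paste(attr,dir,sep='')

#We shall find it convenient later on to split out the initial data selection
battlemap_df_driverCode=function(driverCode){
  lapTimes[lapTimes['code']==driverCode,]
}

#Add the battlemap layers for one direction ('ahead' or 'behind') to the chart g
battlemap_core_chart=function(df,g,dir='ahead'){
  car_X=dirattr('car_',dir)
  code_X=dirattr('code_',dir)
  factor_X=paste('factor(position_',dir,'<position)',sep='')
  code_race_X=dirattr('code_race',dir)
  #diff_X should name the column holding the time gap to the car being raced
  #in this direction; its definition does not appear in this listing

  #DRS line at one second (1000ms): positive for cars ahead, negative for cars behind
  #(assumed from the chart description; the original assignment is not shown)
  if (dir=='ahead') drs=1000 else drs=-1000
  g=g+geom_hline(aes_string(yintercept=drs),linetype=5,col='grey')

  #Plot the offlap cars that aren't directly being raced
  g=g+geom_text(data=df[df[dirattr('code_',dir)]!=df[dirattr('code_race',dir)],],
                aes_string(x='lap',
                           y=car_X,
                           label=code_X,
                           col=factor_X),
                angle=45,size=2)
  #Plot the cars being raced directly
  g=g+geom_text(data=df,
                aes_string(x='lap',
                           y=diff_X,
                           label=code_race_X),
                angle=45,size=2)
  g+guides(col=guide_legend(title='Intervening car'))
}

battle_WEB=battlemap_df_driverCode('WEB')
#Build the chart in two passes: cars ahead first, then cars behind
g=battlemap_core_chart(battle_WEB,ggplot(),dir='ahead')
battlemap_core_chart(battle_WEB,g,dir='behind')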


In this first sketch, from the 2012 Australian Grand Prix, I show the battlemap for Mark Webber:

We see how at the start of the race Webber kept pace with Alonso, albeit around about a second behind, at the same time as he drew away from Massa. In the last third of the race, he was closely battling with Hamilton whilst drawing away from Alonso. Coloured labels are used to highlight cars on a different lap (either ahead (aqua) or behind (orange)) that are in a track position between the selected driver and the car one place ahead or behind in terms of race position (the black labels). The y-axis is the time delta in milliseconds between the selected car and cars ahead (y > 0) or behind (y < 0). A dashed line at the +/- one second mark identifies cars within DRS range.

As well as charting the battles in the vicinity of a particular driver, we can also chart the battle in the context of a particular race position. We can reuse the chart elements and simply need to redefine the filtered dataset we are charting.

For example, if we filter the dataset to just get the data for the car in third position at the end of each lap, we can then generate a battle map of this data.

battlemap_df_position=function(position){
lapTimes[lapTimes['position']==position,]
}

battleForThird=battlemap_df_position(3)

g=battlemap_core_chart(battleForThird,ggplot(),dir='behind')+xlab(NULL)+theme_bw()
g

For more details, see the original version of the battlemap chapter. For updates to the chapter, I recommend that you invest in a copy of the Wrangling F1 Data With R book if you haven’t already done so:-)

Connecting RStudio and MySQL Docker Containers – an example using the ergast db

Building on Dockerising Open Data Databases – First Fumblings and my Book Extras – Data Files, Code Files and a Dockerised Application, I just figured out how to get the ergast db into a MySQL docker container and then query it from RStudio:

• install these docker-mysql-scripts
• run boot2docker
• from the boot2docker shell, start up a MySQL server (ergastdb) with password f1: dmysql-server ergastdb f1 (by default, this exposes port 3306)
• create a new empty database (f1db): dmysql-create-database ergastdb f1db
• add the ergast data to it: dmysql-import-database ergastdb /path/to/ergastdb/f1db.sql --database f1db
• fire up a copy of RStudio, in this case using my psychemedia/wranglingf1data container, linking it to the MySQL database which has the alias db: docker run --name f1djd -p 8788:8787 --link ergastdb:db -d psychemedia/wranglingf1data
• in RStudio, import the RMySQL library: library(RMySQL)
• in RStudio, connect to the database: con=dbConnect(MySQL(),user='root',password='f1',host='db',port=3306,dbname='f1db')
• in RStudio, run a test query: dbGetQuery(con,'SHOW TABLES');

I guess what I need to do now is pull the various bits into another script to make it a one-liner, perhaps with a few switches? For example, to create the database if it doesn’t exist, to download the ergast database file automatically, to populate the database for the first time, or update it with a more recent copy of the database, to fire up both containers and make sure they are appropriately linked etc. This would dramatically simplify things for use in the context of the Wrangling F1 Data With R book, for example. (If you beat me to it, please post details in the comments below.)

PS Hmm…. seems I get a UTF-8 encoding issue:

Not sure if this is with the database, or the RMySQL connector? Anyone got any ideas of a fix?

Ah ha – sort of via SO:

Running dbGetQuery(con,'SET NAMES utf8;') before querying seems to do the trick…

Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

$C_t = \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}$

where $C_t$ is the churn in team standings for year $t$, $\left|f_{i,t} - f_{i,t-1}\right|$ is the absolute value of the $i$-th team’s change in finishing position going from season $t-1$ to season $t$, and $N$ is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, $C_t$, by the maximum churn, $C_{max}$. The value of the maximum churn depends on whether there is an even or odd number of competitors:

$C_{max} = N/2 \text{, for even N}$

$C_{max} = (N^2 - 1) / 2N \text{, for odd N}$
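
As a quick check on these definitions, take a hypothetical league of $N = 4$ teams in which the teams that finished 1st, 2nd, 3rd and 4th in season $t-1$ finish 2nd, 1st, 4th and 3rd in season $t$ (the numbers are made up for illustration):

$C_t = \frac{|2-1| + |1-2| + |4-3| + |3-4|}{4} = \frac{4}{4} = 1$

$C_{max} = 4/2 = 2 \text{, so the adjusted churn is } C_t / C_{max} = 1/2 = 0.5$

The maximum corresponds to a complete reversal of the standings (1st to 4th, 2nd to 3rd, and so on), which for four teams gives $(3 + 1 + 1 + 3)/4 = 2$. For an odd number of teams, say $N = 5$, a complete reversal gives $(4 + 2 + 0 + 2 + 4)/5 = 2.4$, which agrees with $(N^2 - 1)/2N = 24/10$.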

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, $f_{i,t}$ is the position of driver $i$ at the end of race $t$, $f_{i,t-1}$ is the starting position of driver $i$ at the beginning of that race (that is, race $t$) and $N$ is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn functions of the form:

is.even = function(x) x %% 2 == 0
churnmax=function(N)
if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

and then used it to explore churn in a variety of contexts:

• comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
• churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
• churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid,
position, positionText, positionOrder
FROM results rs JOIN drivers d JOIN races r
ON rs.driverId=d.driverId AND rs.raceId=r.raceId
WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
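#Churn and adjusted churn for each race, using the functions defined above
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
churn = churn(delta),
adjchurn = adjchurn(delta)
)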


For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

Book Extras – Data Files, Code Files and a Dockerised Application

Idling through the LeanPub documentation last night, I noticed that they support the ability to sell digital extras, such as bundled code files or datafiles. Along with the base book sold at one price, additional extras can be bundled into packages alongside the original book and sold at another (higher) price. As with the book sales, two price points are supported: the minimum price and a recommended price.

It was easy enough to create a bundle of sample code and data files to support the Wrangling F1 Data With R book and add them as an extras package bundled with the book for an extra dollar or so.

This approach makes it slightly easier to distribute file bundles to support a book, but it still requires a reader to do some work in configuring their own development environment.

In an attempt to make this slightly easier, I also started exploring ways of packaging and distributing a preconfigured virtual machine that contains the tools – as well as code and data examples – that are required in order to try out the data wrangling approaches described in the book. (I’m starting to see more and more technical books supported by virtual machines, and can imagine this approach becoming a standard way of augmenting practical texts.) In particular, I needed a virtual machine that could run RStudio and that would be preloaded with libraries that would support working with SQLite data files and generating ggplot2 charts.

The route I opted for was to try out a dockerised application. The rocker/hadleyverse Docker image bundles a wide variety of useful R packages into a container along with RStudio and a base R installation. Building on top of this image, I created a small Dockerfile that additionally loaded in the example code and data files from the book extras package – psychemedia/wranglingf1data.

# Wrangling F1 Data Dockerfile
#
# https://github.com/psychemedia/wranglingf1data-docker
#

# Pull RStudio base image with handy packages...
FROM rocker/hadleyverse

#Create a directory to create a new project from
RUN mkdir -p /home/rstudio/wranglingf1data
RUN chmod a+rw /home/rstudio/wranglingf1data

#Populate the project-directory-to-be with ergast API supporting code and data