Fumblings with Ranked Likert Scale Data in R

The code is horrible and the visualisations quite possibly misleading, but I’m dead tired and there are a couple of tricks in the following R code that I want to remember, so here’s a contrived bit of fumbling with some data of the form:

enjoyCompany tooMuchFamily
1 strongly agree strongly disagree
2 strongly agree strongly disagree
3 neither agree nor disagree strongly disagree

That is, N rows, no identifiers, and two columns; each column relates to a questionnaire question with a scaled response enumerated as 'strongly agree', 'agree ', 'neither agree nor disagree', 'disagree', 'strongly disagree'.

The first thing I tried to do was some “traditional” Likert scale style stacked bar charts using ggplot2 (surely there must be a Likert scale visualisation library around? If so, how would it work with data in the above (and below) forms? Answers via the comments please…)

library(reshape2)
library(ggplot2)
#My sample data doesn't have row based identifiers, so here's a hacked incremental index based ID (assuming the raw two-column responses are in a data frame fdf)
ff <- fdf
ff$b <- 1:nrow(ff)
#melt the data into a dataframe with 3 cols: the id col, /b/; a /variable/ column that contains the original column heading; and a /value/ column that contains the original cell value for the corresponding row and column.
ff <- melt(ff, id.vars='b')
#Get rid of blank values
ff <- subset(ff, value != '')
#Get rid of unused levels
ff$value <- droplevels(ff$value)
#Reorder the levels into a meaningful order
ff$value <- factor(ff$value, levels=rev(c('strongly agree','agree ','neither agree nor disagree','disagree','strongly disagree')))
ggplot(ff) + geom_bar(aes(variable, fill=value)) + coord_flip()

A couple of notable issues with the resulting diagram:

– the colours aren’t that pleasing to look at;
– we have lost all sense of correlation between values. We may like to think that the agree/strongly agree ratings from one question are correlated with the disagree/strongly disagree responses from the other, but there is nothing in that chart that says this for sure…

However, a pairwise comparison may help…

#Let's count how many times the different scale values occur with each other, and then plot some sort of correlation plot.
fs=subset(fs,enjoyCompany!='' & tooMuchFamily!='')
fs$enjoyCompany <- factor(fs$enjoyCompany, levels =rev(c('strongly agree','agree ','neither agree nor disagree','disagree','strongly disagree')))
fs$tooMuchFamily <- factor(fs$tooMuchFamily, levels =rev(c('strongly agree','agree ','neither agree nor disagree','disagree','strongly disagree')))
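The counting and plotting step itself didn’t make it into the snippet above, so here’s one way it might be finished off: a sketch using base R’s table() to count co-occurrences and ggplot2’s geom_tile() for the heatmap-style view. The little fs data frame below is a toy stand-in for the cleaned data (the real data isn’t reproduced here).

```r
library(ggplot2)

# Toy stand-in for the cleaned fs data frame built above (same column names)
fs <- data.frame(
  enjoyCompany = factor(c('strongly agree', 'strongly agree', 'agree', 'disagree')),
  tooMuchFamily = factor(c('strongly disagree', 'disagree', 'disagree', 'agree'))
)

# Count how many times each pair of scale values co-occurs across the two questions
cc <- as.data.frame(table(fs$enjoyCompany, fs$tooMuchFamily))
names(cc) <- c('enjoyCompany', 'tooMuchFamily', 'count')

# Plot the co-occurrence counts as a heatmap-style tile chart
ggplot(cc) + geom_tile(aes(enjoyCompany, tooMuchFamily, fill = count))
```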

If I had rather more than two question columns, how would I generate a lattice of pairwise correlation charts to get a visual overview of how all the question answers interact at the pairwise level?
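One possible answer to my own question, for the record: combn() can enumerate the column pairs, and facet_wrap() can lay the per-pair tile charts out as a lattice. The qs data frame and its q1..q3 column names below are hypothetical, just to make the sketch runnable.

```r
library(ggplot2)

# Hypothetical wide data frame with three question columns, q1..q3
lvls <- c('agree', 'disagree')
qs <- data.frame(
  q1 = factor(c('agree', 'agree', 'disagree', 'agree'), levels = lvls),
  q2 = factor(c('disagree', 'agree', 'disagree', 'disagree'), levels = lvls),
  q3 = factor(c('agree', 'disagree', 'agree', 'agree'), levels = lvls)
)

# Co-occurrence counts for every pair of question columns
pairCounts <- do.call(rbind, combn(names(qs), 2, function(p) {
  tt <- as.data.frame(table(qs[[p[1]]], qs[[p[2]]]))
  names(tt) <- c('x', 'y', 'count')
  tt$pair <- paste(p[1], 'vs', p[2])
  tt
}, simplify = FALSE))

# One tile chart per question pair, laid out as a lattice
ggplot(pairCounts) + geom_tile(aes(x, y, fill = count)) + facet_wrap(~pair)
```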

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

9 thoughts on “Fumblings with Ranked Likert Scale Data in R”

  1. Hi Tony,
    I decided to follow up on this post with one of my own. I didn’t get up to the pairwise correlation charts, but that’s mostly because my current work isn’t using any.

    So check out this post and let me know if it helps at all:

    I’ve had to do my own hacked IDs before, but I find it’s better to actually produce proportions by response before preparing the graphic. You just add stat=’identity’. I find the code more readable, and then it’s easier to produce a table to support the graphic later because I am already storing the summarized data.
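A sketch of the approach described here, as I read it (the toy long-format data below is hypothetical, standing in for the melted ff data frame): summarise to proportions per question first, then pass the precomputed values to geom_bar() with stat='identity'.

```r
library(ggplot2)

# Toy long-format data of the kind the melt step in the post produces
ff <- data.frame(
  variable = rep(c('enjoyCompany', 'tooMuchFamily'), each = 4),
  value = c('agree', 'agree', 'agree', 'disagree',
            'disagree', 'disagree', 'disagree', 'agree')
)

# Summarise to proportions by response within each question first...
props <- as.data.frame(prop.table(table(ff$variable, ff$value), margin = 1))
names(props) <- c('variable', 'value', 'prop')

# ...then plot the precomputed values directly with stat='identity'
ggplot(props) +
  geom_bar(aes(variable, prop, fill = value), stat = 'identity') +
  coord_flip()
```

The props table is then also available to support the graphic as a summary table later.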

    1. Hi Jason – thanks for the comment, and sharing the post. I’ll give the code a go as soon as I get a chance:-)

      I came across the net stacked chart a couple of weeks ago and thought it looked really promising, though didn’t have an immediate need to try it out until yesterday. (I did think the x-axis range in the example on http://www.organizationview.com/net-stacked-distribution-a-better-way-to-visualize-likert-data should have gone +/- 50 though? Hmm, +/-100 would be the general case, wouldn’t it? I guess if you also did +/- sample size, you could get an indication of how many subjects in the sample actually responded in a definite way, which may not be much use from a stats point of view but may be useful from a macroscopic overview of the data perspective. This then raises the question: in the net stacked chart using proportions, what is actually not shown? Folk who responded ambivalently, or folk who responded ambivalently *or* didn’t answer the question? Typically, I guess it’s the former, although it may depend on the questionnaire protocol/default settings?!)
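As a note-to-self, a rough sketch of the net stacked idea as I understand it (the per-question proportions below are made-up, with neutral responses already dropped): plot agreement to one side of zero and disagreement to the other.

```r
library(ggplot2)

# Hypothetical per-question proportions, neutral responses already removed
net <- data.frame(
  question = rep(c('enjoyCompany', 'tooMuchFamily'), each = 4),
  value = rep(c('strongly agree', 'agree', 'disagree', 'strongly disagree'), 2),
  prop = c(0.4, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.4)
)

# Agreement runs right of zero, disagreement left
net$signed <- ifelse(net$value %in% c('agree', 'strongly agree'), net$prop, -net$prop)

ggplot(net) +
  geom_bar(aes(question, signed, fill = value), stat = 'identity') +
  coord_flip() +
  ylim(-1, 1)  # the +/-100% "general case" scale
```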

    2. I prefer the proportion because neutral replies are being removed. I don’t want to confuse the number of respondents to a question with the number of neutral responses. Different response rates by question may be interpreted as different levels of indifference.

      In my code, I remove all NAs, so I’m looking purely at the proportion who responded a particular way of those who responded. This may not be right for the general case, but for the data I am working with right now I think it makes the most sense. We did not have the questions randomized (I’m very unhappy with that), and therefore, most lack of response is attrition, not some indicator of ambivalence or discomfort with the questions.

  2. How is your correlation plot different from a contingency table? In fact, an actual contingency table with some heatmap overlay might make more sense.

  3. Hi Tony,

    As an RSS subscriber I was going to comment on this earlier, but work took over, as is too often the case.

    Thanks for your comments on my ‘net stacked distribution’ post. I totally agree with the points you made regarding the scale; it should be symmetrical. Should it be +/- 100%? I’m not so sure about that. I’m currently finding the maximum value, either positive or negative, and scaling to that. The scale certainly should be consistent across graphics.

    The key for this, as with any visualisation, is to understand what people are looking for in the data, and what they are likely to do after reading the visualisation. We use usability testing to help with this. The two biggest uses are (a) exploring the data (b) using it for driving decision making. Typically users fit into one or other of the categories at any time and this should drive how you present the data.

    As for the neutral / NA responses I think it depends on what you’re trying to show. This technique is most useful for group (b) and therefore understanding the user needs helps to drive thinking. I find showing them as numbers and offering sorting based on that works well.

    The contingency table / heat map is also something I use extensively in the exploration phase. They’re easy enough to build in Tableau if you have a small number of questions but get cumbersome as questions increase. I’ll have to have a go with R, one of several summer projects I suspect.

    1. Hi Andrew – Thanks for the comments too:-)

      Re: the scale – re: the +/-100%, I was trying to think of the most general/pathological case, whilst appreciating that the visual aesthetic would probably not be too good for most actual cases! WRT understanding the intent of either the analyst or the decisionmaker, I quite agree that the context of creation or use/interpretation of the visualisation is likely to heavily influence the utility of any particular view. So for example, when trying to get a feel for a new data set, I like to try to find “macroscopic” views over the data as a form of orientation.

      Re: showing/not showing NAs depending on what you’re trying to show – I think the point I was trying to make was exactly that:-) That is, I was just idly enumerating some of the different ways that the graphic could be constructed, as a note-to-self as much as anything…;-)

      Re: contingency tables/heat maps for exploration – I’m on a Mac, so don’t benefit from Tableau;-) But I think that I probably should make more use of that sort of view… I’m sure Stack Overflow will point me in the right direction for some R/ggplot code to achieve it;-)

      Finally, as to R: I need to blog some more of my thinking around why I find R so attractive to see if it resonates with others. Several key elements come to mind: 1) the literal aspect (being able to script the steps taken in an analysis or chart construction); 2) the power (eg in terms of shaping and representing data); 3) its usefulness as glue (eg the ability to import data from a variety of sources); 4) workflow (in several respects; eg in RStudio, integration with git for version control, Sweave for PDF generation, RMarkdown/knitr HTML generation (and also 1-click publishing to sites like rpubs.org)); 5) the ability to operate in the context of a browser consumed service, eg online hosted versions of RStudio, or Jeroen Ooms’ browser fronted UI to ggplot2; 6) support for “easy” creation of (statistical) charts using libraries such as ggplot2 and googleVis. And that’s aside from its usefulness as a stats environment…;-)

  4. I fully agree with you regarding your thoughts on R. For the heat maps I’m using R to create the tables before blending in to Tableau to visualise.

    As far as Tableau is concerned, I am also on a Mac but run it in Parallels. There are a lot of us Mac-based Tableau users and we constantly nag about getting a native Mac version. I think it might happen but not in the immediate future. In the meantime running it in Parallels works, but is not ideal.

    Tableau’s main strengths are (a) to explore data quickly (b) prototype displays that you might then develop in R (or something like D3).

Comments are closed.
