One of the slides I whizzed by in my presentation at the OU Statistitcs conference on “Visualisation an Presentation in Statistics” (tweet-commentary notes/links from other presentations here and here) relates to what we might describe as the shape that data takes.
An easy way of thinking about the shape of a dataset is is to consider a simple 2D data table with columns describing the properties of an object and rows typically corresponding to individual things. Often a regular structure, each cell in the table may take on a valid value. Occasionally, some cells may be blank, in which case we can start to think of the shape of the data getting a little bit ragged.
If you are working with data table, then on occasion you may want to swap the rows for columns (certain data operations require data to be organised in a particular way). By swapping the rows and columns, we change the shape of the data (for example, going from a table with N rows and M columns to one with M columns and N rows). So that’s one way of reshaping your data.
Many visualisation tools require data to be in a particular shape in order for the data to be visualised appropriately. If you look at the template pages on Number Picture, a new site hosting templated visualiastions built using Processing that allow you to cut, paste and visualise data – if it is is appropriately shaped – at a click.
But where do pivot tables come in? One way is to think of them as a tool for reshaping your data by providing summary reports of your original data set.
Here’s how the Goog describes them:
What pivot tables allow you to do is generate reports based on contents of a table using the values contained within one or more columns to define the columns and rows of a summary table. That is, you can re-present (or re-shape) a table as a new table that summarises data contained in the original table in terms of a rearrangement of the cell values of the original table.
Here’s a quick example. I have a data set that identifies the laptimes of drivers in an F1 race (yes, I know… again!;-) by stint, where different stints are groups of consecutive laps separated by pit stops.
If you look down the stint column you can see how its value groups together blocks of rows. But how do I easily show how much time each driver (column C) spent on each stint? The time the driver spent on each stint is the sum of laptimes by driver within a stint, so for each driver I need to find out the laps associated with each stint, and then sum them. Pivot tables let me do that. Here’s how:
So how does this work? The columns in the new table are defined according to the unique values found in the stint column of the original table. The rows in the new table are defined according to the unique values found in the car column of the original table. The cell values in the new table for a given row and column are defined as the sum of lapTime data values from rows in the original table where the car and stint values in the row correspond to the row and column values corresponding to each cell in the new table. Got that?
If you’re familiar with databases, you might think of the column and row settings in the new table defining “keys” into rows on the original table. The different car/stint pairs identify different blocks of rows that are processed per block to create the contents of the pivot table.
As well as summing the values from one column based on the values contained in two other columns, the pivot table can do other operations, such as counting the number of rows in the original table containing each “key” value. So for example, if we want to count the number of laps a car was out for by stint, we can do that simply by changing out pivot table Values setting.
Pivot tables can take a bit of time to get your head round… I find playing with them helps… A key thing to remember is: if you want to express a dataset in terms of the unique values contained within a column, the pivot table can help you do that. In the above example, I was essentially generating the row and column values for a new table based on categorical data (driver/car number and stint number). Another example might be sales data where the same categories of item appear across multiple rows and you want to generate reports based on category.