OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Rstats’ Category

Information Density and Custom Chart Designs


I’ve been doodling today with some charts for the Wrangling F1 Data With R living book, trying to see how much information I can pack into a single chart.

The initial impetus came simply from thinking about a count of laps led in a particular race by each driver; this morphed into charting the number of laps in each position for each driver, and then onto a more comprehensive race summary chart (see More Shiny Goodness – Tinkering With the Ergast Motor Racing Data API for an earlier graphical attempt at producing a race summary chart).

[Image: lapPosChart – race summary chart]

The chart shows:

- grid position: identified using an empty grey square;
- race position after the first lap: identified using an empty grey circle;
- race position on each driver’s last lap: y-value (position) of corresponding pink circle;
- points cutoff line: a faint grey dotted line to show which positions are inside – or out of – the points;
- number of laps completed by each driver: size of pink circle;
- total laps completed by driver: greyed annotation at the bottom of the chart;
- whether a driver was classified or not: the total lap count is displayed using a bold font for classified drivers, and in italics for unclassified drivers;
- finishing status of each driver: classification statuses other than *Finished* are also recorded at the bottom of the chart.

The chart also shows drivers who started the race but did not complete the first lap.

What the chart doesn’t show is at what stage of the race each driver held each position, and for how long. But I have an idea for another chart that could help there, as well as being able to reuse elements of the chart shown here.

FWIW, the following fragment of R code shows the ggplot function used to create the chart. The data came from the ergast API, though it did require a bit of wrangling to get it into a shape that I could use to power the chart.

library(ggplot2)

#Reorder the drivers according to a final ranked position
g=ggplot(finalPos,aes(x=reorder(driverRef,finalPos)))
#Highlight the points cutoff
g=g+geom_hline(yintercept=10.5,colour='lightgrey',linetype='dotted')
#Highlight the position each driver was in on their final lap
g=g+geom_point(aes(y=position,size=lap),colour='red',alpha=0.15)
#Highlight the grid position of each driver
g=g+geom_point(aes(y=grid),shape=0,size=7,alpha=0.2)
#Highlight the position of each driver at the end of the first lap
g=g+geom_point(aes(y=lap1pos),shape=1,size=7,alpha=0.2)
#Provide a count of how many laps each driver held each position for
g=g+geom_text(data=posCounts,
              aes(x=driverRef,y=position,label=poscount,alpha=alpha(poscount)),
              size=4)
#Number of laps completed by driver
g=g+geom_text(aes(x=driverRef,y=-1,label=lap,fontface=ifelse(is.na(classification), 'italic' , 'bold')),size=3,colour='grey')
#Record the status of each driver
g=g+geom_text(aes(x=driverRef,y=-2,label=ifelse(status!='Finished', status,'')),size=2,angle=30,colour='grey')
#Styling - tidy the chart by removing the transparency legend
g+theme_bw()+xRotn()+xlab(NULL)+ylab("Race Position")+guides(alpha=FALSE)

The fully worked code can be found in a forthcoming update to the Wrangling F1 Data With R living book.

Written by Tony Hirst

November 21, 2014 at 6:21 pm

Posted in Rstats


F1 Championship Race, 2014 – Winning Combinations…


As we come up to the final two races of the 2014 Formula One season, the double points mechanism for the final race means that two drivers are still in with a shot at the Drivers’ Championship: Lewis Hamilton and Nico Rosberg.

As James Allen describes in Hamilton closes in on world title: maths favour him but Abu Dhabi threat remains:

Hamilton needs 51 points in the remaining races to be champion if Rosberg wins both races. Hamilton can afford to finish second in Brazil and at the double points finale in Abu Dhabi and still be champion. Mathematically he could also finish third in Brazil and second in the finale and take it on win countback, as Rosberg would have just six wins to Hamilton’s ten.
If Hamilton leads Rosberg home again in a 1-2 in Brazil, then he will go to Abu Dhabi needing to finish fifth or higher to be champion (echoes of Brazil 2008!!). If Rosberg does not finish in Brazil and Hamilton wins the race, then Rosberg would need to win Abu Dhabi with Hamilton not finishing; no other scenario would give Rosberg the title.

A couple of years ago, I developed an interactive R/shiny app for exploring finishing combinations of two drivers in the last two races of a season to see what situations led to what result: Interactive Scenarios With Shiny – The Race to the F1 2012 Drivers’ Championship.

[Image: f12014champshiny – screenshot of the championship scenarios Shiny app]

I’ve updated the app (taking into account the matter of double points in the final race) so you can check out James Allen’s calculations with it (assuming I got my sums right too!). I tried to put an interactive version up on Shinyapps, but the Shinyapps publication mechanism seems to be broken (for me at least) at the moment…:-(

In the meantime, if you have RStudio installed, you can run the application yourself. The code is available and can be run from RStudio using the shiny package’s runGist() function: runGist("81380ff09ebe1cd67005")

When I get a chance, I’ll weave elements of this recipe into the Wrangling F1 Data With R book.

PS I’ve also started using the F1dataJunkie blog again as a place to post drafts and snippets of elements I’m working on for that book…

Written by Tony Hirst

November 8, 2014 at 2:07 pm

Posted in Rstats


Wrangling F1 Data With R – F1DataJunkie Book


Earlier this year I started trying to pull some of my #f1datajunkie R-related ramblings together in book form. The project stalled, but to try to reboot it I’ve started publishing it as a living book over on Leanpub. Several of the chapters are incomplete, with TO DO items sketched in; others are still unpublished. The beauty of the Leanpub model is that if you buy a copy, you continue to get access to all future updated versions of the book. (And my idea is that by getting the book out there as it is, I’ll feel as if there’s more (social) pressure on actually trying to keep up with it…)

I’ll be posting more details about how the Leanpub process works (for me at least) in the next week or two, but for now, here’s a link to the book: Wrangling F1 Data With R: A Data Junkie’s Guide.

Here’s the table of contents so far:

  • Foreword
    • A Note on the Data Sources
  • Introduction
    • Preamble
    • What are we trying to do with the data?
    • Choosing the tools
    • The Data Sources
    • Getting the Data into RStudio
    • Example F1 Stats Sites
    • How to Use This Book
    • The Rest of This Book…
  • An Introduction to RStudio and R dataframes
    • Getting Started with RStudio
    • Getting Started with R
    • Summary
  • Getting the data from the Ergast Motor Racing Database API
    • Accessing Data from the ergast API
    • Summary
  • Getting the data from the Ergast Motor Racing Database Download
    • Accessing SQLite from R
    • Asking Questions of the ergast Data
    • Summary
    • Exercises and TO DO
  • Data Scraped from the F1 Website
    • Problems with the Formula One Data
    • How to use the FormulaOne.com alongside the ergast data
  • Reviewing the Practice Sessions
    • The Weekend Starts Here
    • Practice Session Data from the FIA
    • Sector Times
    • FIA Media Centre Timing Sheets
  • A Quick Look at Qualifying
    • Qualifying Session Position Summary Chart
    • Another Look at the Session Tables
    • Ultimate Lap Positions
  • Lapcharts
    • Annotated Lapcharts
  • Race History Charts
    • The Simple Laptime Chart
    • Accumulated Laptimes
    • Gap to Leader Charts
    • The Lapalyzer Session Gap
    • Eventually: The Race History Chart
  • Pit Stop Analysis
    • Pit Stop Data
    • Total pit time per race
    • Pit Stops Over Time
    • Estimating pit loss time
    • Tyre Change Data
  • Career Trajectory
    • The Effect of Age on Performance
    • Statistical Models of Career Trajectories
    • The Age-Productivity Gradient
    • Summary
  • Streakiness
    • Spotting Runs
    • Generating Streak Reports
    • Streak Maps
    • Team Streaks
    • Time to N’th Win
    • TO DO
    • Summary
  • Conclusion
  • Appendix One – Scraping formula1.com Timing Data
  • Appendix Two – FIA Timing Sheets
    • Downloading the FIA timing sheets for a particular race
  • Appendix – Converting the ergast Database to SQLite

If you think you deserve a free copy, let me know… ;-)

Written by Tony Hirst

October 31, 2014 at 12:04 am

Posted in Rstats


Running “Native” Data Wrangling Applications in the Browser – IPython Notebooks (and R?) in Chrome


Using browser-based data analysis toolkits such as pandas in IPython notebooks, or R in RStudio, means you need to have access to Python or R and the corresponding application server, either on your own computer or running on a remote server that you have access to.

When running occasional training sessions or workshops, this can cause several headaches. Either a remote service needs to be set up that is capable of supporting the expected number of participants: security may need putting in place, accounts configured (or account management tools supported), network connections need guaranteeing so that participants can access the server, and so on. Or participants need to install software on their own computers: ideally this would be done in advance of a training session, otherwise training time is spent installing, configuring and debugging software installs; some computers may have security policies that prevent users installing software, or require an IT person with admin privileges to install it; and so on.

That’s why the coLaboratory Chrome extension looks like an interesting innovation – it runs an IPython notebook fork, with pandas and matplotlib as a Chrome Native Client application. I posted a quick walkthrough of the extension over on the School of Data blog: Working With Data in the Browser Using python – coLaboratory.

Via a Twitter exchange with @nativeclient, it seems that there’s also the possibility that R could run as a dependency-free Chrome extension. Native Client seems to like things written in C/C++, which underpins R, although I think R also has some Fortran dependencies. (One of the coLaboratory talks mentioned the to-do list item of getting scipy (I think? – or whatever the package was) running in the coLaboratory extension, the major challenge there being the Fortran src; so there may be synergies in working on the Fortran components there?)

Within a couple of hours of the Twitter exchange starting, Brad Nelson/@flagxor posted a first attempt at an R port to the Native Client. I don’t pretend to understand what’s involved in moving from this to an extension with some sort of usable UI, even if only a command line, but it represents an interesting possibility: being able to run R in the browser (or at least, in Chrome). Package availability would of course be limited to packages compiled to run using PNaCl.

For training events, there is still the requirement that users install a Chrome browser on their computer and then install the extension into that. However, I think it is possible to run Chrome as a portable app – that is, from a flash drive such as a USB memory stick: Google Chrome Portable (Windows).

I’m not sure how fast it would be able to run, but it suggests there may be a way of carrying a portable, dependency free pandas environment around that you can run on a Windows computer from a USB key?! And maybe R too…?

Written by Tony Hirst

August 22, 2014 at 9:42 am

Opening Up Access to Data: Why APIs May Not Be Enough…

Last week, a post on the ONS (Office for National Statistics) Digital Publishing blog caught my eye: Introducing the New Improved ONS API, which apparently “mak[es] things much easier to work with”.

Ooh… exciting…. maybe I can use this to start hacking together some notebooks?:-)

It was followed a few days later by this one – ONS-API, Just the Numbers which described “a simple bit of code for requesting some data and then turning that into ‘just the raw numbers’” – a blog post that describes how to get a simple statistic, as a number, from the API. The API that “mak[es] things much easier to work with”.

After a few hours spent hacking away over the weekend, looking round various bits of the API, I still wasn’t really in a position to discover where to find the numbers, let alone get numbers out of the API in a reliable way. (You can see my fumblings here.) Note that I’m happy to be told I’m going about this completely the wrong way and didn’t find the baby steps guide I need to help me use it properly.

So FWIW, here are some reflections on the whole API thing from the perspective of someone who couldn’t get it together enough to get the thing working…


Most data users aren’t programmers. And I’m not sure how many programmers are data junkies, let alone statisticians and data analysts.

For data users who do dabble with programming – in R, for example, or python (for example, using the pandas library) – the offer of an API is often seen as providing a way of interrogating a data source and getting the bits of data you want. The alternative to this is often having to download a huge great dataset yourself and then querying it or partitioning it yourself to get just the data elements you want to make use of (for example, Working With Large Text Files – Finding UK Companies by Postcode or Business Area).

That’s fine, insofar as it goes, but it starts to give the person who wants to do some data analysis a data management problem too. And for data users who aren’t happy working with gigabyte data files, it can sometimes be a blocker. (Big file downloads also take time, and incur bandwidth costs.)

For me, a stereotypical data user might be someone who typically wants to be able to quickly and easily get just the data they want from the API into a data representation that is native to the environment they are working in, and that they are comfortable working with.

This might be a spreadsheet user or it might be a code (R, pandas etc) user.

In the same way that spreadsheet users want files in XLS or CSV format that they can easily open (formats that can also be directly opened into appropriate data structures in R or pandas), I increasingly look not for APIs, but for API wrappers that bring API calls and the results from them directly into the environment I’m working in, in a form appropriate to that environment.

So for example, in R, I make use of the FAOstat package, which also offers an interface to the World Bank Indicators datasets. In pandas, a remote data access handler for the World Bank Indicators portal allows me to make simple requests for that data.
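
For example, here’s a minimal sketch of the kind of call I mean, using the World Bank indicators reader on the pandas side (in recent versions of pandas this lives in the separate pandas_datareader package; the indicator code and country list are just examples):

from pandas_datareader import wb

#World Bank indicator data pulled straight into a pandas dataframe
#(GDP, current US$, for a couple of example countries)
gdp = wb.download(indicator='NY.GDP.MKTP.CD',
                  country=['GB', 'FR'],
                  start=2005, end=2013)
gdp.head()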

At a level up (or should that be “down”?) from the API wrapper are libraries that parse typical response formats. For example, Statistics Norway seem to publish data using the json-stat format, the format used in the new ONS API update. This IPython notebook shows how to use the pyjstat python package to parse the json-stat data directly into a pandas dataframe (I couldn’t get it to work with the ONS data feed – not sure if the problem was me, the package, or the data feed; which is another problem – working out where the problem is…). For parsing data returned from SPARQL Linked Data endpoints, packages such as SPARQLwrapper get the data into Python dicts, if not pandas dataframes directly. (A SPARQL i/o wrapper for pandas could be quite handy?)
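
As a rough sketch of that json-stat route (the pyjstat API has changed across versions; this follows the older from_json_stat() style, and the URL is just a placeholder):

import json
import requests
from pyjstat import pyjstat

#Placeholder URL standing in for a json-stat endpoint
url = 'http://example.com/some/json-stat/endpoint'
data = json.loads(requests.get(url).text)

#from_json_stat() returns a list of pandas dataframes, one per dataset in the response
dfs = pyjstat.from_json_stat(data)
dfs[0].head()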

At the user level, IPython Notebooks (my current ‘can be used to solve all known problems’ piece of magic tech!;-) provide a great way of demonstrating not just how to get started with an API, but also encourage the development within the notebook of reusable components, as well as demonstrations of how to use the data. The latter demonstrations have the benefit of requiring that the API demo does actually get the data into a form that is useable within the environment. It also helps folk see what it means to be able to get data into the environment (it means you can do things like the things done in the demo…; and if you can do that, then you can probably also do other related things…)

So am I happy when I see APIs announced? Yes and no… I’m more interested in having API wrappers available within my data wrangling environment. If that’s a fully blown wrapper, great. If that sort of wrapper isn’t available, but I can use a standard data feed parsing library to parse results pulled from easily generated RESTful URLs, I can just about work out how to create the URLs, so that’s not too bad either.

When publishing APIs, it’s worth considering who can address them and use them. Just because you publish a data API doesn’t mean a data analyst can necessarily use the data, because they may not be (are likely not to be) a programmer. And if ten, or a hundred, or a thousand potential data users all have to implement the same sort of glue code to get the data from the API into the same sort of analysis environment, that’s not necessarily efficient either. (Data users may feel they can hack some code to get the data from the API into the environment for their particular use case, but may not be willing to release it as a general, tested and robust API wrapper, certainly not a stable production level one.)

This isn’t meant to be a slight against the ONS API, more a reflection on some of the things I was thinking as I hacked my weekend away…

PS I don’t know how easy it is to run Python code in R, but the R magic in IPython notebooks supports the running of R code within a notebook running a Python kernel, with the handing over of data from R dataframes to pandas dataframes. Which is to say, if there’s an R package available, for someone who can run R via an IPython context, it’s available via python too.
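
A minimal sketch of what that looks like in a notebook running a Python kernel (assuming the rpy2 package is installed); the dataframe names are just examples:

#In one notebook cell, load the R magic extension
%load_ext rpy2.ipython

#In a later cell, -i pushes a pandas dataframe into R, -o pulls an R dataframe back out
%%R -i df -o summary_df
summary_df <- as.data.frame(summary(df))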

PPS I notice that from some of the ONS API calls we can get links to URLs of downloadable datasets (though when I tried some of them, I got errors trying to unzip the results). This provides an intermediate way of providing API access to a dataset – search based API calls that allow discovery of a dataset, then the download and automatic unpacking of that dataset into a native data representation, such as one or more data frames.
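
Something like the following sketch, for example, where the URL is a placeholder standing in for a dataset link returned by one of the search calls:

import io
import zipfile
import requests
import pandas as pd

#Placeholder URL for a downloadable dataset discovered via the API
url = 'http://example.com/path/to/dataset.zip'
r = requests.get(url)

#Unpack the zip in memory and read each CSV file it contains into a dataframe
zf = zipfile.ZipFile(io.BytesIO(r.content))
frames = {name: pd.read_csv(zf.open(name))
          for name in zf.namelist() if name.endswith('.csv')}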

Written by Tony Hirst

August 11, 2014 at 2:04 pm

Posted in Data, Rstats


F1 Doing the Data Visualisation Competition Thing With Tata?

Sort of via @jottevanger, it seems that Tata Communications has announced the first challenge in the F1® Connectivity Innovation Prize, to extract and present new information from Formula One Management’s live data feeds. (The F1 site has a post, Tata launches F1® Connectivity Innovation Prize, dated “10 Jun 2014” – what’s that about then?)

Tata Communications are the folk who supply connectivity to F1, so this could be a good call from them. It’ll be interesting to see how much attention – and interest – it gets.

The competition site can be found here: The F1 Connectivity Innovation Prize.

The first challenge is framed as follows:

The Formula One Management Data Screen Challenge is to propose what new and insightful information can be derived from the sample data set provided and, as a second element to the challenge, show how this insight can be delivered visually to add suspense and excitement to the audience experience.

The sample dataset provided by Formula One Management includes Practice 1, Qualifying and race data, and contains the following elements:

- Position
- Car number
- Driver’s name
- Fastest lap time
- Gap to the leader’s fastest lap time
- Sector 1 time for the current lap
- Sector 2 time for the current lap
- Sector 3 time for the current lap
- Number of laps

If you aren’t familiar with motorsport timing screens, they typically look like this…

[Image: example timing screen, from The F1 Connectivity Innovation Prize Challenge 1 brief (PDF)]

A technical manual is also provided to help make sense of the data files.

[Image: Basic Timing Data Protocol Overview (PDF, page 1 of 15)]

Here are fragments from the data files – one for practice, one for qualifying and one for the race.

First up, practice:

...
<transaction identifier="101" messagecount="10640" timestamp="10:53:14.159"><data column="2" row="15" colour="RED" value="14"/></transaction>
<transaction identifier="101" messagecount="10641" timestamp="10:53:14.162"><data column="3" row="15" colour="WHITE" value="F. ALONSO"/></transaction>
<transaction identifier="103" messagecount="10642" timestamp="10:53:14.169"><data column="9" row="2" colour="YELLOW" value="16"/></transaction>
<transaction identifier="101" messagecount="10643" timestamp="10:53:14.172"><data column="2" row="6" colour="WHITE" value="17"/></transaction>
<transaction identifier="102" messagecount="1102813" timestamp="10:53:14.642"><data column="2" row="1" colour="YELLOW" value="59:39" clock="true"/></transaction>
<transaction identifier="102" messagecount="1102823" timestamp="10:53:15.640"><data column="2" row="1" colour="YELLOW" value="59:38" clock="true"/></transaction>
...

Then qualifying:

...
<transaction identifier="102" messagecount="64968" timestamp="12:22:01.956"><data column="4" row="3" colour="WHITE" value="210"/></transaction>
<transaction identifier="102" messagecount="64971" timestamp="12:22:01.973"><data column="3" row="4" colour="WHITE" value="PER"/></transaction>
<transaction identifier="102" messagecount="64972" timestamp="12:22:01.973"><data column="4" row="4" colour="WHITE" value="176"/></transaction>
<transaction identifier="103" messagecount="876478" timestamp="12:22:02.909"><data column="2" row="1" colour="YELLOW" value="16:04" clock="true"/></transaction>
<transaction identifier="101" messagecount="64987" timestamp="12:22:03.731"><data column="2" row="1" colour="WHITE" value="21"/></transaction>
<transaction identifier="101" messagecount="64989" timestamp="12:22:03.731"><data column="3" row="1" colour="YELLOW" value="E. GUTIERREZ"/></transaction>
...

Then the race:

...
<transaction identifier="101" messagecount="121593" timestamp="14:57:10.878"><data column="23" row="1" colour="PURPLE" value="31.6"/></transaction>
<transaction identifier="103" messagecount="940109" timestamp="14:57:11.219"><data column="2" row="1" colour="YELLOW" value="1:41:13" clock="true"/></transaction>
<transaction identifier="101" messagecount="121600" timestamp="14:57:11.681"><data column="2" row="3" colour="WHITE" value="77"/></transaction>
<transaction identifier="101" messagecount="121601" timestamp="14:57:11.681"><data column="3" row="3" colour="WHITE" value="V. BOTTAS"/></transaction>
<transaction identifier="101" messagecount="121602" timestamp="14:57:11.681"><data column="4" row="3" colour="YELLOW" value="17.7"/></transaction>
<transaction identifier="101" messagecount="121603" timestamp="14:57:11.681"><data column="5" row="3" colour="YELLOW" value="14.6"/></transaction>
<transaction identifier="101" messagecount="121604" timestamp="14:57:11.681"><data column="6" row="3" colour="WHITE" value="1:33.201"/></transaction>
<transaction identifier="101" messagecount="121605" timestamp="14:57:11.686"><data column="9" row="3" colour="YELLOW" value="35.4"/></transaction>

...

We can parse the data files in Python using an approach something like the following:

from lxml import etree

#xml_doc is the path to one of the timing data files, e.g. 'data/F1 Practice.txt'
pl=[]
for xml in open(xml_doc, 'r'):
    pl.append(etree.fromstring(xml))

pl[100].attrib
#{'identifier': '101', 'timestamp': '10:49:56.085', 'messagecount': '9716'}

pl[100][0].attrib
#{'column': '3', 'colour': 'WHITE', 'value': 'J. BIANCHI', 'row': '12'}

A few things are worth mentioning about this format… Firstly, the identifier identifies the message type, rather than the message: each transaction message appears instead to be uniquely identified by the messagecount. The transactions each update the value of a single cell in the display screen, setting its value and colour. The cell is identified by its row and column co-ordinates. The timestamp also appears to group messages.

Secondly, within a session, several screen views are possible – essentially associated with data labelled with a particular identifier. This means the data feed is essentially powering several data structures.

Thirdly, each screen display is a snapshot of a datastructure at a particular point in time. There is no single record in the datafeed that gives a view over the whole results table. In fact, there is no single message that describes the state of a single row at a particular point in time. Instead, the datastructure is built up by a continual series of updates to individual cells. Transaction elements in the feed are cell based events not row based events.
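
One way to start working with that is simply to flatten the feed into a table of cell update events, one row per transaction message. Here’s a quick sketch (it assumes pl is the list of parsed elements from the fragment above):

import pandas as pd

events = []
for p in pl:
    #Skip any messages that don't carry a data payload
    if not len(p):
        continue
    rec = {'timestamp': p.attrib['timestamp'],
           'identifier': p.attrib['identifier'],
           'messagecount': p.attrib['messagecount']}
    #Not every data element carries every attribute, so use .get()
    rec.update({k: p[0].attrib.get(k) for k in ['row', 'column', 'value', 'colour']})
    events.append(rec)

df_events = pd.DataFrame(events)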

It’s not obvious how we can even make a row-based transaction update, though on occasion we may be able to group updates to several columns within a row by gathering together all the messages that occur at a particular timestamp and mention a particular row. For example, look at the example of the race timing data above for timestamp="14:57:11.681" and row="3". If we parsed each of these into separate dataframes, using the timestamp as the index, we could align the dataframes using the *pandas* DataFrame .align() method.
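
Rather than aligning separate per-column dataframes, one sketch of that bundling idea is to group the flat cell event table from the earlier fragment by timestamp and row, then unstack the column numbers – giving one record per (timestamp, row) containing whatever cells were updated at that instant:

#Bundle cell updates that share a timestamp and row into a single record
row_updates = (df_events[df_events['identifier'] == '101']
               .groupby(['timestamp', 'row', 'column'])['value']
               .first()
               .unstack('column'))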

[I think I’m thinking about this wrong: the updates to a row appear to come in column order, so if column 2 changes, the driver number, then changes to the rest of the row will follow. So if we keep track of a cursor for each row describing the last column updated, we should be able to track things like row changes, end of lap changes when sector times change and so on. Pitting may complicate matters, but at least I think I have an in now… Should have looked more closely the first time… Doh!]
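
Here’s a rough, untested sketch of that cursor idea, using the flat events list from earlier: keep track of the last column updated for each row, and treat a jump back to a lower column number as the start of a new update pass for that row:

#Track a per-row cursor over the column numbers to spot row update passes
cursors = {}    #row -> last column number seen
passes = []     #completed update passes, each a list of cell events
current = {}    #row -> cell events in the current pass

for e in events:
    if e['identifier'] != '101' or e['row'] is None or e['column'] is None:
        continue
    row, col = e['row'], int(e['column'])
    if row in cursors and col <= cursors[row]:
        #The column number has wrapped back round, so flush the previous pass
        passes.append(current.pop(row, []))
        current[row] = []
    current.setdefault(row, []).append(e)
    cursors[row] = col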

Note: I’m not sure that the timestamps are necessarily unique across rows, though I suspect that they are likely to be, which means it would be safer to align, or merge, on the basis of the timestamp and the row number. From inspection of the data, it looks as if it is possible for a couple of timestamps to differ slightly (by milliseconds) yet apply to the same row. I guess we would treat these as separate grouped elements? Depending on the time window within which all the changes to a row are likely to occur, we could perhaps round times as the basis of the join?

Even with such a bundling, we still don’t have a complete description of all the cells in a row. They need to have been set historically…

The following fragment is a first attempt at building up the timing screen data structure for the practice timing at a particular point of time. To find the state of the timing screen at a particular time, we’d have to start building it up from the start of time, and then stop it updating at the time we were interested in:

import datetime
import pandas as pd
from lxml import etree

#Hacky load and parse of each row in the datafile
pl=[]
for xml in open('data/F1 Practice.txt', 'r'):
    pl.append(etree.fromstring(xml))

#Dataframe for current state timing screen
df_practice_pos=pd.DataFrame(columns=[
    "timestamp", "time",
    "classpos",  "classpos_colour",
    "racingNumber","racingNumber_colour",
    "name","name_colour",
],index=range(50))

#Column mappings
practiceMap={
    '1':'classpos',
    '2':'racingNumber',
    '3':'name',
    '4':'laptime',
    '5':'gap',
    '6':'sector1',
    '7':'sector2',
    '8':'sector3',
    '9':'laps',
    '21':'sector1_best',
    '22':'sector2_best',
    '23':'sector3_best'
}

def parse_practice(p,df_practice_pos):
    if p.attrib['identifier']=='101' and 'sessionstate' not in p[0].attrib:
        if p[0].attrib['column'] not in ['10','21','22','23']:
            colname=practiceMap[p[0].attrib['column']]
            row=int(p[0].attrib['row'])-1
            df_practice_pos.ix[row]['timestamp']=p.attrib['timestamp']
            tt=p.attrib['timestamp'].replace('.',':').split(':')
            df_practice_pos.ix[row]['time'] = datetime.time(int(tt[0]),int(tt[1]),int(tt[2]),int(tt[3])*1000)
            df_practice_pos.ix[row][colname]=p[0].attrib['value']
            df_practice_pos.ix[row][colname+'_colour']=p[0].attrib['colour']
    return df_practice_pos

for p in pl[:2850]:
    df_practice_pos=parse_practice(p,df_practice_pos)
df_practice_pos

(See the notebook.)

Getting sensible data structures at the timing screen level looks like it could be problematic. But to what extent are the feed elements meaningful in and of themselves? Each element in the feed actually has a couple of semantically meaningful data points associated with it, as well as the timestamp: the classification position, which corresponds to the row; and the column designator.

That means we can start to explore simple charts that map driver number against race classification, for example, by grabbing the row (that is, the race classification position) and timestamp every time we see a particular driver number:

[Image: racedemo – example chart mapping race classification position against time for particular driver numbers]

A notebook where I start to explore some of these ideas can be found here: racedemo.ipynb.
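
As a quick sketch of that sort of query using the flat event table from earlier: driver number updates appear in column 2, so each such event gives a (timestamp, classification position) pair for a particular car (car 77, V. Bottas in the race fragment above, is used as the example here):

import pandas as pd
import matplotlib.pyplot as plt

#Pick out the cell updates that write a particular driver number into column 2
drivernum = '77'
trace = df_events[(df_events['identifier'] == '101')
                  & (df_events['column'] == '2')
                  & (df_events['value'] == drivernum)]

#Plot classification position against time for that car
plt.plot(pd.to_datetime(trace['timestamp']), trace['row'].astype(int), '.')
plt.gca().invert_yaxis()
plt.ylabel('Race classification')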

Something else I’ve started looking at is the use of MongoDB for grouping items that share the same timestamp (again, check the racedemo.ipynb notebook). If we create an ID based on the timestamp and row, we can repeatedly $set document elements against that key even if they come from separate timing feed elements. This gets us so far, but still falls short of identifying row based sets. We can perhaps get closer by grouping items associated with a particular row in time, for example, grouping elements associated with a particular row that are within half a second of each other. Again, the racedemo.ipynb notebook has the first fumblings of an attempt to work this out.
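
A minimal sketch of that MongoDB approach (the database and collection names are just examples, and this uses the update_one()/upsert call from more recent pymongo versions):

from pymongo import MongoClient

timing = MongoClient()['f1timing']['race']

#Use (timestamp, row) as the document key and $set each cell update against it
for e in events:
    if e['identifier'] != '101' or e['row'] is None or e['column'] is None:
        continue
    key = '{}_{}'.format(e['timestamp'], e['row'])
    timing.update_one({'_id': key},
                      {'$set': {'col_' + e['column']: e['value']}},
                      upsert=True)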

I’m not likely to have much chance to play with this data over the next week or so, and the time for making entries is short. I never win data competitions anyway (I can’t do the shiny stuff that judges tend to go for), but I’m keen to see what other folk can come up with:-)

PS The R book has stalled so badly I’ve pushed what I’ve got so far to the wranglingf1datawithr repo now… Hopefully I’ll get a chance to revisit it over the summer, and push on with it a bit more… When I get a couple of clear hours, I’ll try to push the stuff that’s there out onto Leanpub as a preview…

Written by Tony Hirst

July 2, 2014 at 10:38 pm

Posted in f1stats, Rstats


Recreational Data: Data Golf

I’m still hopeful of working up the idea of recreational data as a popular pastime activity with a regular column somewhere and a stocking filler book each Christmas (?!;-), but haven’t had much time to commit to working up some great examples lately:-(

However, here’s a neat idea – data golf – as described in a post by Bogumił Kamiński (RGolf) that I found via RBloggers:

There are many code golf sites, even some support R. However, most of them are algorithm oriented. A true RGolf competition should involve transforming a source data frame to some target format data frame.

So the challenge today will be to write a shortest code in R that performs a required data transformation

An example is then given of a data reshaping/transformation problem based on a real data task (wrangling survey data, converting it from a long to a wide format in the smallest amount of R code).

Of course, R need not be the only language that can be used to play this game. For the course I’m currently writing, I think I’ll pitch data golf as a Python/pandas activity in the section on data shaping. OpenRefine also supports a certain number of reshaping transformations, so that’s another possible data golf course(?). As are spreadsheets. And so on…
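
By way of a starter, the core long-to-wide move in pandas is pretty compact (the column names here are made up for the sake of the example):

import pandas as pd

#A toy long format survey table
long_df = pd.DataFrame({'respondent': [1, 1, 2, 2],
                        'question': ['q1', 'q2', 'q1', 'q2'],
                        'answer': ['yes', 'no', 'no', 'yes']})

#Reshape to wide format: one row per respondent, one column per question
wide_df = long_df.pivot(index='respondent', columns='question', values='answer')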

Hmmm… thinks… pivot table golf?

Also related: string parsing/transformation or partial string extraction using regular expressions; for example, Regex Tuesday, or how about Regex Crossword.

Written by Tony Hirst

May 23, 2014 at 10:27 am

Posted in Data, Rstats, School_Of_Data

