A Further Look at the Orange Data Playground – Filters and File Merging

In Orange Visual Visualisation Tool, I posted some introductory notes on using the Orange “visual visualisation” tool. Whilst crossing the Solent last week, I had another little play and discovered a few more compelling features of this application.

First up relates to the problem of merging data sets with a common column, for example where you have a common identifier appearing in two different data files (e.g. a school, council or university ID appearing in different sorts of report). I’ve described several different approaches to this in the past, such as using Google Fusion Tables, but none of them are ideal. But here’s how to do it the Orange way:

Merging data about a common column in Orange

Even though there appears to be only one input element to the Merge Tables widget, there are actually two…

Wiring in the examples connection

Here’s how we choose which columns to merge on:

Merging columns in Orange

Just to prove we’ve merged the data…

Merging data about a common column in Orange
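For anyone who wants to see what that merge amounts to outside the canvas, here is a minimal sketch in plain Python. It is not how the Merge Tables widget is implemented; the two tables, the id column name and the helper functions read_tsv and merge_on are all made up for illustration, and I am assuming we keep every row from the first table and mark anything we can't match with a ?.

```python
import csv
import io

# Two made-up tab separated tables that share an "id" column (hypothetical data)
left_tsv = "id\tname\nA1\tNorth School\nA2\tSouth School\n"
right_tsv = "id\tpupils\nA1\t420\nA3\t310\n"


def read_tsv(text):
    """Parse tab separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def merge_on(left, right, key, unknown="?"):
    """Keep every left row and add the columns of the right row with the same key."""
    lookup = {row[key]: row for row in right}
    extra_cols = [col for col in (right[0] if right else {}) if col != key]
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        new_row = dict(row)
        for col in extra_cols:
            # Rows with no partner in the right table get the unknown marker
            new_row[col] = match.get(col, unknown)
        merged.append(new_row)
    return merged


merged = merge_on(read_tsv(left_tsv), read_tsv(right_tsv), "id")
for row in merged:
    print(row)
```

Row A3 only appears in the second table, so it is dropped here; a different choice of which table to treat as the "main" one would give a different result.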

Looking at the data, the original files include x and y columns that were used to represent scaled versions of other columns, headed using the Gephi reserved column names x and y.

It’s easy enough to remove these columns within the Orange environment – simply use the Select Attributes widget.

selecting columns in Orange

The column selection itself is made within the Select Attributes dialogue:

Column selection dialogue in Orange
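The same column pruning can be sketched in a few lines of Python, if only to make clear what the widget is doing for us. The rows and the drop_columns helper are hypothetical; only the x and y column names come from the files described above.

```python
# Made-up rows carrying the unwanted Gephi-style x and y columns
rows = [
    {"id": "A1", "score": "12", "x": "0.1", "y": "0.9"},
    {"id": "A2", "score": "30", "x": "0.4", "y": "0.2"},
]


def drop_columns(rows, unwanted):
    """Return copies of the rows without the named columns."""
    return [
        {col: value for col, value in row.items() if col not in unwanted}
        for row in rows
    ]


cleaned = drop_columns(rows, {"x", "y"})
print(cleaned)
```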

As well as merging by column, it’s easy enough to concatenate data from several input files that share the same columns.

Orange - concatenate widget

Looking at the dialogue for this widget, we see it’s also possible to use it to merge data tables sharing common columns/attributes (such as a unique identifier), although where the tables being joined have columns that are not shared, unknown (?) values will be entered in the cells of each column that the original data table did not contain.
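To make the unknown value behaviour concrete, here is a rough sketch of concatenating two tables whose columns only partly overlap. The concatenate function and the sample tables are my own invention, not Orange code; the only thing borrowed from Orange is the use of ? for cells a table never had.

```python
def concatenate(tables, unknown="?"):
    """Stack tables on top of each other, filling columns a table lacks."""
    # Collect the union of column names, keeping first-seen order
    columns = []
    for table in tables:
        for row in table:
            for col in row:
                if col not in columns:
                    columns.append(col)
    combined = [
        {col: row.get(col, unknown) for col in columns}
        for table in tables
        for row in table
    ]
    return columns, combined


# Hypothetical tables: both have "id", but only one has each of the others
table_a = [{"id": "A1", "pupils": "420"}]
table_b = [{"id": "B7", "staff": "35"}]

columns, combined = concatenate([table_a, table_b])
print(columns)
for row in combined:
    print(row)
```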

As well as filtering whole columns out of a table using the Select Attributes widget, we can also filter rows based on cell entries matching specified conditions within particular columns.

Orange - select data rows
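In code terms, row selection is just a test applied to one column of every row. A minimal sketch, with made-up data and a hypothetical select_rows helper (the condition used, at least 5 comments, is only an example):

```python
# Made-up rows; the values are strings, as they would be when read from a TSV file
rows = [
    {"name": "alice", "comments": "12"},
    {"name": "bob", "comments": "0"},
    {"name": "carol", "comments": "5"},
]


def select_rows(rows, column, condition):
    """Keep the rows whose value in the given column satisfies the condition."""
    return [row for row in rows if condition(row[column])]


busy = select_rows(rows, "comments", lambda value: int(value) >= 5)
print([row["name"] for row in busy])
```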

It’s also worth mentioning that for large data sets, Orange can generate samples of your data for you:

Orange - data sampling
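I haven't looked at how Orange does its sampling, but the simplest version of the idea, a random sample of rows without replacement, looks something like this in Python. The sample_rows helper, the seed argument and the data are all assumptions for the sake of the example.

```python
import random


def sample_rows(rows, n, seed=None):
    """Pick up to n rows at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(rows, min(n, len(rows)))


# Ten made-up rows
rows = [{"id": f"user{i}", "comments": str(i * 2)} for i in range(10)]

subset = sample_rows(rows, 3, seed=42)
print(subset)
```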

And finally, once you’ve manipulated your data set, you’ll probably want to save it? That’s easy to wire in too:

Orange - save data
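If you ever need to do the equivalent by hand, writing rows back out as tab separated text takes only the standard csv module. This is a sketch with made-up rows and a hypothetical to_tsv helper; it builds the text in memory rather than guessing at a file name.

```python
import csv
import io


def to_tsv(rows, columns):
    """Serialise row dictionaries as tab separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"id": "A1", "pupils": "420"}, {"id": "A2", "pupils": "?"}]
tsv_text = to_tsv(rows, ["id", "pupils"])
print(tsv_text)
```

To write to disk instead, the same writer can be pointed at a file opened with open(path, "w", newline="").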

So, pretty impressive, huh? And drop dead easy… or should that be: “click and wire” easy, particularly the data merge on a common column…:-)

Completely OUseless, of course…. If you’ve read this far, I apologise for wasting your time…

Orange Visual Visualisation Tool

A few days ago, I came across a drag’n’drop, wire it together visualisation and data analysis tool called Orange.

Here’s a quick run through of some of the basics (at least, a run through of the first few things I tried to do with the tool…)

First off, we need some data. Orange likes TSV (tab separated values) rather than CSV, so I grabbed some TSV from one of the Guardian Datastore spreadsheets on Google Docs (use “Save as Text” to get the tab separated value format…)

TSV from google docs

Orange is a canvas based visual programming environment, in which functional blocks are added to the canvas and certain parameters are set within each block. Here’s how we get some data into Orange from a TSV file:

Orange viz tool - import data

The File icon is giving me a warning (no dependent variable) but I’m not sure why…? I’m sure Orange has managed to detect labels and quantities correctly from other files I’ve tried?

Anyway… we can inspect the data by looking at it in a data table widget – just wire one in:

Orange viz tool - data table

The table is sortable by column, and the Report button can be used to save a version of the table. Looking at the data table, we see it has identified columns with missing entries. We can clean these from our data set using the Preprocessing widget:

Orange - data cleaning
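One simple cleaning policy is to throw away any row that has a missing entry. The sketch below does just that; it makes no claim about which options the Orange widget offers, and I am assuming that missing cells show up as either an empty string or a ?. The drop_incomplete helper and the rows are made up.

```python
# Assumed markers for a missing cell
MISSING = {"", "?"}


def drop_incomplete(rows):
    """Keep only the rows in which every cell has a value."""
    return [
        row for row in rows
        if not any(value in MISSING for value in row.values())
    ]


rows = [
    {"council": "Northshire", "spend": "120"},
    {"council": "Southshire", "spend": ""},
    {"council": "Eastshire", "spend": "?"},
    {"council": "Westshire", "spend": "95"},
]

complete = drop_incomplete(rows)
print([row["council"] for row in complete])
```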

If we now wire the output of the Preprocessing widget into the Scatterplot widget, we can generate a variety of scatterplots:

Orange scatterplot

If you want to save a copy of the chart, it’s easy enough to do so. (I can’t get colour palettes to work on my Mac, so I’m stuck with greyscale displays. Also, the blob sizing doesn’t seem very responsive…)

Orange - save a scatterplot

The Report tool allows us to create a report from various bits of the dataflow, including adding information from several widgets to either separate report pages or the same report page.

Orange - report generator

Saving a Report saves all the report pages to a navigable set of HTML pages that resemble the Orange Report viewer.

Here are a couple of other things we can do with the data, this time using a data set that isn’t throwing the “dependent variable missing” warning, in particular the distribution of comments in a small Friendfeed network…

So for example, here’s how the number of comments made by members of the network is distributed:

Orange - distribution of values

Alternatively, we may look at the distribution in a more “statistical” way:

Orange - simple distributions
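The numbers behind charts like these are easy enough to compute by hand. Here is a sketch using Python's collections and statistics modules; the comment counts are invented, not taken from the Friendfeed data.

```python
from collections import Counter
import statistics

# Made-up number of comments posted by each member of a small network
comment_counts = [0, 1, 1, 2, 4, 4]

# How many members posted each number of comments (the histogram view)
distribution = Counter(comment_counts)
for value in sorted(distribution):
    print(value, "#" * distribution[value])

# The more "statistical" summary
mean_comments = statistics.mean(comment_counts)
median_comments = statistics.median(comment_counts)
print("mean:", mean_comments, "median:", median_comments)
```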

(Remember, we can generate these reports interactively, and then add them to a growing report.)

The survey plot gives us a macroscopic bird’s-eye view over the whole of the data set:

Orange - survey plot

Okay, that’s enough for starters – hopefully you get the idea: wire stuff together and generate visual reports… So why not go and download Orange now?!;-)

There is a whole range of clustering tools, too, which look like they could be interesting…

And I think the platform is extensible, which means there’s a way of adding your own widgets (written in Python, maybe..?)