OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for May 2011

A Bit of NewsJam MoJo – SocialGeo Twitter Map

with 7 comments

At the Mozilla/Knight/Guardian/BBC NewsJam #mojo event on Saturday (review by Martin Belam; see also Martin’s related review of the news:rewired event the day before), I was part of a small team that looked at how we might start validating folk tweeting about a particular news event. Here’s a brief write up of our design…

Try it here: SocioGeo map

When exploring twitter based activity around an event, Guardian journalist Paul Lewis raised the question “how does a journalist know which sources are to be trusted?” (a source verification problem), identifying this as one area where tool innovation may be able to help the journalist assessing which twitter sources around an event may be worth following or contacting directly.

The SocioGeo map begins to address this concern, and represents an initial attempt at mapping the social and geographical spread of tweets around an event in near real time. In its first incarnation, SocioGeoMap is intended to support visual analysis of the social and spatial distribution of people tweeting about an event in order to identify the extent to which people tweeting about an event are co-located with it and/or each other (initially, based on a sampling of geocoded tweets, although this might extend to reconciliation of identities from Twitter into location based checkin services such as Foursquare, or derived location services such as uploaded geocoded photos to Flickr), and the extent to which they know each other (initially, this is limited to whether or not they are following each other on Twitter, but could be extended to other social networks).

In his presentation at the #mojo London event, Guardian interactive designer Alastair Dant suggested a fruitful approach for hacks/hackers communities might be to identify publication “archetypes” such as maps and timelines, as well as “standard content types” such as map+timeline combinations. To these, we might add the “social network” archetype, and geo-social maps (locating nodes in space and drawing social connections between them), socio-temporal maps (showing how social connections ebb and flow over time, or how messages are passed between actors) or geo-socio-temporal maps (where we plot how messages move across spatially and socially distributed nodes over time).

If the simple geo-social map depiction demonstrated above does turn out to be useful, informative or instructive, the next phase might be to start using mathematical analyses of the geographical concentration of people tweeting about an event, as well as social network analysis metrics, to start assigning certainty factors to individuals relating to the degree of confidence we might have that they were an eyewitness to an event, embedded within it/central to it or a secondary/amplifying source only, and so on. A wider social network analysis (eg of the social networks of people associated with an event) might also provide information related to the authority/trustedness/reputations of the source in other contexts. These certainty factors might then be used to rank tweets associated with an event, or identify sources who might be worth contacting directly, or ignoring altogether. (That is, the analyses might be able to contribute to automatic filter configuration.)

SocioGeoMap is based on several observations:

  • that events may occur in a physical location, or virtual online space, or a combination of the two;
  • that people tweeting about an event may or may not be participating in it or eyewitnesses to it (if not, they may be amplifying for direct or indirect reasons; indirect reasons might be where the retweeter is not really interested in the event, but was interested in amplifying the content of a tweet that also mentioned the event); we might associate a certainty factor with the extent to which we believe a person was a participant in, or eyewitness to, an event, whether they were rebroadcasting the event as a “news service”, whether they were commenting on the event, or raising a question to event participants, and so on;
  • that people tweeting about an event may or may not know each other.

Taking the example of a football match, we might imagine several different co-existing states:

  • a significant number of people co-located with the event (and eyewitnesses to it); small clusters of these people may be tightly interconnected and follow each other (for example, social groups who visit matches together), some clusters may be weakly associated with each other via a common node (for example, different follower groups of the same team following the same football players), and large numbers of people/clusters may be independent;
  • a very large number of people following the event through a video or audio stream but not co-located with it; it is likely that we will see large numbers of small subnetworks of people who know each other through eg work but who also share an interest in football.

In the case of a bomb going off in a busy public space, we might imagine:

  • a small number of people co-located with the event and largely independent of each other (not socially connected to each other);
  • a larger number of people who know eyewitnesses and retweet the event;
  • people in the vicinity of the event but unaware of it, except that they have been inconvenienced by it in some way;
  • people unconnected to the event who saw it via a news source or as a trending news topic and retweet it to feel as if they are doing their bit, to express shock, horror, pity, anger, etc.

SocioGeoMap helps visualise the extent to which twitter activity around an event is either distributed or localised in both social/social network and geographical spaces.
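
As a rough illustration of what "localised in geographical space" might mean numerically, here is a minimal Python sketch that measures what fraction of geocoded tweets fall within some radius of an event. It is not part of the SocioGeoMap demo: the function names, the 1 km default radius and the use of the standard haversine great-circle formula are all assumptions made for the sake of the example.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; adequate for a rough sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def fraction_colocated(event, tweet_points, radius_km=1.0):
    """Share of geocoded tweets lying within radius_km of the event location.

    event is a (lat, lon) pair; tweet_points is a list of (lat, lon) pairs.
    The radius is a hypothetical threshold, not something from the demo.
    """
    if not tweet_points:
        return 0.0
    near = sum(
        1
        for lat, lon in tweet_points
        if haversine_km(event[0], event[1], lat, lon) <= radius_km
    )
    return near / len(tweet_points)
```

A value close to 1 would suggest the tweeting is concentrated around the event; a value close to 0 would suggest mostly remote amplification. Only geocoded tweets can be counted this way, so the figure describes the sample rather than everyone tweeting about the event.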

In its current form, SocioGeoMap is built from a couple of pre-existing services:

  • a service that searches the recent twitter stream around a topic and identifies social connections between people who are part of that stream;
  • a service that searches the recent twitter stream around a particular location (using geo-coded tweets) and renders them on an embeddable map.

In its envisioned next generation form, SocioGeoMap will display people tweeting about a particular topic by location (i.e. on a map) and also draw connections between them to demonstrate the extent to which they are socially connected (on Twitter, at least).

SocioGeoMap as currently presented is based on particular, user submitted search queries that may also have a geographical scope. An extension of SocioGeoMap might be to create SocioGeoMap alerting dashboards around particular event types, using techniques similar to those employed in many sentiment analysis tools, specifically the filtering of items through word lists containing terms that are meaningful in terms of sentiment. The twist in news terms is to identify meaningful terms that potentially relate to newsworthy exclamations (“Just heard a loud explosion”, “goal!!!!”, “feel an earthquake?” and so on), and rather than associating positive or negative sentiment with a term or brand, trying to discover tweets associated with sentiments of shock or concern in a particular geographical location.
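
The word list filtering idea can be sketched in a few lines of Python. This is only an illustration of the general technique: the term list below is made up for the example (seeded from the exclamations quoted above), the function names are hypothetical, and a real alerting dashboard would need a much more carefully curated list as well as the geographical filter.

```python
import re

# Hypothetical word list of terms that might signal a newsworthy exclamation.
ALERT_TERMS = ["explosion", "earthquake", "goal", "gunshot"]


def alert_terms_in(text, terms=ALERT_TERMS):
    """Return the alert terms that appear as whole words in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [term for term in terms if term in words]


def filter_alerts(tweets, terms=ALERT_TERMS):
    """Keep only tweets containing at least one alert term.

    Returns a list of (tweet_text, matched_terms) pairs.
    """
    flagged = []
    for tweet in tweets:
        hits = alert_terms_in(tweet, terms)
        if hits:
            flagged.append((tweet, hits))
    return flagged
```

Matching on whole words keeps the sketch simple, but it will miss variants such as "explosions" or "goooal"; stemming or fuzzy matching would be an obvious next step.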

SocioGeoMap may also be used in association with other services that support the pre-qualification or pre-verification of individuals, or certainty measure estimates on their expertise or likelihood of being in a particular place at a particular time. So for example, in the first case we might imagine doing some prequalification work around people likely to attend a planned event, such as a demonstration, based on their public declarations (“Off to #bigDemo tomorrow”), or identify their remote support/interest in it (“gutted not to be going to #bigDemo tomorrow”). Another example might include looking for geolocated evidence that an individual is a frequenter of a particular space, for example through a geo-coded analysis of their personal twitter stream and potentially also at one remove, such as through a geocoded analysis of their friends’ profiles and tweetstream, and as a result derive a certainty measure about the association of an individual with a particular location; that is, we could start to assign a certainty measure to the likelihood of their being an eyewitness to an event in a particular locale based on previous geo-excursions.


By: Tony Hirst (@psychemedia), Alex Gamela (@alexgamela), Misae Richwood (@minxymoggy)
Mozilla/Knight/Guardian/BBC News Jam, Kings Tower, London, May 28th, 2011 #mojo

Implementation notes:

The demo was built out of a couple of pre-existing tools/components: a geo-based Twitter search constructed using Yahoo Pipes (Discovering Co-location Communities – Twitter Maps of Tweets Near Wherever…); and a map of social network connections between folk recently using a particular search term or hashtag (Using Protovis to Visualise Connections Between People Tweeting a Particular Term). It is possible to grab a KML URL from the geotwitter pipe and feed it into a Google map that can be embedded in a page using an iframe. The social connections graph can also be embedded in an iframe. The SocialGeoMap page is a page that contains two iframes, one that loads the map, and a second that loads the social network graph. The same data pulled from the Yahoo geo-search pipe feeds both visualisations.

In many cases, several tweets may have the exact same geo-coordinates, which means they are overlaid on the map and difficult to see. To get around this, a certain amount of jitter is added to each latitude and longitude; because Yahoo Pipes doesn’t have a native random number generator, I use a tweet ID to generate a jitter offset using the following pipe:

This is called just before the output of the geotwitter search pipe:

Whilst this does mean that no points are plotted with their exact original co-ordinates, it does mean that we can separate out most of the markers corresponding to tweets with the same latitude and longitude and thus see them independently on the map at their approximate location.
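
For anyone wanting to reproduce the jitter trick outside Yahoo Pipes, here is a minimal Python sketch of the same idea: derive a small, repeatable offset from the tweet ID instead of a random number. The details are assumptions rather than a transcription of the pipe: the choice of digits, the `scale` value (in degrees) and the function names are all illustrative.

```python
def jitter_offset(tweet_id, scale=0.001):
    """Derive a small, repeatable (lat, lon) offset in degrees from a tweet ID.

    Different digit groups of the ID are used for latitude and longitude so
    the two offsets are not always identical. scale is the total width of the
    jitter band in degrees (an assumed value, tune to the map zoom level).
    """
    lat_part = (tweet_id % 1000) / 1000.0            # 0.0 .. 0.999
    lon_part = ((tweet_id // 1000) % 1000) / 1000.0  # 0.0 .. 0.999
    # Centre on zero so markers spread out in all directions.
    return ((lat_part - 0.5) * scale, (lon_part - 0.5) * scale)


def jitter(lat, lon, tweet_id, scale=0.001):
    """Return the (lat, lon) pair nudged by the tweet's own offset."""
    d_lat, d_lon = jitter_offset(tweet_id, scale)
    return lat + d_lat, lon + d_lon
```

Because the offset depends only on the tweet ID, the same tweet always lands in the same place on the map, which a true random number generator would not guarantee across refreshes.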

A next step in development might be to move away from using Yahoo Pipes (which incur a caching overhead) and use a server-side service. A quickstart solution to this might be to generate a Python equivalent of the current pipework using Greg Gaughan’s pipe2py compiler, which generates a Python code equivalent of a Yahoo pipe.

Written by Tony Hirst

May 30, 2011 at 5:40 pm

Google Correlate: What Search Terms Does Your Time Series Data Correlate With?

with 2 comments

Just a few days over three years ago, I blogged about a site I’d put together to try to crowdsource observations about correlated search trends: TrendSpotting.

One thing that particularly interested me then, as it still does now, was the way that certain search trends reveal rhythmic behaviour over the course of weeks, months or years.

At the start of this year, I revisited the topic with a post on Identifying Periodic Google Trends, Part 1: Autocorrelation (followed by Improving Autocorrelation Calculations on Google Trends Data).

Anyway, today it seems that Google has cracked the scaling issues with discovering correlations between search trends (using North American search trend data), as well as opening up a service that will identify what search trends correlate most closely with your own uploaded time series data: Correlate (announcement: Mining patterns in search data with Google Correlate).

For the quick overview, check out the Google Correlate Comic.

So what’s on offer? First, enter a search term and see what it’s correlated with:

As well as the line chart, correlations can also be plotted as a scatterplot:

You can also run “spatial correlations”, though at the moment this appears to be limited to US states. (I *think* this works by looking for search terms that are popular in the requested areas and not popular in the other listed areas. To generalise this, I guess you need three things: the total list of areas that work for the spatial correlation query; the areas where you want the search volume for the “to be discovered correlated phrase” to be high; and the areas where you want the search volume for the “to be discovered correlated phrase” to be low?)

At this point it’s maybe worth remembering that correlation does not imply causation…

A couple of other interesting things to note: firstly, you can offset the data (so shift it a few weeks forwards or backwards in time, as you might do if you were looking for lead/lag behaviour); secondly, you can export/download the data.
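
To make the offset idea concrete, here is a minimal Python sketch of a lagged correlation between two equal-length weekly series. It makes no claim about how Google Correlate computes things internally; it simply uses the standard Pearson correlation coefficient, and the function names and the sign convention for `lag` are assumptions of this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)  # undefined (division by zero) for flat series


def lagged_correlation(xs, ys, lag):
    """Correlate xs[t] with ys[t + lag]; lag may be negative.

    A high value at a positive lag suggests ys follows xs by that many steps.
    Assumes both series have the same length and the same sampling interval.
    """
    if lag >= 0:
        a, b = xs[: len(xs) - lag], ys[lag:]
    else:
        a, b = xs[-lag:], ys[: len(ys) + lag]
    return pearson(a, b)
```

Scanning `lag` over a small range (say, a few weeks either side of zero) and picking the value with the highest correlation is one simple way to look for lead/lag behaviour in exported data.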

You can also upload your own data to see what terms correlate with it:

(I wonder if they’ll start offering time series analysis features on uploaded data, as well as other trend data, too? For example, frequency analysis or trend analysis? This is presumably going on in the background, though I haven’t read the white paper [PDF] yet…)

As if that’s not enough, you can also draw a curve/trendline and then see what correlates with it (so this is a weak alternative to uploading your own data, right? Just draw something that looks like it…) (h/t to Mike Ellis for first pointing this out to me).

I’m not convinced that search trends map literally onto the well known “hype cycle” curve, but I thought I’d try out a hype cycle reminiscent curve where the hype was a couple of years ago, and we’re now maybe seeing it start to reach mainstream maturity, with maybe the first inklings of a plateau…

Hmmm… the pr0n industry is often identified as a predictor of certain sorts of technology adoption… maybe the 5ex searchers are too?! (Note that correlated hand-drawn charts are linkable).

So – that’s Google Correlate; nifty, eh?

PS Here’s another reason why I blog… my blog history helps me work out how far in the future I live;-) So currently about three years in the future… how about you?!;-)

PPS I can imagine Google’s ThinkInsights (insight marketing) loving the thought that folk are going to check out their time series data against Google Trends so the Goog can weave that into its offerings… A few additional thoughts leading on from that: 1) when will correlations start to appear in Google AdWords support tools to help you pick adwords based on your typical web traffic patterns or even sales patterns? 2) how far are we off seeing a Google Insights box to complement the Google Search Appliances, that will let you run correlations – as well as Google Prediction type services – onsite without feeling as if you have to upload your data to Google’s servers, and instead, becoming part of Google’s our-kit-in-your-racks offering; 3) when is Google going to start buying up companies like Prism and will it then maybe go after the likes of Experian and Dunnhumby to become a company that organises information about the world of people, as well as just the world’s information…?!

PPPS Seems like as well as “traditional” link sharing offerings, you can share the link via your Google Reader account…

Interesting…

Written by Tony Hirst

May 25, 2011 at 5:57 pm

On the Public Understanding of – and Public Engagement With – Statistics: Reflections on the OU Statistics Group Conference on “Visualisation and Presentation in Statistics”

with 6 comments

Last week I attended the OU Statistics conference on Visualisation and Presentation in Statistics (VIPS) (notes: here and here)

One of the things that struck me from conversations and some of the presentations was that statistics – and in particular public engagement around statistics – appears to be lagging science efforts in this area.

When I first moved to the OU as a lecturer a dozen or so years ago, I got involved with various activities that, at the time, were classed as “public understanding of science and technology”, though at the time the whole sci-comm area was in a state of flux and ideas were moving towards a focus on public engagement with science. As a member of the NESTA Crucible one year, I saw how there was also concern around engagement with science and technology policy, and how it could be moved “upstream”, to a point where dialogue with various publics could actually contribute to, and even influence, policy development.

(The NESTA Crucible experience significantly influenced my world view and was one of the most rewarding schemes I have ever been involved with…)

Since then, it seems to me that the school science curriculum has witnessed a similar change, with a move away from a focus purely on the basic science (and perhaps industrial applications?) to one that includes a consideration of socio-technical considerations (one might say, policy implications…)

At the VIPS event, one of the phrases that jumped out at me in at least one presentation (aside from repeated mentions of RSS…;-) concerned difficulties in promoting the public understanding of statistics. Ally this with the fact that the school maths curriculum seems not to have evolved so much (“averages”, means and histograms still seem to be the focus?!) and I wonder: is statistics today where science was a decade or so ago?

The recent rhetoric around – and actual release of – “open public data” suggests that, as citizens and journalists, there is an increasing number of opportunities to hold governments and public bodies to account using evidentiary data and maybe also engage in data-driven (or at least data informed) policy formulation. With so much data out there, and so many possible ways of combining and interrogating it – so many possible different questions to ask and places to ask them – there are increasingly opportunities for informed amateurs to make a very real contribution (in the same way that amateur astronomers can make a real contribution to the recording and analysis of astronomical observations).

The growing instrumentation of our world also means that there are increasing amounts of data about ourselves that we can have access to in the form of personal data dashboards (for example, think of various social media/reputation tools, but also expect to see various tools appearing that allow you to mine your health/fitness, financial or shopping transaction data). These dashboards will be visually rich, and designed to give at-a-glance overviews of the state of this or that quantity or metric. But to get most from them, we will need to include more complex and powerful visualisation types, and find a way of helping people learn how to “see” them, “read” them and interpret them.

So to what extent do we need to engage with the “public understanding of statistics” as compared to the development of skills in the public appreciation of statistics and improvements in the way the public can engage with each other and with policy makers in discussions where statistics play a role? (Public engagement in statistics? Public engagement with statistics?)

Over the last few weeks, I’ve started trying to immerse myself in the world of statistical graphics, on the basis that our perceptual apparatus is pretty good at pattern detection and can help us get to grips with visually meaningful properties of distributions of data without us necessarily having to understand much in the way of formal statistics. (Of course, the visual apparatus can also be conned by misleading graphs and charts, which is where some semblance of critical understanding and, dare I say it, statistical literacy, comes in.)

My intuition is that it will be easier to develop a visual literacy in the reading and interpretation of charts (i.e. building on “folk statistical graphics/visual statistics”) than a widespread mathematical understanding of statistics. (I suspect that for most people, pie charts – and more recently ‘donut’ charts – as well as line graphs and simple bar charts are about the limit of what they are comfortable with, along with thematic maps (in particular, choropleth maps) and (in recent years again?) proportional symbol maps. I also know from asking even well informed audiences that awareness of more recently developed techniques, such as treemaps, is not widespread.)

At the moment, the infographics designers appear to be leading the charge into public consciousness of data-driven graphics, but as I’m finding out, the stats community has a wealth of visual techniques already to hand that are maybe “sounder” in terms of deriving visual representations that reflect statistical properties and concerns than the tricks the infographics crowd are using. (This is all just my anecdotal opinion, and not based on any formal research!)

Many infographics build on a common visual grammar (in the West, line charts up to the right increase over time; for area based charts, the bigger the area the more of something is being represented). But many infographics are also limited by the chart types we are all familiar with (line charts, bar charts, coloured maps…) Maybe the place to start is the stats community finding ways of introducing new-to-the-majority statistical graphs into the mainstream media along with a strong narrative to explain what is going on in those charts (and not necessarily so much discussion about the actual maths and stats…)?

Written by Tony Hirst

May 24, 2011 at 1:56 pm

A Noticing…about e-mail…

with 4 comments

Anyone who knows me knows that email is not the best way of getting in touch with me…

Anyway…

Earlier today, I received a tweet telling me I’d been sent an email (the email was only a couple of lines long… it would have probably fitted into a tweet…) but without too much of a hint about what it contained… (okay, I exaggerate: there was a hint; but the story’s better this way;-)

Because I hold email in low regard, I felt no urgency to check it…

Checking my Google docs account just by the by (it’s a space I visit quite frequently, typically several times a day), I noticed the arrival of a new document that had been shared with me… I had a look at the document, and made a couple of comments…

Then I got another tweet asking if I had got the message I’d been sent earlier..?

I assumed that this was about the document I had commented on…

So we have a mismatch in expectations. One scenario goes like this:

- someone shares a doc with me
- I open it and comment on it. My change means the doc is updated as “modified/unread” in other folks’ document listings.

Another goes like this:
- someone shares a doc with me
- they email me to let me know
- I pick up the email, read it, and note that there’s a doc I need to attend to
- I open the doc and comment on it.

Or like this:
- someone shares a doc with me
- they email me to let me know
- I pick up the email, read it, and note that there’s a doc I need to attend to
- I respond to the email (“ok- thanks, will check”)
- I open the doc and comment on it.

And another:
- someone shares a doc with me
- they email me to let me know
- they send me a tweet to let me know about the email
- I pick up the tweet (“ok – thanks, will check”, as a tweet)
- I pick up the email, read it, and note that there’s a doc I need to attend to, then reply by email (“ok, thanks, will check”, as an email)
- I open and comment on the document.

It may also go like this:
- someone shares a doc with me
- they email me to let me know
- they send me a tweet to let me know about the email
- I pick up the tweet (“ok – thanks, will check”, as a tweet)
- I pick up the email, read it, and note that there’s a doc I need to attend to, then reply by email (“ok, thanks, will check”, as an email)
- I open and comment on the document
- I send an email and/or a tweet saying I have commented on the doc…

So it goes…

Written by Tony Hirst

May 23, 2011 at 9:41 pm

Posted in Anything you want, OU2.0

Whose Investor Relations Sites Do Thomson Reuters Host? A Form of URL Hacking…

with 2 comments

A quick little infoskills demo using Google search…

I very rarely do more than skim the headlines of posts in my feed from TechCrunch, but today I actually opened up a post about Amazon buying another imprint (Amazon Expands To Mysteries And Thrillers With Fifth Publishing Imprint, Thomas & Mercer). As with a lot of TechCrunch posts, it’s pretty much just a rebranding of a press release, though to their credit TechCrunch linked to the “original” source: www.businesswire.com/news/home/20110518005494/en/Amazon-Launches-Publishing-Imprint-Thomas-Mercer

…insofar as a wire service copy of a press release is an original source… This got me wondering whether the press release had also appeared on a more direct Amazon press release page…?

Googling for press release amazon on google.co.uk turned up a way in to Amazon’s UK media relations site: www.amazon.co.uk/gp/press/pr/20080710 (try hacking around the URL to find an amazon.com equivalent…) as well as a page on the central US site: phx.corporate-ir.net/phoenix.zhtml?ID=1565581&c=176060&p=irol-newsArticle, a third party (?) hosted site with Amazon branding: phx.corporate-ir.net/phoenix.zhtml?p=irol-mediaHome&c=176060. (One of the oft-taught infoskills tips is not to necessarily trust a site where the domain in the URL doesn’t appear to fit…) Note: I also got redirected to the main page of the Amazon-looking-but-not-on-an-Amazon URL page through trying amazon.com/pr.

Seeing who hosts this site – that is, just trying phx.corporate-ir.net – we get a redirect to www.ccbn.com which immediately then hops to thomsonreuters.com/products_services/financial/financial_products/corporate_services/investor_relations/

So Thomson Reuters, maybe… (If we were doing a proper job, we’d be looking up internet domain registrations, but I’m a trusting sort.. If you’re interested, the service you need to use is called whois and then look for the registrant. For example, www.networksolutions.com/whois-search/corporate-ir.net;-) But who else do they run this service for?

If you Google inurl:phx.corporate-ir.net/phoenix.zhtml you can spot results from various sources. Looking through the URLs, many of the pages are hosted on (hosted) investor relations sites. The p= argument specifies the page, the c= argument looks like it might identify a company.
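
Reading the arguments out of these URLs is easy to automate. The following is a minimal Python sketch using the standard library's `urllib.parse`; the function name and the `page`/`company` labels are my own, and treating `c=` as a company identifier is only the guess made above, not something documented.

```python
from urllib.parse import urlsplit, parse_qs


def phoenix_params(url):
    """Extract the p= (page) and c= (presumed company) arguments from a URL.

    Returns a dict with 'page' and 'company' keys; a missing argument
    gives None. The meaning of c= is inferred, not documented.
    """
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the query
    query = parse_qs(urlsplit(url).query)
    return {
        "page": query.get("p", [None])[0],
        "company": query.get("c", [None])[0],
    }
```

Run over a list of search result URLs, this would let you tabulate which `c=` values turn up alongside which page types.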

We can try blindly hacking company numbers to see what companies turn up, or we can be a little more structured, for example by finding the investor relations homepage value (p=irol-irhome looks a good bet) and Googling on inurl:http://phx.corporate-ir.net/phoenix.zhtml? inurl:p=irol-irhome site:phx.corporate-ir.net.

The results pages are then full of investor relations sites hosted by Thomson Reuters for third parties.

(You might think that the use of site: is redundant given the first inurl: limit. I hit upon using it like that as a result of clicking on the “More results from this site” option from one of my initial search attempts… It seems to override the collapsing of multiple results on a single domain…)

If you want to tunnel down to press release/media relations sites, there seem to be a variety of landing pages (p=irol-news, p=irol-press), so instead we can fudge a bit and search for “press” in the page title: inurl:http://phx.corporate-ir.net/phoenix.zhtml? intitle:press site:phx.corporate-ir.net

If you live and work on the web, being able to read URLs, and use that knowledge to help you search URLs, is a handy skill to have…

PS for another take on “website analysis” journalism, see @pubbbox: UKTI’s TechCityUK site: 100 WordPress pages, £53k

Written by Tony Hirst

May 19, 2011 at 6:10 pm

Reshaping Your Data – Pivot Tables in Google Spreadsheets

One of the slides I whizzed by in my presentation at the OU Statistics conference on “Visualisation and Presentation in Statistics” (tweet-commentary notes/links from other presentations here and here) relates to what we might describe as the shape that data takes.

An easy way of thinking about the shape of a dataset is to consider a simple 2D data table with columns describing the properties of an object and rows typically corresponding to individual things. Often the structure is regular, and each cell in the table takes a valid value. Occasionally, some cells may be blank, in which case we can start to think of the shape of the data getting a little bit ragged.

If you are working with a data table, then on occasion you may want to swap the rows for columns (certain data operations require data to be organised in a particular way). By swapping the rows and columns, we change the shape of the data (for example, going from a table with N rows and M columns to one with M rows and N columns). So that’s one way of reshaping your data.
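
Outside a spreadsheet, swapping rows for columns is a one-liner in many languages. Here is a minimal Python sketch, assuming the table is held as a list of equal-length rows (the function name is just for illustration):

```python
def transpose(table):
    """Swap rows and columns of a table held as a list of rows.

    zip(*table) pairs up the first cells of every row, then the second
    cells, and so on. Note that zip stops at the shortest row, so a
    ragged table would silently lose cells.
    """
    return [list(column) for column in zip(*table)]
```

For example, `transpose([[1, 2, 3], [4, 5, 6]])` turns a 2 row by 3 column table into a 3 row by 2 column one.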

Many visualisation tools require data to be in a particular shape in order for the data to be visualised appropriately. For example, look at the template pages on Number Picture, a new site hosting templated visualisations built using Processing that allow you to cut, paste and visualise data – if it is appropriately shaped – at a click.

But where do pivot tables come in? One way is to think of them as a tool for reshaping your data by providing summary reports of your original data set.

Here’s how the Goog describes them:

What pivot tables allow you to do is generate reports based on contents of a table using the values contained within one or more columns to define the columns and rows of a summary table. That is, you can re-present (or re-shape) a table as a new table that summarises data contained in the original table in terms of a rearrangement of the cell values of the original table.

Here’s a quick example. I have a data set that identifies the laptimes of drivers in an F1 race (yes, I know… again!;-) by stint, where different stints are groups of consecutive laps separated by pit stops.

If you look down the stint column you can see how its value groups together blocks of rows. But how do I easily show how much time each driver (column C) spent on each stint? The time the driver spent on each stint is the sum of laptimes by driver within a stint, so for each driver I need to find out the laps associated with each stint, and then sum them. Pivot tables let me do that. Here’s how:

So how does this work? The columns in the new table are defined according to the unique values found in the stint column of the original table. The rows in the new table are defined according to the unique values found in the car column of the original table. The cell values in the new table for a given row and column are defined as the sum of lapTime data values from rows in the original table where the car and stint values in the row correspond to the row and column values corresponding to each cell in the new table. Got that?

If you’re familiar with databases, you might think of the column and row settings in the new table defining “keys” into rows on the original table. The different car/stint pairs identify different blocks of rows that are processed per block to create the contents of the pivot table.

As well as summing the values from one column based on the values contained in two other columns, the pivot table can do other operations, such as counting the number of rows in the original table containing each “key” value. So for example, if we want to count the number of laps a car was out for by stint, we can do that simply by changing our pivot table Values setting.
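
If it helps to see the mechanics spelled out, here is a minimal Python sketch of the same grouping logic: rows are collected by their (row key, column key) pair and then each group is reduced with a function such as `sum` or `len`. The column names follow the car/stint/lapTime example above, but the lap times themselves are made-up illustrative values and the `pivot` function is my own, not anything from Google Spreadsheets.

```python
from collections import defaultdict


def pivot(rows, row_key, col_key, value_key, agg=sum):
    """Build a nested dict {row value: {column value: aggregated value}}.

    rows is a list of dicts; agg is applied to the list of values found
    for each (row, column) pair, e.g. sum for totals or len for counts.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row[row_key], row[col_key])].append(row[value_key])
    table = defaultdict(dict)
    for (r, c), values in groups.items():
        table[r][c] = agg(values)
    return {r: dict(cols) for r, cols in table.items()}


# Made-up lap data, purely for illustration.
laps = [
    {"car": 1, "stint": 1, "lapTime": 100.0},
    {"car": 1, "stint": 1, "lapTime": 98.5},
    {"car": 1, "stint": 2, "lapTime": 97.0},
    {"car": 2, "stint": 1, "lapTime": 101.0},
]

stint_times = pivot(laps, "car", "stint", "lapTime")           # time per stint
stint_laps = pivot(laps, "car", "stint", "lapTime", agg=len)   # laps per stint
```

Swapping `agg` between `sum` and `len` corresponds to changing the Values setting between summing and counting.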

Pivot tables can take a bit of time to get your head round… I find playing with them helps… A key thing to remember is: if you want to express a dataset in terms of the unique values contained within a column, the pivot table can help you do that. In the above example, I was essentially generating the row and column values for a new table based on categorical data (driver/car number and stint number). Another example might be sales data where the same categories of item appear across multiple rows and you want to generate reports based on category.

Written by Tony Hirst

May 19, 2011 at 10:30 am

Quick Summary of Second and Third Sessions of “Visualisation and Presentation in Statistics”

with 2 comments

Kevin McConway ( http://statistics.open.ac.uk/People/k.j.mcconway @kjm2 ): showing off some gratuitous use of numbers to illustrate Guardian stories #ouvpstats
Where do surveys reported in the press come from? ONS, market research companies. PR companies…… #ouvpstats
Get paid to do a (PR?) survey onepoll.com and youngpoll.com #ouvpstats
Not PR commissioned polls, err, maybe, err, hmmm…. http://72point.com/ #ouvpstats
Why are there numbers in the news? PR, Entertainment, eyecandy. Special status of “number facts” #ouvpn
Mary Poovey “A History of the Modern Fact” http://www.press.uchicago.edu/ucp/books/book/chicago/H/bo3614698.html #ouvpn
Need to distinguish between facts, analysis and narrative… #ouvpstats
What’s wrong with PR stats? ’tis the road to cynicism, or looking good rather than communicating well #ouvpstats
So what can we do about it? Statisticians need to engage with the public and work with journalists #ouvpstats
Statisticians’ view of journalists: innumerate, distort and oversimplify, don’t understand quantitative reasoning, won’t listen #ouvpstats
Journalists’ view of statisticians: illiterate, pedantic, boring, focus on ifs and buts, won’t listen #ouvpstats
Journalists work to tight timescales, have a view of “newsworthiness”, are good storytellers #ouvpstats

Martin Bland ( https://hsciweb.york.ac.uk/research/public/Staff.aspx?ID=129 )
From papers during one issue from 1972 and 2010 Lancet and BMJ, mean population size has gone up 2-3 orders of magnitude (tens to thousands+)
Description of stats: 1972: very cursory; 2010: far more comprehensive statistical method reported. Shift from significance testing to estimation
Move towards evidence-based medicine starting around 1990s (bound to include statistics)
“Why do we need some large, simple randomized trials?” Yusuf et al. 1984
Move to confidence intervals not p-values Gardner & Altman http://www.bmj.com/content/292/6522/746.abstract
Journals started to introduce systematic requires and statistical referees
Consort guidelines for stats in randomised medical trials http://www.consort-statement.org/
Statisticians should point out where wrong conclusions have been drawn as a result of stats mistakes…

Rosemary Bailey http://www.maths.qmul.ac.uk/~rab/
Problems with box and whisker plots (referred to as box and aerial/antenna plot?), which are now popular in medicine, biology, engineering (not least because folk don’t know what the whisker means). Antenna doesn’t take into account variability across conditions. [My naive understanding of these diagrams is that they are trying to say something different? But my knowledge is so hazy I can't argue for what I do think they describe!]
Hasse diagrams – cords, dyes and constants(?) [I'm a bit lost at this point...]

Michel van de Velden http://www.erim.eur.nl/ERIM/People/Person_Details?p_aff_id=799
Perceptual maps – multivariate methods for plotting high-dimensional data
Exploit natural spatial recognition/visual abilities
Examples: Tufte 1983, Cleveland and McGill 1987, Wainer 2005
Caption should convey enough info to allow reader in possession of data (and appropriate tools) to recreate the perceptual map
Shape parameter (aspect ratio) – ratio of x scale to y scale. If it can be 1, it should be… (changes aspect ratio of photo of Kate Middleton to make the point about distortion if not 1 when it could/should be…)
If perception of map relies in part on angle of point/line, need to know where the origin is.
Excel charts – hard to explicitly set an exact aspect ratio (same with many tools?)
Perceptual maps may require guidance as to how to read a map – e.g. icons http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1572196

[Me]

Jill Leyland, Vice President, Royal Statistical Society
Lots of folk think UK official statistics are not free of political interference, nor do they necessarily trust(?) them; the UK scores very poorly compared to the rest of Europe.
National Stats have high integrity and are free of political interference. Perception of political interference is one reason for the low degree of trust. UKSA (UK Statistics Authority) scrutinises official statistics: “promoting and safeguarding the production and publication of statistics that serve the public good”
No political interference, but: many key stats produced in depts, UKSA role not fully understood (scrutineer as well as publisher); pre-release access – Ministers can see statistics 24 hrs before they are released (up to 5 days in Scotland and Wales), and suspicion that Ministers may use this time for mischief…
Role of media – UK media are interested in statistics, but “stats are wrong” stories get more coverage than “stats are right”, and journalists often don’t understand statistical issues (as well as tight deadlines, no specialist knowledge). BUT official statisticians could do better; ONS website a joke… (though new one due to launch at end of August). Far too little interaction with stats users outside government.
What can be done? Continuing efforts to improve presentation; need to differentiate between independent national statistics and those produced by departments. Better education for journalists [and statisticians eg ito communications?]; reduction/elimination of pre-release access.

Written by Tony Hirst

May 18, 2011 at 4:26 pm

Posted in Anything you want


My Presentation at OU Statistics Conference – Visualisation Tools for the Rest of Us

with 2 comments

Slides from my presentation to the OU Visualisation and Presentation in Statistics conference earlier today… will update this post with notes and links as and when I get round to it! In the meantime, you’ll have to use Google… (though other search engines are available). (Slides via Slideshare)

Written by Tony Hirst

May 18, 2011 at 4:14 pm

Quick Summary of Opening Session of “Visualisation and Presentation in Statistics”

with 3 comments

From @Flygirltwo: @psychemedia What’s the event? #ouvpstats Looks really interesting, especially the identified need to tell good stories around the stats.
.@Flygirltwo event is “Visualisation and Presentation in Statistics” http://bit.ly/kQyBKW #ouvpstats

John Gower (http://statistics.open.ac.uk/People/j.c.gower) availability of enticing viz tools may pose risks/danger of “improper use” #ouvpstats
Can use diagrams to help yourself come to some understanding of something, but image not necessarily useful to others,
Tools okay when used properly, but they get much “improper use”/misused
Wrote a paper on this but can’t get it published….
Problems with public understanding of statistics

Michael Blastland ( http://bbc.in/mbl9eZ ) now on… #moreorless #ouvpstats “Numbers go up, and down…” http://bbc.in/jeXvHl
Encouraging people to play their way to an understanding BUT people don’t know what questions to ask/test http://bbc.in/ln3Czz #ouvpstats
Another stats game (hospital death chance calculator) but it didn’t really work.. Storytelling is missing http://bbc.in/jh95gM #ouvpstats
Need narrative arc to help people make sense of (their use of) interactive viz. Playing gapminder animation, with no Hans Rosling #ouvpstats
Imposition of narrative means experts can say “Hmm, I’m not sure I agree with that…” #ouvpstats
Office for National Statistics is great, but little use to public. ONSstats on Youtube helps address this http://bit.ly/jLHp6B #ouvpstats
ONSstats on youtube uses narrative to help explain the stats… http://bit.ly/jLHp6B #ouvpstats
Most people don’t necessarily get much out of charts with wiggly lines #ouvpstats
Need an understanding of the underlying issues in order to engage with interactive data visualisation meaningfully #ouvpstats
Does imposition of narrative destroy opportunities for open-ended exploration eg with interactive visualisations? #ouvpstats
Qn: many statisticians come from maths background; need interdisplinary team of eg storytellers and designers…? A: Yes #ouvpstats
Fundamental dilemma: how to throw topic open to curiosity whilst providing narrative way in? #ouvpstats

From @JackieCarter: Discussion at #ouvpstats music to my ears. See http://bit.ly/lL4iBE and links from it to lots of work on this at Mimas
From @PhilDRoberts: Following #ouvpstats from my desk, gutted not to be there, have added to the Archivist see http://bit.ly/jS7tX6

Next up: John Aldrich http://bit.ly/mQAF2Z #ouvpstats
Victorian statistics… 1838 “collection and comparison of Facts which illustrate the condition of mankind” #ouvpstats
History of Victorian stats graphics by Funkhouser “Historical Development of the Graphical Representation of Statistical Data” #ouvpstats
First diagram in Statistical Journal was a line diagram in 1841 by Daniel Griffin, Limerick Literary and Scientific Society #ouvpstats
Great figure of Victorian statistics – William Farr http://bit.ly/kJdXpm – on occasion, did pictures “for special reports” #ouvpstats
Farr – “temperature and mortality of London” time series eye candy http://bit.ly/mcfB2E #ouvpstats Also reported on Crimea War…
… as did Florence Nightingale #ouvpstats Nightingale’s rose etc http://bit.ly/lzEBLO “Statistical aesthetics lagging behind”
Farr’s and Nightingale’s diagrams recognised by reviewers as remarkable but never became part of standard fare of communications #ouvpstats
First economic diagram in Statistical Journal 1847 John Towne Danson (“journalist”) – stats since passing of 1844 Bank Act #ouvpstats
W. Stanley Jevons’s statistical atlas 1860 #ouvpstats
Mulhall’s dictionary of statistics – full of pictograms BUT “real statisticians don’t do diagrams” #ouvpstats
New breed of statistician in 1890s who did make more of diagrams #ouvpstats
Visual adventurousness of mid 1800s did not become routine but time series diagrams did become routine #ouvpstats

From @agdturner: @psychemedia I wonder if you know about the work of John Snow: http://en.wikipedia.org/wiki/John_Snow_(physician) #ouvstats
From @agdturner: @psychemedia And then developing on this Stan Openshaw: http://en.wikipedia.org/wiki/Stan_Openshaw #ouvstats
@agdturner yes – but talk was more on history of charts/graphs rather than geo #ouvpstats

Next up: David Spiegelhalter on visualising risk/uncertainty http://bit.ly/inRfP8 h/t to McCandless, Fry… #ouvpstats
Quick typology of probability… decimal, fraction, percentage, odds, line/bar/chart etc, static/dynamic #ouvpstats
Relevance of positive/negative framing: “1% of blah” vs “99% of blah” #ouvpstats
“Icon arrays .. generally considered quite nice” [but...?!?] http://1.usa.gov/lrPHix #ouvpstats
Would Nightingale have used animation if Flash had been around? Animation: http://understandinguncertainty.org/nightingale
Visual football stats/predictions kickoff.co.uk Dangers using circles (area/angles). #ouvpstats
Fox: rubbish pie chart http://bit.ly/lvVpwn #ouvpstats
Most psychology experiments “sadly small” (“and what is an f-test anyway…!?”) #ouvpstats
Simplified “x in y” language generally deemed bad practice (changing denominator then changing numerator) #ouvpstats
Don’t use probabilities in teaching probability trees… use eg a population of 100 or 1000 to develop the intuition #ouvpstats
Assessing risk/uncertainty in screening tests – animating false positives http://bit.ly/leNd83 #ouvpstats
“Cone of uncertainty” in eg hurricane forecasts #ouvpstats What’s wrong with these? eg http://journals.ametsoc.org/doi/abs/10.1175/BAMS-88-5-651
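The “population of 100 or 1000” suggestion for probability trees, and the point about false positives in screening tests, can be worked through as a quick sketch. All of the numbers below (prevalence, detection rate, false alarm rate) and the `natural_frequencies` helper are hypothetical, chosen only to give whole numbers of people; they are not from the talk.

```python
from fractions import Fraction

def natural_frequencies(population, prevalence, sensitivity, false_positive_rate):
    """Turn the probabilities in a screening tree into counts of people."""
    affected = population * prevalence
    unaffected = population - affected
    true_positives = affected * sensitivity
    false_positives = unaffected * false_positive_rate
    return {
        "affected": affected,
        "unaffected": unaffected,
        "true_positives": true_positives,
        "false_negatives": affected - true_positives,
        "false_positives": false_positives,
        "true_negatives": unaffected - false_positives,
        # Of everyone who tests positive, what share actually has the condition?
        "share_of_positives_affected": true_positives / (true_positives + false_positives),
    }

# Hypothetical test: 1 in 100 people affected, 9 in 10 of those detected,
# 1 in 10 unaffected people wrongly flagged
tree = natural_frequencies(1000, Fraction(1, 100), Fraction(9, 10), Fraction(1, 10))
for branch, value in tree.items():
    print(branch, value)
```

With these made-up figures, 9 of the 10 affected people test positive but so do 99 of the 990 unaffected people, so only 9 in 108 positive results belong to someone who actually has the condition. Counting people rather than multiplying probabilities keeps the denominator in view at each branch of the tree.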

Written by Tony Hirst

May 18, 2011 at 11:43 am

Posted in Anything you want


Data Driven Journalism – Survey

with one comment

The notion of data driven journalism appears to have some sort of traction at the moment, not least as a recognised use context of some very powerful data handling tools, as Simon “Guardian Datastore” Rogers’ appearance at Google I/O suggests:


(Simon’s slot starts about 34:30 in, but there’s a good tutorial intro to Fusion Tables from the start…)

As I start to doodle ideas for an open online course on something along the lines of “visually, data” to run October-December, data journalism is going to provide one of the major scenarios for working through ideas. So I guess it’s in my interest to promote this European Journalism Centre: Survey on Data Journalism to try to find out what might actually be useful to journalists…;-)

[T]he survey Data-Driven Journalism – Your opinion aims to gather the opinion of journalists on the emerging practice of data-driven journalism and their training needs in this new field. The survey should take no more than 10 minutes to complete. The results will be publicly released and one of the entries will win a EUR 100 Amazon gift voucher

I think the EJC are looking to run a series of data-driven journalism training activities/workshops too, so it’s worth keeping an eye on the EJC site if #datajourn is your thing…

PS related: the first issue of Google’s “Think Quarterly” magazine was all about data: Think Data

PPS Data in journalism often gets conflated with data visualisation, but that’s only a part of it… Where the visualisation is the thing, then here’s a few things to think about…


Ben Fry interviewed at Where 2.0 2011

Written by Tony Hirst

May 13, 2011 at 12:56 pm
