OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for January 2011

Open University Undergraduate Module Map

with 2 comments

Picking up on data.open.ac.uk Linked Data Now Exposing Module Information, which describes how to query the Open University linked data store for course (that is, module) information, I’ve just posted an SVG map of how all the current* OU undergraduate courses relate.

* if I understood @mdaquin correctly… specifically, a course is current if a description field is available for it.

You can see the graph here or by clicking through on the image below (if you’re using at least Firefox, Safari or Chrome, you should be able to click and drag the image to move it around, as well as zoom in and out).

The links between courses are the ‘related to’ links contained within the linked data. The nodes are sized according to degree and coloured according to modularity group, following application of the Gephi modularity statistic. The layout is an expanded form of a Yifan Hu layout.

The modularity statistic seems to identify clusters of courses reasonably well, allowing a student or potential student to get an overall view of the courses offered by the OU along with the courses that are naturally taken together. It may be interesting to explore the extent to which this sort of view may be used as a navigation surface, or made a little more interactive, for example by displaying information about a course (maybe including price and start date?!;-) when a user hovers over a course node.

Written by Tony Hirst

January 30, 2011 at 8:32 pm

Posted in OU2.0, Visualisation


Getting Started With Local Council Spending Data

with 3 comments

With more and more councils doing as they were told and opening up their spending data in the name of transparency, it’s maybe worth a quick review of how the data is currently being made available.

To start with, I’m going to consider the Isle of Wight Council’s data, which was opened up earlier this week. The first data release can be found (though not easily?!) as a pair of Excel spreadsheets, both of which are just over 1 MB in size, at http://www.iwight.com/council/transparency/ (This URL reminds me that it might be time to review my post on “Top Level” URL Conventions in Local Council Open Data Websites!)

The data has also been released via Spikes Cavell at Spotlight on Spend: Isle of Wight.

The Spotlight on Spend site offers a hierarchical table based view of the data; value add comes from the ability to compare spend with national averages and that of other councils. Links are also provided to monthly datasets available as a CSV download.

Uploading these datasets to Google Fusion tables shows the following columns are included in the CSV files available from Spotlight on Spend (click through the image to see the data):

Note that the Expense Area column appears to be empty, and that “clumped” transaction dates seem to be in use? Also note that each row, column and cell can be commented upon.

The Excel spreadsheets on the Isle of Wight Council website are a little more complete – here’s the data in Google Fusion tables again (click through the image to see the data):

(It would maybe be worth comparing these columns with those identified as Mandatory or Desirable in the Local Spending Data Guidance? A comparison with the format the esd use for their Linked Data cross-council local spending data demo might also be interesting?)

Note that because the Excel files on the Isle of Wight Council website were larger than the 1MB size limit on XLS spreadsheet uploads to Google Fusion Tables, I had to open the spreadsheets in Excel and then export them as CSV documents. (Google Fusion Tables accepts CSV uploads for files up to 100MB.) So if you’re writing an open data sabotage manual, this may be something worth bearing in mind (i.e. publish data in very large Excel spreadsheets)!

It’s also worth noting that if different councils use similar column headings and CSV file formats, and include a column stating the name of the council, it should be trivial to upload all their data to a common Google Fusion Table allowing comparisons to be made across councils, contractors with similar names to be identified across councils, and so on… (i.e. Google Fusion tables would probably let you do as much as Spotlight on Spend, though in a rather clunkier interface… but then again, I think there is a fusion table API…?;-)
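To make the "common table" idea a little more concrete, here is a minimal sketch in Python of combining several councils' CSV files into one table before uploading it anywhere. It is only an illustration: the `Council` column name, the function names and the sample rows are all made up, and real council files would almost certainly need their headings reconciled first.

```python
import csv
import io

def combine_council_csvs(sources):
    """Merge spending rows from several councils into one list of dicts.

    `sources` maps a council name to an open text file of CSV data. Each row
    gains a 'Council' column (an assumed name) so rows stay attributable.
    """
    combined = []
    for council, handle in sources.items():
        for row in csv.DictReader(handle):
            # Strip stray whitespace from headings so near-identical headers line up
            clean = {key.strip(): value for key, value in row.items() if key is not None}
            clean["Council"] = council
            combined.append(clean)
    return combined

def write_combined(rows, handle):
    """Write the merged rows as a single CSV, using the union of all headings."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Made-up sample data standing in for two councils' monthly files
iow = io.StringIO("Supplier,Amount\nAcme Ltd,1200.00\n")
other = io.StringIO("Supplier,Amount\nAcme Ltd,830.50\nBuildCo,5000.00\n")
rows = combine_council_csvs({"Isle of Wight": iow, "Exampleshire": other})
out = io.StringIO()
write_combined(rows, out)
```

With real files you would pass open file handles rather than `io.StringIO` objects; the combined CSV could then be uploaded as a single table.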

Although the data hasn’t appeared there yet, I’m sure it won’t be long before it’s made available on OpenlyLocal:

However, the Isle of Wight’s hyperlocal news site, Ventnorblog, teamed up with a local developer to revise Adrian Short’s Armchair Auditor code and released the OnTheWight Armchair Auditor site:

So that’s a round up of where the data is, and how it’s presented. If I get a chance, the next step is to:
- compare the offerings with each other in more detail, e.g. the columns each view provides;
- compare the offerings with the guidance on release of council spending data;
- see what interesting Google Fusion table views we can come up with as “top level” reports on the Isle of Wight data;
- explore the extent to which Google Fusion Tables can be used to aggregate and compare data from across different councils.

PS related – Nodalities blog: Linked Spending Data – How and Why Bother Pt2

PPS for a list of local councils and the data they have released, see Guardian datastore: Local council spending over £500, OpenlyLocal Council Spending Dashboard

Written by Tony Hirst

January 28, 2011 at 11:59 am

Tags Associated With Other Tags on Delicious Bookmarked Resources

with 3 comments

If you’re using a particular tag to aggregate content around a particular course or event, what do the other tags used to bookmark those resources tell you about that course or event?

In a series of recent posts, I’ve started exploring again some of the structure inherent in socially bookmarked and tagged resource collections (Visualising Delicious Tag Communities Using Gephi, Social Networks on Delicious, Dominant Tags in My Delicious Network). In this post, I’m going to look at the tags that co-occur with a particular tag that may be used to bookmark resources relating to an event or course, for example.

Here are a few examples, starting with cck11, using the most recent bookmarks tagged with ‘cck11’:

The nodes are sized according to degree; an edge indicates that the two tags were both applied by an individual user to the same resource (so if three (N) tags were applied to a resource (A, B, C), there are N!/(K!(N-K)!) pairwise (K=2) combinations: AB, AC, BC; that is, three combinations in this case).

Here are the tags for lak11 – can you tell what this online course is about from them?

Finally, here are tags for the OU course T151; again, can you tell what the course is most likely to be about?

Here’s the Python code I used to generate the gdf network definition files used to generate the diagrams shown above in Gephi:

import json, os, time, urllib.request

def getDeliciousTagURL(tag,typ='json', num=100):
  #need to add a pager to get data when more than 1 page
  return "http://feeds.delicious.com/v2/"+typ+"/tag/"+tag+"?count="+str(num)

def openTimestampedFile(fpath,fname):
  #stand-in for a helper not shown in this post; assumed to open a
  #timestamped file for writing in the fpath directory
  if not os.path.exists(fpath):
    os.makedirs(fpath)
  ts=time.strftime('%Y-%m-%d-%H-%M-%S')
  return open(os.path.join(fpath,ts+'-'+fname),'w',encoding='utf-8')

def getDeliciousTaggedURLTagCombos(tag):
  durl=getDeliciousTagURL(tag)
  data = json.load(urllib.request.urlopen(durl))
  uniqTags=[]
  tagCombos=[]
  for item in data:
    #the 't' field holds the list of tags applied to the bookmark
    tags=item['t']
    for t in tags:
      if t not in uniqTags:
        uniqTags.append(t)
    if len(tags)>1:
      #each pair of tags on the same bookmark becomes an edge
      for i,j in combinations(tags,2):
        print(i,j)
        tagCombos.append((i,j))
  #write the nodes, then the edges, to a gdf file for Gephi
  f=openTimestampedFile('delicious-tagCombos',tag+'.gdf')
  header='nodedef> name VARCHAR,label VARCHAR, type VARCHAR'
  f.write(header+'\n')
  for t in uniqTags:
    f.write(t+','+t+',tag\n')
  f.write('edgedef> tag1 VARCHAR,tag2 VARCHAR\n')
  for i,j in tagCombos:
    f.write(i+','+j+'\n')
  f.close()

def combinations(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    # indices must be a mutable list, as its entries are updated below
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)
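For anyone wanting to try the pairing step without calling the Delicious feed, here is a small sketch that works on an in-memory list of bookmarks and uses the standard library's `itertools.combinations` in place of the hand-rolled generator. The function names and the sample bookmarks are made up for illustration; only the `t` field name and the gdf header lines are taken from the code above.

```python
from itertools import combinations
from math import comb

def tag_cooccurrences(bookmarks):
    """Return (unique_tags, tag_pairs) from bookmark dicts carrying a 't' tag list."""
    unique_tags = []
    tag_pairs = []
    for bookmark in bookmarks:
        tags = bookmark.get("t", [])
        for tag in tags:
            if tag not in unique_tags:
                unique_tags.append(tag)
        # every unordered pair of tags on the same bookmark becomes an edge
        tag_pairs.extend(combinations(tags, 2))
    return unique_tags, tag_pairs

def to_gdf(unique_tags, tag_pairs):
    """Lay the nodes and edges out as gdf text, using the headers from the post."""
    lines = ["nodedef> name VARCHAR,label VARCHAR, type VARCHAR"]
    lines += [tag + "," + tag + ",tag" for tag in unique_tags]
    lines.append("edgedef> tag1 VARCHAR,tag2 VARCHAR")
    lines += [a + "," + b for a, b in tag_pairs]
    return "\n".join(lines) + "\n"

# Made-up bookmarks: three tags on the first, two on the second, one on the third
sample = [{"t": ["A", "B", "C"]}, {"t": ["A", "D"]}, {"t": ["E"]}]
unique_tags, tag_pairs = tag_cooccurrences(sample)
gdf_text = to_gdf(unique_tags, tag_pairs)

# Three tags give N!/(K!(N-K)!) = 3 pairs when K=2
pairs_for_three_tags = comb(3, 2)
```

A bookmark with a single tag contributes a node but no edges, which is why isolated tags can still show up in the Gephi view.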

Next up? I’m wondering whether a visualisation of the explicit fan/network (i.e. follower/friend) delicious network for users of a given tag might be interesting, to see how it compares to the ad hoc/informal networks that grow up around a tag?

Written by Tony Hirst

January 27, 2011 at 2:18 pm

Corporate Data Analyst and Online Comms Jobs at the OU


Though I’m sure these sorts of job have been advertised for years, it’s interesting tracking how they’re being represented at the moment, and the sorts of skills required.

Corporate Data and MI Analyst, Marketing (£29,853 – £35,646)

Main Purpose of the Post:
The post holder is a member of the Campaign Planning and Data team and will be required to play a pro-active role in that team, balancing the needs and recommendations of their own areas of responsibility with the wider needs and priorities of the team and the whole Marketing & Sales Unit.

This post has been constructed to assist the University to develop its marketing capacity so that challenging targets can be met. It will be essential for the post holder to work to harness the energies of academic and academic related staff in the University’s academic units, service units and regions to develop a more effective marketing strategy. This will require influencing and networking skills and an ability to adapt engagement style to an academic context.

The post holders work within a team producing Campaign Plans for both new and continuing students. The plans drive the allocation of over £10M of promotional activities (acquisition and retention campaigns).

Description of Duties of the Post:

Contribute to optimising the University’s customer targeting capability via regular reappraisal of segmentation policy with a view to increasing market share in high yield segments
Contribute to development and delivery of robust models, tools, skills and resources to enable segmentation, competitor and market analysis and data mining within the Campaign Planning Team and more widely within Marketing and Sales.

Planning 60%
Input into overall marketing plans and support planning process.
Segment the prospect data mart by developing key prospect indicators to provide Response, Reservation, Registration, Retention and other key metric predictions for each.
Support quantification of product performance predictions to provide Response, Reservation, Registration, Retention and other key metric predictions for each.
Maintain and contribute to development of a targeting model, which overlays product performance predictions/actual by segment over the agreed marketing plan to provide a targeting matrix.
Communicate targeting matrix to stakeholders and overlay tests and current campaign activity to provide an agreed campaign plan based on minimising Cost per Registration and maximising marketing mix and integration strategy.
Monitor performance daily and update segmentation, product and targeting models to maintain a data driven test and learn cycle. Identify significant deviations from forecast and potential actions.
Continually review the Customer Journey through input into creation of a Retention model based on a balanced scorecard approach. Work with key stakeholders to prioritise and implement developments.
Input into model validation and quality control.

Data 20%
Support development of a marketing data mart to primarily support marketing analysis and campaign execution.
Provide input into marketing data developments encouraging sharing of data and best practise.
Support development of in-house tools and processes to improve marketing analysis and campaign execution, primarily SAS and SIEBEL. Support other areas in evaluating tools and systems.
Where appropriate, maintain the relationship with OU data providers ensuring relevant data processing, development, quality and SLA’s are controlled.
Promote data use within marketing and other OU areas, maximising the use of data and providing a hub for data developments to be controlled

MI 20%
Input into development of key performance measures to be used across the OU.
Develop relationships with key OU stakeholders to ensure common goals are met.
Facilitate the use of marketing data across the OU and develop tools to support.
Support data focused research and tests with analytical input.
Input into development and maintenance of campaign performance measures.

Person Spec – Essential
Substantial experience in a campaign planning, analysis or similar role including for example; campaign execution, data extraction, the development of data infrastructure.
Experience of Direct Marketing.
Experience of B2B and / or B2C marketing.
Experience of data propensity and segmentation modelling.
A balance of marketing analysis and technical skills, including data quality and protection.
Experience of test and learn data driven analysis, targeting processes and systems;
Proven ability to see trends in data and drill down to issues or key data.
Proven ability to develop relationships with key decision makers and stakeholders.
Proven ability to translate marketing requirements into planning / execution requirements.
Excellent presentation and facilitation skills.
Provide analytic support and direction to colleagues to ensure understanding.
Proven ability to meet challenging deadlines without compromising quality.

Still no adverts* for a “learning data analyst” though, tasked with analysing data to see:

- whether effective use is being made of linked-to resources, particularly subscription Library content and open educational resources;

- whether there’s anything in the student activity data and/or social connection data we can use to predict attainment and/or satisfaction levels or improve retention levels.

* That said, IET do collect lots of stats, and I think a variety of stats are now available relating to activity on the Moodle VLE. I’m not sure who does what with all that data though…?

PS I wonder if any of the analysts that companies like Pearson presumably employ look to model ways of maximising the profitability, to those companies, of student acquisition and retention, given education is their business? (See also: Apollo Group results – BPP and University of Phoenix, Publishing giant Pearson looks set to offer degrees).

PPS This job ad may also be of interest to some? Online Communications Officer, Open University Business School (£29,853 – £35,646)

Again, it’s interesting to mark what’s expected…

This brand new role in the School will drive the development of online communications. Focusing on increasing engagement and traffic through the website, you will ensure this work is appropriately integrated into the wider work of the University’s Online Services, Marketing and Communications teams. Reporting to the Director of Business Development and External Affairs, it will be your responsibility to develop the website including content, usability, optimisation, interactivity and driving increased visitor numbers and online registration. You will continually find new and inventive ways to engage with our stakeholders and promote the reputation of the Business School through the online channel.

Your responsibilities will also extend to the School’s virtual presence through social networks, iTunes U and YouTube and utilise these channels to our advantage. You will increase our presence as well as delivering virtual campaigns to improve the overall student numbers. In this role, it will also be your responsibility to develop relationships with other areas of the University engaged in this work and will play a key role in the management of these relationships.

Summary of Duties
The main duties of the Online Communications Officer are detailed below.

• Advance the social media strategy ensuring it is inline with the Universities media position, market response and the development of new technology.
• Manage the online activity of the Business School’s social media communities
• Liaise, as appropriate, with units within The Open University, such as Online Services to keep up to date with policy changes and AACS regarding technical developments.
• Liaise with the Business School’s Information Officer for the maintenance and feeding of the Research Database into the website
• Generate assets to host on the website e.g. an Elluminate Demo Video
• Keep abreast of trends and developments to ensure that the Business School’s online presence remains at the forefront
• Work alongside Online Services, to monitor the visitor traffic of the website and establish appropriate and effective KPIs for dissemination across the Business School, for example through the creation of a dashboard.
• Engage in personal development based on organisational needs and developments to foster a high level of professional skills and technical ability
• Ensure that corporate branding and media guidelines are adhered to
• Understand and appreciate internal procedures and standards and be proactive in recommending improvements
• Ability to apply best professional practice to deliver effective solutions that take into account technical, budgetary and other project considerations
• Edit the content of both the internet and intranet
• Collate, interpret and select key information for dissemination on the latest trends and research in social media both within the OU and externally
• Produce graphics where necessary or liaise with designers in the University or outside agencies to produce graphics.
• Create/collate digital assets including audio and video files
• Post moderate discussion forums
• Disseminate best practice through a variety of communications channels eg project website, OU Life news, brief updates etc.
• Develop and maintain awareness of different audience needs in relation to appropriate communications channels (eg email, screensaver, website, print).
• Act as a flexible member of the Business Development and External Affairs team.
• Carry out other tasks as specified from time to time by the Project Director

Related: Joanne Jacobs’ Are you social or anti-social?: “How to employ a Social Media Strategist, and how you should measure their performance. (Social media isn’t going away. But some Social Media Strategists should go away.)”

Written by Tony Hirst

January 27, 2011 at 11:51 am

Posted in Analytics, Jobs

Visualising OU Academic Participation with the BBC’s “In Our Time”

with 3 comments

Although not an OU/BBC co-pro, the “get some academics in to chat to Melvyn” format of BBC Radio 4’s In Our Time means that the OU has, over the years, had a handful of academics appearing on the programme. Having been mulling over opportunities for playing with the BBC programmes linked data (no RDF required), I wondered how easy it would be to grab the programmes that OU academics have appeared on. For example, it’s increasingly possible to see programmes associated with particular places (h/t to @gothwin for that; see his post on A Crude BBC Places Linked Data mashup for an application of that data) although the organisations listing is still a bit sparse.

Looking through the programme data, the participants in a programme are listed separately, but not their affiliations. However, in the free text that is used in the long synopsis of the programme, a convention exists to identify the guests, with affiliation or short bio, who appeared on that particular programme:

In the post Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags, I described how the Thomson Reuters’ OpenCalais entity extraction/semantic tagging service could be used to augment the BBC programme data with additional data fields based on analysis of the supplied text. One of the extraction services identifies a set of related fields termed PersonCareer, which detail (where possible) the name of a person, their role, and the organisation they work for. The convention used to list the guests on each programme is appropriate for the extraction of PersonCareer data, at least in some cases.

Rather more reliable is the extraction of University names as Facility data types. What this means is that we can tag each programme with a list of Facilities relating to the universities represented by guests on the programme, and then – where a PersonCareer is extracted – attempt to text match the PersonCareer/Organization name with the extracted Facility name. (Sample code is available here. I had “issues” with character encodings, so there is an element of hackery involved:-( In order to aggregate data from across programmes in the series, I built up a network of programmes and participating institutions using a NetworkX representation, which then gets dumped to output files in a variety of graph formats.)
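As a rough illustration of the text matching step (not the linked sample code), here is a sketch that tries to match an extracted organisation name against the Facility names found for the same programme, and builds a simple edge list from the result. The dictionary layout and field names (`title`, `facilities`, `person_careers`, `person`, `organization`) are hypothetical and do not reflect the actual OpenCalais response format; the sample programme is made up.

```python
import re

def normalise(name):
    """Lower-case a name and collapse punctuation and whitespace for loose matching."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def match_organisation(organisation, facilities):
    """Return the first facility whose normalised name contains, or is contained in, the organisation's."""
    org = normalise(organisation)
    for facility in facilities:
        fac = normalise(facility)
        if org and fac and (org in fac or fac in org):
            return facility
    return None

def programme_edges(programmes):
    """Build (source, target) edges linking programmes and people to institutions."""
    edges = []
    for prog in programmes:
        for facility in prog["facilities"]:
            edges.append((prog["title"], facility))
        for career in prog["person_careers"]:
            matched = match_organisation(career["organization"], prog["facilities"])
            # no matching Facility: fall back to the organisation name as extracted
            edges.append((career["person"], matched or career["organization"]))
    return edges

# Made-up programme record in an assumed layout
sample = [{
    "title": "Example Programme",
    "facilities": ["Open University", "University of Exampleton"],
    "person_careers": [
        {"person": "A. Guest", "organization": "the Open University"},
        {"person": "B. Guest", "organization": "Example College"},
    ],
}]
edges = programme_edges(sample)
```

Substring matching like this is deliberately crude; it will miss abbreviations and can over-match short names, so the resulting edges would still want checking by eye.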

Here’s an example of the output, filtered to show programmes and programme tags (from the BBC data, rather than Calais extracted tags) that had some sort of association with the Open University:

The above diagram is actually a filtered view over the whole programme’n’university representation network using the Gephi ego network filter:

Node sizing is related to degree in this sub-network, and nodes are coloured according to node type (person, institution, tag, programme). The graph shows programmes that an OU academic appeared on, and (where possible) which OU academic, by name. Programme tags from the BBC programme data are also shown, as are other institutions that appeared on the same programmes as the OU.

Here’s a snapshot of the full graph – you’ll notice there is some mismatch* in references between the universities mentioned that could possibly be reconciled using a string similarity technique or maybe running the data through Google Refine and using one or more of its string similarity/reconciliation tools.

* things are actually even more pathological: in some cases, I think that Oxbridge Colleges may be identified in PersonCareer metadata as the career organisation, rather than the university affiliation, which may well have been recognised as a Facility. If an organisation identified in a PersonCareer is not one of the Facilities that has been identified and added to the graph, the organisation is also added. The question we’re left with is: do the errors such as they are make this graph, such as it is, completely useless, or is it better than nothing and something we can work with and improve incrementally as and how we can? [UPDATE: related maybe? Making Linked Data work isn’t the problem]
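One cheap way of spotting candidate duplicates among institution names is to group them by a normalised key, loosely in the spirit of key-collision clustering. The sketch below is only that: the stopword list is an assumption chosen for university names, the function names are made up, and the sample names are illustrative rather than taken from the graph.

```python
import re
from collections import defaultdict

# Assumed stopwords: words that carry little distinguishing information here
STOPWORDS = {"the", "of", "university"}

def fingerprint(name):
    """Reduce a name to a sorted, de-duplicated set of its distinguishing words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(set(tokens) - STOPWORDS))

def cluster_names(names):
    """Group names sharing a fingerprint; return only groups with more than one member."""
    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)
    return {key: members for key, members in clusters.items() if len(members) > 1}

# Illustrative names only
names = [
    "University of Oxford",
    "Oxford University",
    "The Open University",
    "Open University",
    "University of Cambridge",
]
clusters = cluster_names(names)
```

This will not help with the college-versus-university problem described above, since a college name shares no key with its parent university; that would need a lookup table or manual reconciliation.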

I’m not sure what the next step should be, but linking the OU ego-graph into the OU Linked Data would be one way forward. For example, displaying papers in ORO authored by appearing academics, or trying to relate programmes to related courses on the OU course catalogue (or even, though not indexed in the OU Linked Data store, courses on OpenLearn). A big problem with brokering the Linked Data connections is that I’d have to do free text/regular expression searches on the OU Linked Data store using terms from the BBC/OpenCalais data. That is, there are no common unique identifiers/URIs that can be used as “proper” linking terms:-(

Written by Tony Hirst

January 21, 2011 at 7:00 pm

Posted in BBC, Data, OBU, OU2.0


Putting Public Open Data to Work…?

with 6 comments

A couple of posts out today on the Guardian Datablog review the progress of the UK’s open government data project to date (Government data UK: what’s really been achieved?, Nigel Shadbolt: A year of data.gov.uk).

One of the pieces of the jigsaw that I think has been largely ignored in many of the discussions I have seen around open data is the extent to which open data is incorporated into the productive workflow of an organisation, rather than just venting the data exhaust of an organisation every so often and pretending something useful has been released…

One of the reasons commonly given for why organisations should open up their data is the idea that a more effective use of your data may well be found by someone else: for example, by identifying previously unimagined ways of unlocking value, or exploiting network effects that arise from being able to merge one dataset with another.

But it seems to me that the data should also be useful to the organisation that released it (otherwise, why collect it?), and that one way of making the most of open data stores is to put the data to work, by requiring day-to-day users of the data to access it via the datastore (an “eat your own dog food” argument…).

If data is collected and reported on within an organisation, then consideration should be given as to whether the workflow associated with that data might pass through an open data store, or whether the open data store might provide a view over data as it passes through that workflow. Where data is reported in a public (and maybe even just FOIable) way, submission of the report – for example, to a Ministry – might pass through the open data store. That is, rather than local gov sending data based reports to central gov, central gov picks up those reports from the local gov’s open data site.

That is: in cases where public institutions are currently expected to push open data and reports based on open data to other public institutions (or maybe other internal departments), they should start pushing that data to public/open datastores, and let the other party pull the data from that public data store.

What this means is that the business of data is, wherever possible, mediated through open and public datastores. [RELATED: list of local government data burdens]

Another approach that I think might hold promise is for local councils to develop on their own, or in partnership with other councils or private enterprise, data driven sites in areas where they have a particular specialism or interest (which may include a potentially commercial interest), and then offer these services under a subscription basis to other councils across the UK and in so doing develop a national reach. For national scale delivery, it might be that this is handled by a commercial partner, with the original developing council taking a commercial stake and getting a return.

The national aggregation of local services idea is worth bearing in mind because we shouldn’t necessarily expect a user of a council data powered website delivering location based or location related services to know which council a particular location falls in.

For example, (and I know this isn’t a council service…), something that is done locally but at national scale is blood donation.

The NHS Blood donor service site provides a means of identifying the time and place for local collection sessions. It may well be that the data is generated on a regional basis, but why should there be any more than a single place to go to find out this information?

Here are some more examples of services started locally that might scale:

RateMyPlace – food inspection ratings, currently for a handful of councils in the Staffordshire area.

Who Owns My Neighbourhood? – identify council owned land and buildings, currently in the Kirklees area.

Your Ceremony – venue hire and registrar booking for civil ceremonies, via East Cheshire council.

It’s not just councils who might initiate these vertical, cross-council data or service sites of course. For example, FixMyStreet is a MySociety site for reporting local issues relating mainly to the built environment, such as potholes. FixMyStreet engages citizens in the human level instrumentation of local neighbourhoods, changing attitudes as it does from “the making of complaints” to “the reporting of issues”. FixMyStreet also offers other issue tracking features such as the ability for issues to be “closed” when they are fixed. I don’t find it hard to imagine that councils might pay a small subscription for additional reporting and management tools built around the FixMyStreet workflow, although the practicalities of that might be very different! The FOI requesting site What Do They Know could also be seen in a similar light as a workflow for managing FOI requests.

Architecturally, there may be several different approaches to the design of these sites, and how to engage with them. For example, if a site has a write API, or can import data from a variety of document formats, albeit ones structured in a particular way, a council might easily write or upload data that can be normalised and presented in a consistent way by the aggregating site.

For councils that do publish data, but in an inconsistent way, aggregators such as OpenlyLocal attempt to scrape the data and normalise it to provide a uniformity of access to data, in a consistent form, across council regions.

Where aggregation sites normalise data and represent it via an API, it provides an opportunity for third party developers or vendors to invest in the production of a single application that can be configured to present localised views over the data to individual councils, or allow council developers to share widget code with other councils.

Sites that aggregate local data at national scale are thus beneficial in several ways:
- they provide consistency of experience for folk who move between areas and mask boundaries between the sources of data (which is fine, if everyone is reporting in a consistent way…);
- they minimise the need for a user to know which council area they are in;
- they provide the ability to compare activity across neighbouring regions;
- they provide the opportunity for additional services operating at national scale to make use of the data.

PS if you know of any other councils that have developed vertical sites, ideally data driven ones, such as RateMyPlace or Who Owns My Neighbourhood, that could scale nationally and mask council borders, please add a link in the comments:-)

A good place to look for new services may be workflows where there is the production of standardised reports, e.g. to central government or using standard procedures, and perhaps more revealingly, standard forms! That is, if every local gov inspector across the UK uses form bZ23/A to file a particular sort of report, it could suggest that a data driven service around that data might scale to a national level… or not;-)

PS for a corollary to this – data horizontals with a local focus, see Greg Hadfield on local news sites “re-inventing themselves as local data hubs” (Open-data cities: a lifeline for local newspapers).

Written by Tony Hirst

January 21, 2011 at 1:28 pm

Posted in Data, Policy, Thinkses

A Few More Thoughts on GetTheData.org


As we come up to a week in on GetTheData.org, there’s already an interesting collection of questions – and answers – starting to appear on the site, along with a fledgling community (thanks for chipping in, folks:-), so how can we maintain – and hopefully grow – interest in the site?

A couple of things strike me as the most likely things to make the site attractive to folk:

- the ability to find an appropriate – and useful – answer to your question without having to ask it, for example because someone has already asked the same, or a similar, question;
- timely responses to questions once asked (which leads to a sense of community, as well as utility).

I think it’s also worth bearing in mind the context that GetTheData sits in. Many of the questions result in answers that point to data resources that are listed in other directories. (The links may go to either the data home page or its directory page on a data directory site.)

Data Recommendations
One thing I think is worth exploring is the extent to which GetTheData can both receive and offer recommendations to other websites. Within a couple of days of releasing the site, Rufus had added a recommendation widget that could recommend datasets hosted on CKAN that seem to be related to a particular question.

GetTheData.org - related datasets on CKAN

What this means is that even before you get a reply, a recommendation might be made to you of a dataset that meets your requirements.

(As with many other Q&A sites, GetTheData also tries to suggest related questions to you when you enter your question, to prompt you to consider whether or not your question has already been asked – and answered.)

I think the recommendation context is something we might be able to explore further, both in terms of linking to recommendations of related data on other websites, but also in the sense of reverse links from GetTheData to those sites.

For example:

- would it be possible to have a recommendation widget on GetTheData that links to related datasets from the Guardian datastore, or National Statistics?
- are there other data directory sites that can take one or more search terms and return a list of related datasets?
- could a getTheData widget be located on CKAN data package pages to alert package owners/maintainers that a question possibly related to the dataset had been posted on GetTheData? This might encourage the data package maintainer to answer the question on the getTheData site with a link back to the CKAN data package page.

As well as recommendations, would it be useful for GetTheData to syndicate new questions asked on the site? For example, I wonder if the Guardian Datastore blog would be willing to add the new questions feed to the other datablogs they syndicate?;-) (Disclosure: data tagged posts from OUseful.info get syndicated in that way.)

Although I don’t have any good examples of this to hand from GetTheData, it strikes me that we might start to see questions that relate to obtaining data which is actually a view over a particular data set. This view might be best obtained via a particular query onto a particular data set, such as a specific SPARQL query on a Linked Data set, or a particular Google query language request to the visualisation API against a particular Google spreadsheet.
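To make the idea of a "view as a query" a little more concrete, here's a minimal sketch of how a SPARQL query might be packed into a shareable GET URL, using the `query` parameter from the SPARQL protocol. The endpoint address and the query are made up for illustration; they aren't taken from any actual GetTheData question.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def sparql_view_url(endpoint, query):
    """Return a GET URL that asks a SPARQL endpoint to run the given query."""
    return endpoint + "?" + urlencode({"query": query})

# Hypothetical endpoint and query, for illustration only
ENDPOINT = "http://example.org/sparql"
QUERY = (
    "SELECT ?s ?label WHERE { "
    "?s <http://www.w3.org/2000/01/rdf-schema#label> ?label } LIMIT 10"
)

url = sparql_view_url(ENDPOINT, QUERY)
print(url)
```

A URL like this could then be posted as (part of) an answer, so the "view" travels with the question.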

If we do start to see such queries, then it would be useful to aggregate these around the datastores they relate to, though I’m not sure how we could best do this at the moment other than by tagging?

News announcements
There is a wide variety of sites publishing data independently, and a fractured network of data directories and data catalogues. Would it make sense for GetTheData to aggregate news announcements relating to the release of new data sets, and somehow use these to provide additional recommendations around data sets?

Hackdays and Data Fridays
As suggested in Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers:

If you’re running a hackday, why not use GetTheData.org to post questions arising in scoping the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

Alternatively, how about “Data Fridays”, on the first Friday in the month, where folk agree to check GetTheData two or three times that day and engage in something of a distributed data related Question and Answer sprint, helping answer unanswered questions, and maybe pitching in a few new ones?

Aggregated Search
It would be easy enough to put together a Google custom search engine that searches over the domains of data aggregation sites, and possibly also offer filetype search limits?
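A custom search engine is configured through Google's own interface, but the same idea can be roughly approximated with ordinary search operators. Here's a sketch that builds a query restricted to a set of domains and file types using `site:` and `filetype:`; the particular domains and file types are just examples I've picked, not a considered list.

```python
from urllib.parse import urlencode

def data_search_query(terms, sites, filetypes=()):
    """Build a search string restricted to the given sites and file types."""
    site_part = " OR ".join("site:" + s for s in sites)
    parts = [terms, "(" + site_part + ")"]
    if filetypes:
        type_part = " OR ".join("filetype:" + t for t in filetypes)
        parts.append("(" + type_part + ")")
    return " ".join(parts)

query = data_search_query(
    "hospital locations",
    ["data.gov.uk", "ckan.net"],  # example domains only
    ["csv", "xls"],               # example file types only
)
print(query)
print("https://www.google.com/search?" + urlencode({"q": query}))
```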

So What Next?
Err, that’s it for now…;-) Unless you fancy seeing if there’s a question you can help out on right now at GetTheData.org

Written by Tony Hirst

January 20, 2011 at 8:07 pm

Posted in Data

Tagged with , , , ,

Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers

with 4 comments

Where can I find a list of hospitals in the UK along with their location data? Or historical weather data for the UK? Or how do I find the county from a postcode, or a book title from its ISBN? And is there any way you can give me RDF Linked Data in a format I can actually use?!

With increasing amounts of data available, it can still be hard to:

- find the data you want;
- query a datasource to return just the data you want;
- get the data from a datasource in a particular format;
- convert data from one format to another (Excel to RDF, for example, or CSV to JSON);
- get data into a representation that means it can be easily visualised using a pre-existing tool.
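As an example of the sort of format conversion mentioned in the list above, here's a minimal sketch of going from CSV to JSON using only the Python standard library. It assumes the CSV has a header row, and the sample rows are made up.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)

# Made-up sample data
sample = "name,postcode\nHospital A,AB1 2CD\nHospital B,EF3 4GH\n"
result = csv_to_json(sample)
print(result)
```

Note that every value comes out as a string; converting numeric columns would need an extra step.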

In some cases the data will exist in a queryable and machine readable form somewhere, if only you knew where to look. In other cases, you might have found a data source but lack the query writing expertise to get hold of just the data you want in a format you can make use of. Or maybe you know the data is in a Linked Data store on data.gov.uk, but you just can’t figure how to get it out?

This is where GetTheData.org comes in. Get The Data arose out of a conversation between myself and Rufus Pollock at the end of last year, which resulted in Rufus setting up the site now known as getTheData.org.

getTheData.org

The idea behind the site is to field questions and answers relating to the practicalities of working with public open data: from discovering data sets, to combining data from different sources in appropriate ways, getting data into formats you can happily work with, or that will play nicely with visualisation or analysis tools you already have, and so on.

At the moment, the site is in its startup/bootstrapping phase, although there is already some handy information up there. What we need now are your questions and answers…

So, if you publish data via some sort of API or queryable interface, why not consider posting self-answered questions using examples from your FAQ?

If you’re running a hackday, why not use GetTheData.org to post questions arising in scoping the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.

If you’re looking for data as part of a research project, but can’t find it or can’t get it in an appropriate form that lets you link it to another data set, post a question to GetTheData.

If you want to do some graphical analysis on a data set, but don’t know what tool to use, or how to get the data in the right format for a particular tool, that’d be a good question to ask too.

Which is to say: if you want to GetTheData, but can’t do so for whatever reason, just ask… GetTheData.org

Written by Tony Hirst

January 17, 2011 at 1:27 pm

Matplotlib: Detrending Time Series Data

with one comment

Reading the rather wonderful Data Analysis with Open Source Tools (if you haven’t already got a copy, get one… NOW…), I noticed a comment that autocorrelation “is intended for time series that do not exhibit a trend and have zero mean”. Doh! Doh! And three times: doh!

I’d already come to the same conclusion, pragmatically, in Identifying Periodic Google Trends, Part 1: Autocorrelation and Improving Autocorrelation Calculations on Google Trends Data but now I’ll be remembering this as a condition of use;-)

One issue I had come across when working with the zero-mean and/or detrended data as calculated using Matplotlib was how to plot a copy of the detrended data directly. (I’d already worked out how to use the detrend_ filters in the autocorrelation function.)

The problem I had was that simply trying to plot mlab.detrend_linear(y), as applied to a list of values y, threw an error (“AttributeError: ‘list’ object has no attribute ‘mean’”). It seems that detrend expects y=[1,2,3] to have a method y.mean(), which it doesn’t, normally…

The trick appears to be that matplotlib prefers to use something like a numpy array, rather than a simple list, which offers these additional methods. But how is the data structured? A simple Google wasn’t much help, but a couple of cribs suggested that casting the list to y=np.array(y) (where import numpy as np) might be a good idea.

So let’s try it:

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np

label='run'
d=[0.99,0.98,0.95,0.93,0.91,0.93,0.92,0.95,0.95,0.94,0.96,0.98,0.97,1.00,1.01,1.05,1.06,1.06,0.98,0.98,0.98,0.97,0.96,0.93,0.93,0.96,0.95,1.05,0.97,0.95,1.01,1.02,0.98,1.01,0.98,1.00,1.06,1.04,1.06,1.04,0.97,0.94,0.92,0.90,0.87,0.88,0.85,0.90,0.91,0.87,0.88,0.88,0.91,0.91,0.88,0.91,0.92,0.91,0.90,0.92,0.87,0.92,0.92,0.92,0.94,0.97,0.99,1.01,1.01,1.04,0.97,0.94,0.98,0.94,0.98,0.91,0.93,0.92,0.95,1.00,0.93,0.93,0.96,0.96,0.96,0.97,0.95,0.95,1.06,1.12,1.01,1.00,0.99,0.98,0.96,0.93,0.91,0.92,0.92,0.94,0.94,0.94,0.90,0.86,0.89,0.93,0.90,0.90,0.90,0.90,0.89,0.92,0.91,0.92,0.93,0.93,0.94,0.99,0.98,0.99,1.01,1.06,1.06,0.96,0.98,0.92,0.92,0.93,0.91,0.90,0.93,1.02,0.90,0.93,0.91,0.93,0.95,0.93,0.91,0.92,0.96,0.93,1.02,1.02,0.91,0.88,0.87,0.87,0.84,0.82,0.82,0.84,0.83,0.85,0.80,0.80,0.87,0.85,0.83,0.80,0.84,0.83,0.84,0.88,0.83,0.88,0.88,0.86,0.91,0.93,0.91,0.97,0.96,1.00,1.01,0.98,0.94,0.97,0.94,0.95,0.92,0.93,0.97,1.02,0.95,0.92,0.91,0.95,0.93,0.94,0.91,0.92,0.98,0.99,0.97,0.98,0.90,0.86,0.87,0.91,0.87,0.86,0.86,0.89,0.89,0.87,0.86,0.83,0.85,0.86,0.90,0.87,0.87,0.90,0.89,0.93,0.93,0.97,0.99,0.95,1.00,1.05,1.03,1.04,1.08,1.05,1.05,1.05,1.05,1.01,1.07,1.02,1.02,1.04,1.00,1.04,1.17,1.03,1.01,1.02,1.05,1.06,1.05,0.99,1.07,1.03,1.05,1.07,1.04,0.97,0.94,0.97,0.93,0.94,0.96,0.96,1.04,1.05,1.04,0.96,1.00,1.04,1.01,1.00,0.99,0.99,0.99,1.03,1.05,1.02,1.06,1.07,1.04,1.16,1.19,1.12,1.18,1.19,1.16,1.12,1.12,1.09,1.12,1.11,1.12,1.06,1.05,1.14,1.26,1.09,1.12,1.13,1.16,1.18,1.22,1.17,1.24,1.28,1.35,1.19,1.16,1.11,1.11,1.13,1.13,1.11,1.09,1.06,1.07,1.09,1.09,1.03,1.05,1.04,1.04,1.03,1.03,1.06,1.09,1.17,1.12,1.11,1.14,1.20,1.18,1.24,1.19,1.21,1.22,1.22,1.27,1.25,1.18,1.15,1.18,1.17,1.11,1.09,1.10,1.12,1.26,1.15,1.15,1.16,1.16,1.15,1.12,1.15,1.14,1.20,1.31,1.17,1.18,1.14,1.15,1.14,1.12,1.17,1.11,1.10,1.11,1.14,1.10,1.08,1.06]

fig = plt.figure()
da=np.array(d)

#plot all three traces on a single set of axes
ax = fig.add_subplot(111)

#original data
ax.plot(da)

#linearly detrended data
y= mlab.detrend_linear(da)
ax.plot(y)

#the straight line that was subtracted to detrend the data
ax.plot(da-y)

plt.show()

Here’s the result:

The top, ragged trace is the original data (in the d list); the lower trace is the same data, detrended; the straight line is the line that is subtracted from the original data to produce the detrended data.

The lower trace would be the one that gets used by the autocorrelation function using the detrend_linear setting. (To detrend based on simply setting the mean to zero, I think all we need to do is process da-da.mean()?)
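As a sanity check on the two sorts of detrending, here's a sketch using plain numpy rather than mlab. It assumes that linear detrending means subtracting a least-squares straight line fit (which is my understanding of what detrend_linear does); the sample values are made up.

```python
import numpy as np

def remove_mean(y):
    """Shift the series so that it has zero mean."""
    y = np.asarray(y, dtype=float)
    return y - y.mean()

def remove_linear_trend(y):
    """Subtract the least-squares straight line fit from the series."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

y = [1.0, 2.1, 2.9, 4.2, 5.0, 5.8]  # made-up values
zero_mean = remove_mean(y)
detrended = remove_linear_trend(y)
print(zero_mean.mean(), detrended.mean())
```

Both results should have a mean of (near enough) zero, but only the second has had its slope removed as well.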

UPDATE: One of the problems with detrending the time series data using the linear trend is that the increasing trend doesn’t appear to start until midway through the series. Another approach to cleaning the data is to remove the mean and trend by using the first difference of the signal: d(x)=f(x)-f(x-1). It’s calculated as follows:

#time series data in d
#first difference
fd=np.diff(d)

Here’s the linearly detrended data (green) compared to the first difference of the data (blue):

Note that the length of the first difference signal is one sample less than the original data, and shifted to the left one step. (There’s presumably a numpy way of padding the head or tail of the series, though I’m not sure what it is yet!)
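On the padding question, here are a couple of possibilities, sketched on the first few values of the series. The `prepend` argument to np.diff needs a reasonably recent numpy (1.16 or later, I believe); np.concatenate works on older versions too. Whether a padded zero is the right thing to add is a judgement call I haven't tested against the autocorrelation plots.

```python
import numpy as np

d = [0.99, 0.98, 0.95, 0.93, 0.91]  # first few values of the series

# Plain first difference: one sample shorter than the input
fd = np.diff(d)

# Pad the head by repeating the first sample, so the first difference is 0
fd_head = np.diff(d, prepend=d[0])

# Or pad the tail of the difference with an explicit zero
fd_tail = np.concatenate([fd, [0.0]])

print(len(d), len(fd), len(fd_head), len(fd_tail))
```

Padding the head keeps each difference aligned with the later of the two samples it was calculated from; padding the tail keeps the left-shifted alignment.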

Here’s the autocorrelation of the first difference signal – if you refer back to the previous post, you’ll see it’s much clearer in this case:

It is possible to pass an arbitrary detrending function into acorr, but I think it needs to return an array that is the same length as the original array?

So what next? Looking at the original data, it is quite noisy, with some trends that are apparently obvious to the eye. The diff calculation is quite sensitive to this noise, so it possibly makes sense to smooth the data prior to calculating the first difference and the autocorrelation. But that’s for next time…
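In the meantime, here's one possible way the smoothing step might look: a simple moving average applied before taking the first difference. This is only a sketch of one option, run on synthetic data (a gentle ramp plus random noise), and the window size is an arbitrary choice.

```python
import numpy as np

def moving_average(y, window=5):
    """Smooth a series with an unweighted moving average (no edge padding)."""
    y = np.asarray(y, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="valid")

# Synthetic data: a slow upward ramp with noise added
rng = np.random.default_rng(0)
noisy = np.linspace(0.9, 1.2, 200) + rng.normal(0, 0.03, 200)

smoothed = moving_average(noisy, window=7)
fd_raw = np.diff(noisy)
fd_smooth = np.diff(smoothed)

# Compare the spread of the first difference before and after smoothing
print(fd_raw.std(), fd_smooth.std())
```

Using mode="valid" means the smoothed series is shorter than the original by window-1 samples, so the same length caveat applies as for the first difference.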

Written by Tony Hirst

January 15, 2011 at 5:33 pm

Posted in Analytics, Data, Visualisation


Dominant Tags in My Delicious Network

with one comment

Following on from Social Networks on Delicious, here’s a view over my delicious network (that is, the folk I “follow” on delicious) and the dominant tags they use:

The image is created from a source file generated by:

1) grabbing the list of folk in my delicious network;
2) grabbing the tags each of them uses;
3) generating a bipartite network specification graph containing user and tag nodes, with weighted links corresponding to the number of times a user has used a particular tag (i.e. the number of bookmarks they have bookmarked using that tag).

Because the original graph is a large, sparse one (many users define lots of tags but only use them rarely), I filtered the output view to show only those tags that have been used at least 150 times by any particular user, based on the weight of each edge (remember, the edge weight describes the number of times a user has used a particular tag). (So if every user had used the same tag up to, but not more than, 149 times each, it wouldn’t be displayed.) The tag nodes are sized according to the number of users who have used the tag 150 or more times.
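I did the filtering in the graph view itself, but the same logic could be sketched in Python. The example below assumes the tag counts have already been collected into a dict of the form {user: {tag: count}}; the usernames and counts are made up.

```python
from collections import Counter

def heavy_tag_edges(user_tags, min_uses=150):
    """Keep only (user, tag, count) edges where count is at least min_uses."""
    return [
        (user, tag, count)
        for user, tags in user_tags.items()
        for tag, count in tags.items()
        if count >= min_uses
    ]

def tag_sizes(edges):
    """Count how many heavy users each tag has (used to size the tag nodes)."""
    return Counter(tag for _user, tag, _count in edges)

# Made-up usernames and counts
user_tags = {
    "alice": {"opendata": 320, "python": 149},
    "bob": {"opendata": 150, "maps": 12},
    "carol": {"python": 500},
}
edges = heavy_tag_edges(user_tags)
sizes = tag_sizes(edges)
print(edges)
print(sizes)
```

Here the tag used 149 times is dropped, and "opendata" ends up with a size of 2 because two users pass the threshold for it.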

I also had a go at colouring the nodes to identify tags used heavily by a single user, compared to tags heavily used by several members of my network.

Here’s the Python code:

import urllib, simplejson

def getDeliciousUserNetwork(user,network):
  url='http://feeds.delicious.com/v2/json/networkmembers/'+user
  data = simplejson.load(urllib.urlopen(url))
  for u in data:
    network.append(u['user'])
    #time also available: u['dt']
  #print network
  return network

def getDeliciousTagsByUser(user):
  tags={}
  url='http://feeds.delicious.com/v2/json/tags/'+user
  data = simplejson.load(urllib.urlopen(url))
  for tag in data:
    tags[tag]=data[tag]
  return tags

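def printDeliciousTagsByNetwork(user,minVal=2):
  #openTimestampedFile() and gephiCoreGDFNodeHeader() are helper functions defined elsewhere (not shown here)
  f=openTimestampedFile('delicious-socialNetwork','network-tags-' + user+'.gdf')
  f.write(gephiCoreGDFNodeHeader(typ='delicious')+'\n')

  network=[]
  network=getDeliciousUserNetwork(user,network)

  #node definitions: one node per member of the network
  for user in network:
    f.write(user+','+user+',user\n')
  #edge definitions: user to tag, weighted by the number of times the user has used the tag
  f.write('edgedef> user1 VARCHAR,user2 VARCHAR,weight DOUBLE\n')
  for user in network:
    tags={}
    tags=getDeliciousTagsByUser(user)
    for tag in tags:
      #only write out tags used at least minVal times
      if tags[tag]>=minVal:
        f.write(user+',"'+tag.encode('ascii','ignore') + '",'+str(tags[tag])+'\n')
  f.close()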
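For reference, here's a self-contained sketch of the sort of GDF text the function above is aiming for, built from an in-memory dict rather than the delicious feeds. The node columns (name, label, type) are my assumption about what the gephiCoreGDFNodeHeader helper declares, and unlike the code above this version also writes out explicit tag nodes and doesn't quote or escape tag names.

```python
def to_gdf(user_tags, min_uses=2):
    """Serialise {user: {tag: count}} as GDF text for a bipartite user/tag graph."""
    users = sorted(user_tags)
    edges = [
        (user, tag, count)
        for user in users
        for tag, count in sorted(user_tags[user].items())
        if count >= min_uses
    ]
    tags = sorted({tag for _user, tag, _count in edges})

    # Node section: the column names here are an assumption
    lines = ["nodedef>name VARCHAR,label VARCHAR,type VARCHAR"]
    lines += ["{0},{0},user".format(u) for u in users]
    lines += ["{0},{0},tag".format(t) for t in tags]

    # Edge section: user, tag, and the number of times the tag was used
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    lines += ["{},{},{}".format(u, t, c) for u, t, c in edges]
    return "\n".join(lines) + "\n"

# Made-up usernames and counts
gdf_text = to_gdf({"alice": {"opendata": 320, "maps": 1}, "bob": {"opendata": 5}})
print(gdf_text)
```

A tag containing a comma, or a tag with the same name as a user, would need extra handling that this sketch leaves out.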

Looking at the network, it’s possible to see which members of my network are heavy users of a particular tag, and furthermore, which tags are heavily used by more than one member of my network. The question now is: to what extent might this information help me identify whether or not I am following people who are likely to turn up resources that are in my interest area, by virtue of the tags used by the members of my network?

Picking up on the previous post on Social Networks on Delicious, might it be worth looking at the tags used heavily by my followers to see what subject areas they are interested in, and potentially the topic area(s) in which they see me as acting as a resource investigator?

Written by Tony Hirst

January 14, 2011 at 1:47 pm
