OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Analytics’ Category

A Couple of Notes on “List Intelligence”

Just so I don’t forget the development timeline such as it is, here are a few quick notes-to-self as much as anything about my “List Intelligence” tinkering to date:

  • List Intelligence currently uses Twitter lists to associate individuals with a particular topic area (the focus of the list; note that this may be ill-specified, e.g. “people I have met”, or topic focussed, e.g. “OU employees”, etc.)
  • List Intelligence is presented with a set of “candidate members” and then:
    1. looks up the lists those candidate members are on to provide a set of “candidate lists”;
    2. identifies the membership of those candidate lists (“candidate list members”) (this set may be subject to ranking or filtering, for example based on the number of list subscribers, or the number of original candidate members who are members of the current list);
    3. for the superset of members across lists (i.e. the set of candidate list members), rank each individual according to the number of lists they are on (this may optionally be weighted by the number of subscribers to each list they are on); these individuals are potentially “key” players in the subject area defined by the lists that the original candidate members are members of;
    4. identify which of the candidate lists contains most candidate members, and rank accordingly (possibly also according to subscriber numbers); the top ranked lists are lists trivially associated with the set of original candidate members;
    5. provide output files that allow the graphing of individuals who are co-members of the same lists, and use the corresponding network as the basis for network analysis;
    6. optionally generate graphs based on friendship connections between candidate list members, and use the resulting graph as the basis for network analysis. (Any clusters/communities detected based on friendship may then be compared with the co-membership graphs to see the extent to which list memberships reflect or correlate to community structures);
  • the original set of candidate members may be defined in a variety of ways. For example:
    1. one or more named individuals;
    2. the friends of a named individual;
    3. the recent users of a particular hashtag;
    4. the recent users of a particular searched for term;
    5. the members of a “seed” list.
  • List Intelligence attempts to identify “list clusters” in the candidate lists set by detecting significant overlaps in membership between different candidate lists.
  • Candidate lists may be used to identify potential “focus of interest” areas associated with the original set of candidate members.

I’ll try to post some pseudo-code, flow charts and formal algorithms to describe the above… but it may take a week or two…
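
In the meantime, here’s a rough Python-flavoured sketch of steps 1 to 4. Treat it as pseudo-code: get_list_memberships() and get_list_members() are just placeholders for whatever Twitter API calls are used to look up the lists a user appears on and the members of a given list.

from collections import Counter

def list_intelligence(candidate_members, get_list_memberships, get_list_members):
  #1) look up the lists the candidate members are on -> candidate lists
  candidate_lists = set()
  for member in candidate_members:
    candidate_lists.update(get_list_memberships(member))

  #2) identify the membership of each candidate list -> candidate list members
  members_by_list = dict((l, set(get_list_members(l))) for l in candidate_lists)

  #3) rank candidate list members by the number of candidate lists they appear on
  member_rank = Counter()
  for members in members_by_list.values():
    member_rank.update(members)

  #4) rank candidate lists by the number of original candidate members they contain
  cm = set(candidate_members)
  list_rank = Counter(dict((l, len(members & cm))
                           for l, members in members_by_list.items()))

  return member_rank.most_common(), list_rank.most_common()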

Written by Tony Hirst

June 24, 2011 at 5:35 pm

Follower Networks and “List Intelligence” List Contexts for @JiscCetis

I’ve been tinkering with some of my “List Intelligence” code again, and thought it worth capturing some examples of the sort of network exploration recipes I’m messing around with at the moment.

Let’s take @jiscCetis as an example; this account follows no-one, is followed by a few, hasn’t much of a tweet history and is listed by a handful of others.

Here’s the follower network, based on how the followers of @jiscetis follow each other:

Friend connections between @Jisccetis followers

There are three (maybe four) clusters there, plus all the folk who don’t follow any of @jisccetis’ other followers… Do these follower clusters make any sort of sense, I wonder? (How would we label them…?)

The next thing I thought to do was look at the people who were on the same lists as @jisccetis, and get an overview of the territory that @jisccetis inhabits by virtue of shared list membership.

Here’s a quick view over the folk on lists that @jisccetis is a member of. The nodes are users named on the lists that @jisccetis is named on; the edges are undirected and join individuals who are on the same list.

Distribution of users named on lists that jisccetis is a member of

Plotting “co-membership” edges is hugely expensive in terms of upping the edge count that has to be rendered, but we can use a directed bipartite graph to render the same information (and arguably even more information); here, there are two sorts of nodes: lists, and the members of lists. Edges go from members to listnames (I should swap this direction really to make more sense of authority/hub metrics…?)

jisccetis co-list membership
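
As a rough sketch of how that sort of bipartite rendering can be put together (using networkx by way of illustration; the lists_by_member dict below is just dummy data standing in for the list membership information pulled from the Twitter API):

import networkx as nx

#dummy data: the lists each user is named on
lists_by_member = {
  'userA': ['/ownerX/edtech', '/ownerY/elearning'],
  'userB': ['/ownerX/edtech']
}

G = nx.DiGraph()
for member, lists in lists_by_member.items():
  for listname in lists:
    #two sorts of node - list members and lists - with edges running from member to list
    G.add_node(member, typ='member')
    G.add_node(listname, typ='list')
    G.add_edge(member, listname)

#hub/authority scores over the member->list graph
hubs, authorities = nx.hits(G)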

Another thing I thought I’d explore is the structure of the co-list membership community. That is, for all the people on the lists that @jisccetis is a member of, how do those users follow each other?

How folk on same lists as @jisccetis follow each other

It may be interesting to explore, in a formal way, the extent to which the community groups that appear to arise from the friending relationships are reflected (or not) in the make-up of the lists.

It would probably also be worth trying to label the follower group – are there “meaningful” (to @jisccetis? to the @jisccetis community?) clusters in there? How would you label the different colour groupings? (Let me know in the comments…;-)

Written by Tony Hirst

June 18, 2011 at 7:55 pm

Identifying the Twitterati Using List Analysis

Given absolutely no-one picked up on List Intelligence – Finding Reliable, Trustworthy and Comprehensive Topic/Sector Based Twitter Lists, here’s an example of what the technique might be good for…

Seeing the tag #edusum11 in my feed today, and not being minded to follow it, I used the list intelligence hack to see:

- which lists might be related to the topic area covered by the tag, based on looking at which Twitter lists folk recently using the tag appear on;
- which folk on twitter might be influential in the area, based on their presence on lists identified as maybe relevant to the topic associated with the tag…

Here’s what I found…

Some lists that maybe relate to the topic area (username/list, number of folk who used the hashtag appearing on the list, number of list subscribers), sorted by number of people using the tag present on the list:

/joedale/ukedtech 6 6
/TWMarkChambers/edict 6 32
/stevebob79/education-and-ict 5 28
/mhawksey/purposed 5 38
/fosteronomo/chalkstars-combined 5 12
/kamyousaf/uk-ict-education 5 77
/ssat_lia/lia 5 5
/tlists/edtech-995 4 42
/ICTDani/teched 4 33
/NickSpeller/buzzingeducators 4 2
/SchoolDuggery/uk-ed-admin-consultancy 4 65
/briankotts/educatorsuk 4 38
/JordanSkole/jutechtlets 4 10
/nyzzi_ann/teacher-type-people 4 9
/Alexandragibson/education 4 3
/danielrolo/teachers 4 20
/cstatucki/educators 4 13
/helenwhd/e-learning 4 29
/TechSmithEDU/courosalets 4 2
/JordanSkole/chalkstars-14 4 25
/deerwood/edtech 4 144

Some lists that maybe relate to the topic area (username/list, number of folk who used the hashtag appearing on the list, number of list subscribers), sorted by number of people subscribing to the list (a possible ranking factor for the list):

/deerwood/edtech 4 144
/kamyousaf/uk-ict-education 5 77
/SchoolDuggery/uk-ed-admin-consultancy 4 65
/tlists/edtech-995 4 42
/mhawksey/purposed 5 38
/briankotts/educatorsuk 4 38
/ICTDani/teched 4 33
/TWMarkChambers/edict 6 32
/helenwhd/e-learning 4 29
/stevebob79/education-and-ict 5 28
/JordanSkole/chalkstars-14 4 25
/danielrolo/teachers 4 20
/cstatucki/educators 4 13
/fosteronomo/chalkstars-combined 5 12
/JordanSkole/jutechtlets 4 10
/nyzzi_ann/teacher-type-people 4 9
/joedale/ukedtech 6 6
/ssat_lia/lia 5 5
/Alexandragibson/education 4 3
/NickSpeller/buzzingeducators 4 2
/TechSmithEDU/courosalets 4 2

Other ranking factors might include the follower count of the list maintainer, or factors derived from some sort of social network analysis of the list maintainer.

Having got a set of lists, we can then look for people who appear on lots of those lists to see who might be influential in the area. Here’s the top 10 (user, number of lists they appear on, friend count, follower count, number of tweets, time of arrival on twitter):

['terryfreedman', 9, 4570, 4831, 6946, datetime.datetime(2007, 6, 21, 16, 41, 17)]
['theokk', 9, 1564, 1693, 12029, datetime.datetime(2007, 3, 16, 14, 36, 2)]
['dawnhallybone', 8, 1482, 1807, 18997, datetime.datetime(2008, 5, 19, 14, 40, 50)]
['josiefraser', 8, 1111, 7624, 17971, datetime.datetime(2007, 2, 2, 8, 58, 46)]
['tonyparkin', 8, 509, 1715, 13274, datetime.datetime(2007, 7, 18, 16, 22, 53)]
['dughall', 8, 2022, 2794, 16961, datetime.datetime(2009, 1, 7, 9, 5, 50)]
['jamesclay', 8, 453, 2552, 22243, datetime.datetime(2007, 3, 26, 8, 20)]
['timbuckteeth', 8, 1125, 7198, 26150, datetime.datetime(2007, 12, 22, 17, 17, 35)]
['tombarrett', 8, 10949, 13665, 19135, datetime.datetime(2007, 11, 3, 11, 45, 50)]
['daibarnes', 8, 1592, 2592, 7673, datetime.datetime(2008, 3, 13, 23, 20, 1)]
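
For what it’s worth, the ranking step behind that table is little more than a count over list memberships; here’s a minimal sketch, with members_by_list standing in as dummy data for the membership of each candidate list:

from collections import Counter

#dummy data: members of each candidate list, keyed by /user/listname
members_by_list = {
  '/joedale/ukedtech': ['terryfreedman', 'theokk', 'dawnhallybone'],
  '/TWMarkChambers/edict': ['terryfreedman', 'josiefraser']
}

list_count = Counter()
for members in members_by_list.values():
  list_count.update(set(members))

#top 10 individuals by the number of candidate lists they appear on
top10 = list_count.most_common(10)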

The algorithms I’m using have a handful of tuneable parameters, which means there’s all sorts of scope for running with this idea in a “research” context…

One possible issue that occurred to me was that identified lists might actually cover different topic areas – this is something I need to ponder…

Written by Tony Hirst

June 9, 2011 at 6:55 pm

eSTEeM Project: Library Website Tracking For VLE Referrals

Assuming my projects haven’t been cut out at the final acceptance stage because I haven’t yet submitted a revised project plan, here’s an outline of one of them…

Preamble
As OU courses are increasingly presented through the VLE, many of them opt to have one or more “Library Resources” pages that contain links to course related resources either hosted on the OU Library website or made available through a Library operated web service. Links to Library hosted or moderated resources may also appear inline in course content on the VLE. However, at the current time, it is difficult to get much idea about the extent to which any of these resources are ever accessed, or how students on a course make use of other Library resources.

With the state of the collection and reporting of activity data from the VLE still evolving, this project will explore the extent to which we can make use of data I do know exists, and to which I do have access, specifically Google Analytics data for the library.open.ac.uk domain.

The intention is to produce a three-way reporting framework using Google Analytics for visitors to the OU Library website and Library managed resources from the VLE. The reports will be targeted at: subject librarians who liaise with course teams; course teams; subscription managers.

Google Analytics (to which I have access) is already running on the library website, and the matter just(?!) arises now of:

1) identifying appropriate filters and segments to capture visits from different courses;

2) developing Google Analytics API wrapper calls to capture data by course or resource based segments and enable analysis, visualisation and reporting not supported within the Google Analytics environment (a sketch of the sort of call I have in mind appears below);

3) providing a meaningful reporting format for the three audience types. (Note: we might also explore whether a view over the activity data may be appropriate for presenting back to students on a course.)
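
By way of illustration, here’s a minimal sketch of the sort of API wrapper call I have in mind, using the Google Analytics Core Reporting API via the Python client library (the profile ID and the source filter are made-up placeholders, and the authorised http object is assumed to have been set up elsewhere):

from apiclient.discovery import build  #from the google-api-python-client package

def vle_referral_report(http, profile_id, start_date, end_date):
  #http is an httplib2.Http instance already authorised against the Analytics API
  service = build('analytics', 'v3', http=http)
  return service.data().ga().get(
      ids='ga:' + profile_id,       #placeholder profile ID for the Library website
      start_date=start_date,
      end_date=end_date,
      metrics='ga:visits,ga:pageviews',
      dimensions='ga:landingPagePath',
      filters='ga:source=@vle'      #made-up filter standing in for "referred from the VLE"
  ).execute()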

The Project
The OU Library has been running Google Analytics for several years, but to my knowledge has not started to exploit the data being collected as part of a reporting strategy on the usage of library resources resulting from referrals from the VLE. (Whenever a user clicks on a link in the VLE that leads to the Library website, the Google Analytics on the Library website can capture that fact.)

At the moment, we do not tend to work on optimising our online courses as websites so that they deliver the sorts of behaviour we want to encourage. If we were a web company, we would regularly analyse user behaviour on our course websites and modify them as a result.

This project represents the first step in a web analytics approach to understanding how our students access Library resources from the VLE: reporting. The project will then provide the basis for a follow on project that can look at how we can take insight from those reports and make them actionable, for example in the redesign of the way links to library resources are presented or used in the VLE, or how visitors from the VLE are handled when they hit the Library website.

The project complements work that has just started in the Library on a JISC funded project to make journal recommendations to students based on previous user actions.

The first outcome will be a set of Google Analytics filters and advanced segments tuned to the VLE visitor traffic and resource usage on the Library website. The second will be a set of Google Analytics API wrappers that allow us to export this data and use it outside the Google Analytics environment.

The final deliverables are three report types in two possible flavours:

1) a report to subject librarians about the usage of library resources from visitors referred from the VLE for courses they look after

2) a report to librarians responsible for particular subscription databases showing how that resource is accessed by visitors referred from the VLE, broken down by course

3) a report to course teams showing how library resources linked to from the VLE for their course are used by visitors referred to those resources from the VLE.

The two flavours are:

a) Google Analytics reports

b) custom dashboard with data accessed via the Google Analytics API

Recommendations will also be made based on the extent to which Library website usage by anonymous students on particular OU courses may be tracked by other means, such as affinity strings in the SAMS cookie, and the benefits that may accrue from this more comprehensive form of tracking.

If course team members on any OU courses presenting over the next 9 months are interested in how students are using the library website following a referral from the VLE, please get in touch. If academics on courses outside the OU would like to discuss the use of Google Analytics in an educational context, I’d love to hear from you too:-)

eSTEeM is a joint initiative between the Open University’s Faculty of Science and Faculty of Maths, Computing and Technology to develop new approaches to teaching and learning both within existing and new programmes.

Written by Tony Hirst

April 13, 2011 at 11:01 am

Posted in Analytics, Library, OU2.0, Project

Current JISC Projects of Possible Interest to LAK11 Attendees

Mulling over an excellent couple of days in Banff at the first Learning Analytics and Knowledge conference (LAK11; @dougclow’s liveblog notes), where we heard about a whole host of data and analytics related activities from around the world, I thought it may be worth pulling together descriptions of several current JISC projects that are exploring related issues to add in to the mix…

There are currently at least three programmes that it seems to me are in the general area…

———————————–

Activity Data

Many systems in institutions store data about the actions of students, teachers and researchers. The purpose of this programme is to experiment with this data with the aim of improving the user experience or the administration of services.

AEIOU – Aberystwyth University – this project will gather usage statistics from the repositories of all Higher Education Institutions in Wales and use this data to present searchers who discover a paper from a Welsh repository with recommendations for other relevant papers that they may be interested in. All of this data will be gathered into a research gateway for Wales.

Agtivity – University of Manchester – this project will collect usage data from people using the Advanced Video Conferencing services supported by the Access Grid Support Centre. This data will be used to evaluate usage more accurately, in terms of the time the service is used, audience sizes and environmental impact, and will be used to drive an overall improvement in Advanced Video Conferencing meetings through more targeted support by the Access Grid Support Centre staff of potentially failing nodes and meetings.

Exposing VLE Data – University of Cambridge – a project that will bring together activity and attention data for Cambridge’s institutional virtual learning environment (based on the Sakai software) to create useful and informative management reporting including powerful visualisations. These reports will enable the exploration of improvements to both the VLE software and to the institutional support services around it, including how new information can inform university valuation of VLEs and strategy in this area. The project will also release anonymised datasets for use in research by others.

Library Impact Data – Huddersfield University – the aim of this project is to prove a statistically significant correlation between library usage and student attainment. The project will collect anonymised data from University of Bradford, De Montfort University, University of Exeter, University of Lincoln, Liverpool John Moores University, University of Salford, Teesside University as well as Huddersfield. By identifying subject areas or courses which exhibit low usage of library resources, service improvements can be targeted. Those subject areas or courses which exhibit high usage of library resources can be used as models of good practice.

RISE – Open University – As a distance-learning institution, students, researchers and academics at the Open University mainly access the rich collection of library resources electronically. Although the systems used track attention data, this data isn’t used to help users search. RISE aims to exploit the unique scale of the OU (with over 100,000 annual unique users of e-resources) by using attention data recorded by EZProxy to provide recommendations to users of the EBSCO Discovery search solution. RISE will then aim to release that data openly so it can be used by the community.

Salt – University of Manchester – SALT will experiment with 10 years of library circulation data from the John Rylands University Library to support humanities research by making underused “long tail” materials easier to find by library users. The project will also develop an api to enable others to reuse the circulation data and will explore the possibility of offering the api as a national shared service.

Shared OpenURL Data – EDINA – This is an invited proposal by JISC which takes forward the recommendations made in scoping activity related to collection and use of OpenURL data that might be available from institutional OpenURL resolvers and the national OpenURL router shared service which was funded between December 2008 – April 2009 by JISC. The work will be done in two stages: an initial stage exploring the steps required to make the data available openly, followed by making the data available and implementation of prototype service(s) using the data.

STAR-Trak – Leeds Metropolitan University – This project will provide an application (STAR-Trak:NG) to highlight and manage interventions with students who are at risk of dropping out, identified primarily by mining student activity data held in corporate systems.

UCIAD – Open University – UCIAD will investigate the use of semantic technologies for integrating user activity data from different systems within a University. The objective is to scope and prototype an open, pluggable software framework based on such semantic models, aggregating logs and other traces from different systems as a way to produce a comprehensive and meaningful overview of the interactions between individual users and a university.

See also:

The JISC RAPTOR project is investigating ways to explore usage of e-resources.

PIRUS is a project investigating the extension of Counter statistics to cover article level usage of electronic journals.

The Journal Usage Statistics Portal is a project that is developing a usage statistics portal for libraries to manage statistics about electronic journal usage.

The Using OpenURL activity data project will take forward the recommendations of the Shared OpenURL Data Infrastructure Investigation to further explore the value and viability of releasing OpenURL activity data for use by third parties as a means of supporting development of innovative functionality that serves the UK HE community.

The major influences on the Activity Data programme have been the JISC Mosaic project (final report) and the Gaining Intelligence event (final report).

———————————–

Business Intelligence

The Business Intelligence Programme is funded by JISC’s Organisational Support committee in line with its aim to work with managers to enhance the strategic management of institutions and has funded projects to further explore the issues encountered within institutions when trying to progress BI. (See also JISC’s recently commissioned study into the information needs of senior managers and current attitudes towards and plans for BI.)

Enabling Benchmarking Excellence – Durham University – This project proposes to gather a set of metadata from Higher Education institutions that will allow the current structures within national data sets to be mapped to department structures within each institution. The eventual aim is to make comparative analysis far more flexible and useful to all stakeholders within the HE community. This is the first instance where such a comprehensive use of meta-data to tie together disparate functional organisations has been utilised within the sector, making the project truly innovative.

BIRD – Business Intelligence Reporting Dashboard – UCLAN – Using the JISC InfoNet BI Resource for guidance, this project will work with key stakeholders to re-define the processes that deliver the evidence base to the right users at the right time and will subsequently develop the BI system using Microsoft SharePoint to deliver the user interface (linked to appropriate data sets through the data warehouse). We will use this interface to simplify the process for requesting data/analysis and will provide personalisation facilities to enable individuals to create an interface that provides the data most appropriate to their needs.

Bolt-CAP – University of Bolton – Using the requirements of HEFCE TRAC as the base model, the JISC Business Intelligence Infokit and an Enterprise Architecture approach, this project will consider the means by which effective data capture, accumulation, release and reuse can both meet the needs of decision support within the organisation and that of external agencies.

Bringing Corporate Data to Life – University of East London – The aim of the project is to make use of the significant advances in software tools that utilise in-memory technologies for the rapid development of three business intelligence applications (Student Lifecycle, Corporate Performance and Benchmarking). Information in each application will be presented using a range of fully interactive dashboards, scorecards and charts with filtering, search and drill-down and drill-up capabilities. Managers will be engaged throughout the project in terms of how information is presented, the design of dashboards, scorecards and reports and the identification of additional sources of data.

Business Intelligence for Learning About Our Students – University of Sheffield – The goal of this project is to develop a methodology which will allow the analysis of the data in an aggregate way, by integrating information in different archives and enabling users to query the resulting archive knowledge base from a single point of access. Moreover we aim to integrate the internal information with publically available data on socio-economic indicators as provided by data.gov.uk. Our aims are to study, on a large scale, how student backgrounds impact their future academic achievements and to help the University devise evidence informed policies, strategies and procedures targeted to their students.

Engage – Using Data about Research Clusters to Enhance Collaboration – University of Glasgow – The Engage project will integrate, visualise and automate the production of information about research clusters at the University of Glasgow, thereby improving access to this data in support of strategic decision making, publicity, enhancing collaboration and interdisciplinary research, and research data reporting.

IN-GRiD – University of Manchester, Manchester Business School – The project addresses the process of collection, management and analysis of building profile data, building usage data, energy consumption data, room booking data, IT data and the corresponding financial data in order to improve the financial and environmental decision making processes of the University of Manchester through the use of business intelligence. The main motivation for the project is to support decision making activities of the senior management of the University of Manchester in the area of sustainability and carbon emissions management.

Liverpool University Management Information System (LUMIS) – Liverpool University – The University has identified a need for improved Management Information to support performance measurement and inform decision making. MI is currently produced and delivered by a variety of methods including standalone systems and spreadsheets. … The objectives of LUMIS are to design and implement an MI solution, combining technology with data integrity, business process improvement and change management to create a range of benefits.

RETAIN: Retaining Students Through Intelligent Interventions – Open University – The focus will be on using BI to improve student retention at the Open University. RETAIN will make it possible to: include additional datasources with existing statistical methods; use predictive modelling to identify ‘at risk’ students.

Supporting institutional decision making with an intelligent student engagement tracking system – University of Bedfordshire – This project aims to demonstrate how the adoption of a student engagement tracking system (intelligent engagement) can support and enhance institutional decision making with evidence in three business intelligence (BI) data subject categories: student data and information, performance measurement and management, and strategic planning.

Visualisation of Research Strength (VoRS) – University of Huddersfield – Many HEIs now maintain repositories containing their researchers’ publications. They have the potential to provide much information about the research strength of an HEI, as publications are the main output of research. The project aims to merge internal information extracted from an institution’s publications repository with external information (academic subject definitions, quality of outlets and publications), for input to a visualisation tool. The tool will assist research managers in making decisions which need to be based on an understanding of research strengths across subject areas, such as where to aim internal investment. In the event that the tool becomes a part of a BI resource, it could lead to institution vs institution comparisons and visual benchmarking for research.

———————————–

Infrastructure for Resource Discovery

(IMHO, if resource recommendation can be improved by the application of “learning analytics”, we’ll be needing metadata used to describe those resources as well as the activity data generated around their use…)

In 2009 JISC and RLUK convened a group of Higher Education library, museum and archive experts to think about what national services were required for supporting online discovery and reuse of collection metadata. This group was called the resource discovery taskforce (RDTF) and … produced a vision and an implementation plan focused on making metadata about collections openly available therefore supporting the development of flexible and innovative services for end users. … This programme of projects has been funded to begin to address the challenges that need to be overcome at the institutional level to realise the RDTF vision. The projects are focused on making metadata about library, museum and archive collections openly available using standards and licensing that allows that data to be reused.

Comet – Cambridge University – The COMET project will release a large sub-set of bibliographic data from Cambridge University Library catalogues as open structured metadata, testing a number of technologies and methodologies including XML, RDF, SPARQL and JSON. It will investigate and document the availability of metadata for the library’s collections which can be released openly in machine-readable formats and the barriers which prevent other data from being exposed in this way. [Estimated amount of data to be made available: 2,200,000 metadata records]

Connecting repositories – Open University – The CORE project aims to make it easier to navigate between relevant scientific papers stored in Open Access repositories. The project will use Linked Data format to describe the relationships between papers stored across a selection of UK repositories, including the Open University Open Research Online (ORO). A resource discovery web-service and a demonstrator client will be provided to allow UK repositories to embed this new tool into their own repository. [Estimated amount of data to be made available: Content of 20 repositories, 50,000 papers, 1,000,000 rdf triples]

Contextual Wrappers – Cambridge University – The project is concerned with the effectiveness of resource discovery based on metadata relating to the Designated collections at the Fitzwilliam Museum in the University of Cambridge and made available through the Culture Grid, an aggregation service for museums, libraries and archives metadata. The project will investigate whether Culture Grid interface and API can be enhanced to allow researchers to explore hierarchical relationships between collections and the browsing of object records within a collection [Estimated amount of data to be made available: 164,000 object records (including 1,000 new/enhanced records), 74,800 of them with thumbnail images for improved resource discovery]

Discovering Babel – Oxford University – The digital literary and linguistic resources in the Oxford Text Archive and in the British National Corpus have been available to researchers throughout the world for several decades. This project will focus on technical enhancements to the resource discovery infrastructure that will allow wider dissemination of open metadata, will facilitate interaction with research infrastructures and the knowledge and expertise achieved will be shared with the community. [Estimated amount of data to be made available: 2,000 literary and linguistic resources in electronic form]

Jerome – University of Lincoln – Jerome began in the summer of 2010, as an informal ‘un-project’, with the aim of radically integrating data available to the University of Lincoln’s library services and offering a uniquely personalised service to staff and students through the use of new APIs, open data and machine learning. This project will develop a sustainable, institutional service for open bibliographic metadata, complemented with well documented APIs and an ‘intelligent’, personalised interface for library users. [Estimated amount of data to be made available: ~250,000 bibliographic record library catalogue, along with constantly expanding data about our available journals and their contents augmented by the Journal TOCs API, and c.3,000 additional records from our EPrints repository]

Open Metadata Pathfinder – King’s College London – The Open Metadata Pathfinder project will deliver a demonstrator of the effectiveness of opening up archival catalogues to widened automated linking and discovery through embedding RDFa metadata in Archives in the M25 area (AIM25) collection level catalogue descriptions. It will also implement as part of the AIM25 system the automated publishing of the system’s high quality authority metadata as open datasets. The project will include an assessment of the effectiveness of automated semantic data extraction through natural language processing tools (using GATE) and measure the effectiveness of the approach through statistical analysis and review by key stakeholders (users and archivists).

Salda – Sussex University – The project will extract the metadata records for the Mass Observation Archive from the University of Sussex Special Collection’s Archival Management System (CALM) and convert them in to Linked Data that will be made publicly available. [Estimated amount of data to be made available: This project will concentrate on the largest archival collection held within the Library, the Mass Observation Archive, potentially creating up to 23,000 Linked Data records.]

OpenArt – York University – OpenART, a partnership between the University of York, the Tate and technical partners, Acuity Unlimited, will design and expose linked open data for an important research dataset entitled “The London Art World 1660-1735”. Drawing on metadata about artists, places and sales from a defined period of art history scholarship, the dataset offers a complete picture of the London art world during the late 17th and early 18th centuries. Furthermore, links drawn to the Tate collection and the incorporation of collection metadata will allow exploration of works in their contemporary locations. The process will be designed to be scalable to much richer and more varied datasets, both at York, Tate and beyond.

See also:
- Linked Open Copac Archives Hub
- Linking University Content for Education and Research Online
- Openbib

I need to find a way of representing the topic areas and interconnections between these projects somehow!

See also this list of projects in the above programmes [JSON] which may be a useful starting point if you need a list of project IDs. I think the short name attribute can be used to identify the project description HTML page name at the end of an appropriate programme path?

Written by Tony Hirst

March 3, 2011 at 12:50 am

Posted in Analytics, Anything you want

Corporate Data Analyst and Online Comms Jobs at the OU

Though I’m sure these sorts of job have been advertised for years, it’s interesting tracking how they’re being represented at the moment, and the sorts of skills required.

Corporate Data and MI Analyst, Marketing (£29,853 – £35,646)

Main Purpose of the Post:
The post holder is a member of the Campaign Planning and Data team and will be required to play a pro-active role in that team, balancing the needs and recommendations of their own areas of responsibility with the wider needs and priorities of the team and the whole Marketing & Sales Unit.

This post has been constructed to assist the University to develop its marketing capacity so that challenging targets can be met. It will be essential for the post holder to work to harness the energies of academic and academic related staff in the University’s academic units, service units and regions to develop a more effective marketing strategy. This will require influencing and networking skills and an ability to adapt engagement style to an academic context.

The post holders work within a team producing Campaign Plans for both new and continuing students. The plans drive the allocation of over £10M of promotional activities (acquisition and retention campaigns).

Description of Duties of the Post:

Contribute to optimising the University’s customer targeting capability via regular reappraisal of segmentation policy with a view to increasing market share in high yield segments
Contribute to development and delivery of robust models, tools, skills and resources to enable segmentation, competitor and market analysis and data mining within the Campaign Planning Team and more widely within Marketing and Sales.

Planning 60%
Input into overall marketing plans and support planning process.
Segment the prospect data mart by developing key prospect indicators to provide Response, Reservation, Registration, Retention and other key metric predictions for each.
Support quantification of product performance predictions to provide Response, Reservation, Registration, Retention and other key metric predictions for each.
Maintain and contribute to development of a targeting model, which overlays product performance predictions/actual by segment over the agreed marketing plan to provide a targeting matrix.
Communicate targeting matrix to stakeholders and overlay tests and current campaign activity to provide an agreed campaign plan based on minimising Cost per Registration and maximising marketing mix and integration strategy.
Monitor performance daily and update segmentation, product and targeting models to maintain a data driven test and learn cycle. Identify significant deviations from forecast and potential actions.
Continually review the Customer Journey through input into creation of a Retention model based on a balanced scorecard approach. Work with key stakeholders to prioritise and implement developments.
Input into model validation and quality control.

Data 20%
Support development of a marketing data mart to primarily support marketing analysis and campaign execution.
Provide input into marketing data developments encouraging sharing of data and best practise.
Support development of in-house tools and processes to improve marketing analysis and campaign execution, primarily SAS and SIEBEL. Support other areas in evaluating tools and systems.
Where appropriate, maintain the relationship with OU data providers ensuring relevant data processing, development, quality and SLA’s are controlled.
Promote data use within marketing and other OU areas, maximising the use of data and providing a hub for data developments to be controlled

MI 20%
Input into development of key performance measures to be used across the OU.
Develop relationships with key OU stakeholders to ensure common goals are met.
Facilitate the use of marketing data across the OU and develop tools to support.
Support data focused research and tests with analytical input.
Input into development and maintenance of campaign performance measures.

Person Spec – Essential
Substantial experience in a campaign planning, analysis or similar role including, for example: campaign execution, data extraction, the development of data infrastructure.
Experience of Direct Marketing.
Experience of B2B and / or B2C marketing.
Experience of data propensity and segmentation modelling.
A balance of marketing analysis and technical skills, including data quality and protection.
Experience of test and learn data driven analysis, targeting processes and systems;
Proven ability to see trends in data and drill down to issues or key data.
Proven ability to develop relationships with key decision makers and stakeholders.
Proven ability to translate marketing requirements into planning / execution requirements.
Excellent presentation and facilitation skills.
Provide analytic support and direction to colleagues to ensure understanding.
Proven ability to meet challenging deadlines without compromising quality.

Still no adverts* for a “learning data analyst” though, tasked with analysing data to see:

- whether effective use is being made of linked-to resources, particularly subscription Library content and open educational resources;

- whether there’s anything in the student activity data and/or social connection data we can use to predict attainment and/or satisfaction levels or improve retention levels.

* That said, IET do collect lots of stats, and I think a variety of stats are now available relating to activity on the Moodle VLE. I’m not sure who does what with all that data though…?

PS I wonder if any of the analysts that companies like Pearson presumably employ look to model ways of maximising the profitability, to those companies, of student acquisition and retention, given education is their business? (See also: Apollo Group results – BPP and University of Phoenix, Publishing giant Pearson looks set to offer degrees).

PPS This job ad may also be of interest to some? Online Communications Officer, Open University Business School (£29,853 – £35,646)

Again, it’s interesting to mark what’s expected…

This brand new role in the School will drive the development of online communications. Focusing on increasing engagement and traffic through the website, you will ensure this work is appropriately integrated into the wider work of the University’s Online Services, Marketing and Communications teams. Reporting to the Director of Business Development and External Affairs, it will be your responsibility to develop the website including content, usability, optimisation, interactivity and driving increased visitor numbers and online registration. You will continually find new and inventive ways to engage with our stakeholders and promote the reputation of the Business School through the online channel.

Your responsibilities will also extend to the School’s virtual presence through social networks, iTunes U and YouTube and utilise these channels to our advantage. You will increase our presence as well as delivering virtual campaigns to improve the overall student numbers. In this role, it will also be your responsibility to develop relationships with other areas of the University engaged in this work and will play a key role in the management of these relationships.

Summary of Duties
The main duties of the Online Communications Officer are detailed below.

• Advance the social media strategy ensuring it is in line with the University’s media position, market response and the development of new technology.
• Manage the online activity of the Business School’s social media communities
• Liaise, as appropriate, with units within The Open University, such as Online Services to keep up to date with policy changes and AACS regarding technical developments.
• Liaise with the Business School’s Information Officer for the maintenance and feeding of the Research Database into the website
• Generate assets to host on the website e.g. an Elluminate Demo Video
• Keep abreast of trends and developments to ensure that the Business School’s online presence remains at the forefront
• Work alongside Online Services, to monitor the visitor traffic of the website and establish appropriate and effective KPIs for dissemination across the Business School, for example through the creation of a dashboard.
• Engage in personal development based on organisational needs and developments to foster a high level of professional skills and technical ability
• Ensure that corporate branding and media guidelines are adhered to
• Understand and appreciate internal procedures and standards and be proactive in recommending improvements
• Ability to apply best professional practice to deliver effective solutions that take into account technical, budgetary and other project considerations
• Edit the content of both the internet and intranet
• Collate, interpret and select key information for dissemination on the latest trends and research in social media both within the OU and externally
• Produce graphics where necessary or liaise with designers in the University or outside agencies to produce graphics.
• Create/collate digital assets including audio and video files
• Post moderate discussion forums
• Disseminate best practice through a variety of communications channels eg project website, OU Life news, brief updates etc.
• Develop and maintain awareness of different audience needs in relation to appropriate communications channels (eg email, screensaver, website, print).
• Act as a flexible member of the Business Development and External Affairs team.
• Carry out other tasks as specified from time to time by the Project Director

Related: Joanne Jacobs’ Are you social or anti-social?: “How to employ a Social Media Strategist, and how you should measure their performance. (Social media isn’t going away. But some Social Media Strategists should go away.)”

Written by Tony Hirst

January 27, 2011 at 11:51 am

Posted in Analytics, Jobs

Matplotlib: Detrending Time Series Data

Reading the rather wonderful Data Analysis with Open Source Tools (if you haven’t already got a copy, get one… NOW…), I noticed a comment that autocorrelation “is intended for time series that do not exhibit a trend and have zero mean”. Doh! Doh! And three times: doh!

I’d already come to the same conclusion, pragmatically, in Identifying Periodic Google Trends, Part 1: Autocorrelation and Improving Autocorrelation Calculations on Google Trends Data, but now I’ll be remembering this as a condition of use;-)

One issue I had come across was how to plot a copy of the mean zero and/or detrended data, as calculated using matplotlib, directly. (I’d already worked out how to use the detrend_ filters in the autocorrelation function.)

The problem I had was that simply trying to plot mlab.detrend_linear(y), as applied to a list of values y, threw an error (“AttributeError: ‘list’ object has no attribute ‘mean’”). It seems that detrend expects y=[1,2,3] to have a method y.mean(); which it doesn’t, normally…

The trick appears to be that matplotlib prefers to work with something like a numpy array, rather than a simple list, because the array type offers these additional methods. But how should the data be structured? A simple Google wasn’t much help, but a couple of cribs suggested that casting the list with y=np.array(y) (where import numpy as np) might be a good idea.

So let’s try it:

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np

label='run'
d=[0.99,0.98,0.95,0.93,0.91,0.93,0.92,0.95,0.95,0.94,0.96,0.98,0.97,1.00,1.01,1.05,1.06,1.06,0.98,0.98,0.98,0.97,0.96,0.93,0.93,0.96,0.95,1.05,0.97,0.95,1.01,1.02,0.98,1.01,0.98,1.00,1.06,1.04,1.06,1.04,0.97,0.94,0.92,0.90,0.87,0.88,0.85,0.90,0.91,0.87,0.88,0.88,0.91,0.91,0.88,0.91,0.92,0.91,0.90,0.92,0.87,0.92,0.92,0.92,0.94,0.97,0.99,1.01,1.01,1.04,0.97,0.94,0.98,0.94,0.98,0.91,0.93,0.92,0.95,1.00,0.93,0.93,0.96,0.96,0.96,0.97,0.95,0.95,1.06,1.12,1.01,1.00,0.99,0.98,0.96,0.93,0.91,0.92,0.92,0.94,0.94,0.94,0.90,0.86,0.89,0.93,0.90,0.90,0.90,0.90,0.89,0.92,0.91,0.92,0.93,0.93,0.94,0.99,0.98,0.99,1.01,1.06,1.06,0.96,0.98,0.92,0.92,0.93,0.91,0.90,0.93,1.02,0.90,0.93,0.91,0.93,0.95,0.93,0.91,0.92,0.96,0.93,1.02,1.02,0.91,0.88,0.87,0.87,0.84,0.82,0.82,0.84,0.83,0.85,0.80,0.80,0.87,0.85,0.83,0.80,0.84,0.83,0.84,0.88,0.83,0.88,0.88,0.86,0.91,0.93,0.91,0.97,0.96,1.00,1.01,0.98,0.94,0.97,0.94,0.95,0.92,0.93,0.97,1.02,0.95,0.92,0.91,0.95,0.93,0.94,0.91,0.92,0.98,0.99,0.97,0.98,0.90,0.86,0.87,0.91,0.87,0.86,0.86,0.89,0.89,0.87,0.86,0.83,0.85,0.86,0.90,0.87,0.87,0.90,0.89,0.93,0.93,0.97,0.99,0.95,1.00,1.05,1.03,1.04,1.08,1.05,1.05,1.05,1.05,1.01,1.07,1.02,1.02,1.04,1.00,1.04,1.17,1.03,1.01,1.02,1.05,1.06,1.05,0.99,1.07,1.03,1.05,1.07,1.04,0.97,0.94,0.97,0.93,0.94,0.96,0.96,1.04,1.05,1.04,0.96,1.00,1.04,1.01,1.00,0.99,0.99,0.99,1.03,1.05,1.02,1.06,1.07,1.04,1.16,1.19,1.12,1.18,1.19,1.16,1.12,1.12,1.09,1.12,1.11,1.12,1.06,1.05,1.14,1.26,1.09,1.12,1.13,1.16,1.18,1.22,1.17,1.24,1.28,1.35,1.19,1.16,1.11,1.11,1.13,1.13,1.11,1.09,1.06,1.07,1.09,1.09,1.03,1.05,1.04,1.04,1.03,1.03,1.06,1.09,1.17,1.12,1.11,1.14,1.20,1.18,1.24,1.19,1.21,1.22,1.22,1.27,1.25,1.18,1.15,1.18,1.17,1.11,1.09,1.10,1.12,1.26,1.15,1.15,1.16,1.16,1.15,1.12,1.15,1.14,1.20,1.31,1.17,1.18,1.14,1.15,1.14,1.12,1.17,1.11,1.10,1.11,1.14,1.10,1.08,1.06]

fig = plt.figure()
da=np.array(d)  #cast the list to a numpy array so mlab.detrend_linear can call .mean() on it

ax1 = fig.add_subplot(111)

#original data (the top, ragged trace)
ax1.plot(da)

#linearly detrended copy of the data (the lower trace)
y = mlab.detrend_linear(da)
ax1.plot(y)

#the linear trend removed from the data (the straight line)
ax1.plot(da-y)

plt.show()

Here’s the result:

The top, ragged trace is the original data (in the d list); the lower trace is the same data, detrended; the straight line is the line that is subtracted from the original data to produce the detrended data.

The lower trace would be the one that gets used by the autocorrelation function using the detrend_linear setting. (To detrend based on simply setting the mean to zero, I think all we need to do is process da-da.mean()?)

UPDATE: One of the problems with detrending the time series data using the linear trend is that the increasing trend doesn’t appear to start until midway through the series. Another approach to cleaning the data is to remove the mean and trend by using the first difference of the signal: d(x)=f(x)-f(x-1). It’s calculated as follows:

#time series data in d
#first difference
fd=np.diff(d)

Here’s the linearly detrended data (green) compared to the first difference of the data (blue):

Note that the length of the first difference signal is one sample less than the original data, and shifted to the left one step. (There’s presumably a numpy way of padding the head or tail of the series, though I’m not sure what it is yet!)
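
One way of doing the padding that seems to work is simply to repeat the first value of the differenced series at its head so the lengths line up again; a minimal sketch (there may well be a neater numpy idiom):

import numpy as np

d = [0.99, 0.98, 0.95, 0.93, 0.91, 0.93, 0.92]  #stand-in for the full time series
fd = np.diff(d)

#pad the head of the first difference series so it is the same length as the original data
fd_padded = np.insert(fd, 0, fd[0])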

Here’s the autocorrelation of the first difference signal – if you refer back to the previous post, you’ll see it’s much clearer in this case:

It is possible to pass an arbitrary detrending function into acorr, but I think it needs to return an array that is the same length as the original array?

So what next? Looking at the original data, it is quite noisy, with some trends that are apparently obvious to the eye. The diff calculation is quite sensitive to this noise, so it possibly makes sense to smooth the data prior to calculating the first difference and the autocorrelation. But that’s for next time…

Written by Tony Hirst

January 15, 2011 at 5:33 pm

Posted in Analytics, Data, Visualisation

Social Networks on Delicious

One of the many things that the delicious social networking site appears to have got wrong is how to gain traction from its social network. As well as the incidental social network that arises from two or more different users using the same tag or bookmarking the same resource (for example, Visualising Delicious Tag Communities Using Gephi), there is also an explicit social network constructed using an asymmetric model similar to that used by Twitter: specifically, you can follow me (become a “fan” of me) without my permission, and I can add you to my network (become a fan of you, again without your permission).

Realising that you are part of a social network on delicious is not really that obvious though, nor is the extent to which it is a network. So I thought I’d have a look at the structure of the social network that I can crystallise out around my delicious account, by:

1) grabbing the list of my “fans” on delicious;
2) grabbing the list of the fans of my fans on delicious and then plotting:
2a) connections between my fans and their fans who are also my fans;
2b) all the fans of my fans.

(Writing “fans” feels a lot more ego-bollox than writing “followers”; is that maybe one of the nails in the delicious social SNAFU coffin?!)

Here’s the way my “fans” on delicious follow each other (maybe? I’m not sure if the fans call always grabs all the fans, or whether it pages the results?):

(The network is plotted using Gephi, of course; nodes are coloured according to modularity clusters, and the layout is derived from a Force Atlas layout.)

Here’s the wider network – that is, showing fans of my fans:

In this case, nodes are sized according to betweenness centrality and coloured according to in-degree (that is, the number of my fans who have these people as fans). [This works in so far as we're trying to identify reputation networks. If we're looking for reach in terms of using folk as a resource discovery network, it would probably make more sense to look at the members of my network, and the networks of those folk...]

If you want to try to generate your own, here's the code:

import simplejson
import urllib
import time

#openTimestampedFile() and gephiCoreGDFNodeHeader() are helper functions from my other scripts:
#they open a timestamped output file and return a minimal Gephi GDF nodedef header respectively.

def getDeliciousUserFans(user,fans):
  url='http://feeds.delicious.com/v2/json/networkfans/'+user
  #needs paging? or does this grab all the fans?
  data = simplejson.load(urllib.urlopen(url))
  for u in data:
    fans.append(u['user'])
    #time also available: u['dt']
  #print fans
  return fans

def getDeliciousFanNetwork(user):
  f=openTimestampedFile("fans-delicious","all-"+user+".gdf")
  f2=openTimestampedFile("fans-delicious","inner-"+user+".gdf")
  f.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  f2.write(gephiCoreGDFNodeHeader(typ="min")+"\n")
  f2.write("edgedef> user1 VARCHAR,user2 VARCHAR\n")
  fans=[]
  fans=getDeliciousUserFans(user,fans)
  for fan in fans:
    time.sleep(1)
    fans2=[]
    print "Fetching data for fan "+fan
    fans2=getDeliciousUserFans(fan,fans2)
    for fan2 in fans2:
      f.write(fan+","+fan2+"\n")
      if fan2 in fans:
        f2.write(fan+","+fan2+"\n")
  f.close()
  f2.close()

So what"s the next step...?!

Written by Tony Hirst

January 13, 2011 at 2:58 pm

Posted in Analytics

Improving Autocorrelation Calculations on Google Trends Data

In Identifying Periodic Google Trends, Part 1: Autocorrelation, I described how to calculate the autocorrelation statistic for Google Trends data using matplotlib. One of the hacks that I found was required in order to calculate an informative autocorrelogram was to subtract the mean signal value from the original signal before running the calculation.

A more pathological situation occurs in the following case, using the Google Trends data for “run”:

Visual inspection of the original trend data suggests there is annual periodicity (note to self: learn how to add vertical gridlines at required points using matplotlib;-):

However, the autocorrelogram does not detect the periodicity for two reasons: firstly, as with the previous cases, the non-zero mean value of the original time series data means the periodic excursions are attenuated in the autocorrelation calculation compared to excursions from a zero mean; and secondly, the increasing trend of the data adds further confusion to the year on year comparisons used in the autocorrelation calculation.

Googling around “remove trend” and “matplotlib” turned up a detrend function that looked like it could help clean the data used for the autocorrelation calculation. In fact, a detrend argument is mentioned in the acorr autocorrelation function documentation, although no details of the values it can take are provided there. However, searching the rest of that documentation page for detrend does turn up valid values for the argument: detrend=mlab.detrend_mean, mlab.detrend_linear or mlab.detrend_none (where import matplotlib.mlab as mlab).
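
To make the usage concrete, here’s a minimal sketch of the sort of call I mean (the values in y are just dummy numbers standing in for the weekly search volume series exported from Google Trends):

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np

#dummy data standing in for the exported Google Trends search volume series
y = np.array([1.02, 0.98, 1.05, 1.10, 0.97, 0.97, 1.01, 1.08, 1.15, 1.02, 0.99, 1.03])

fig = plt.figure()
ax = fig.add_subplot(111)
#detrend can be mlab.detrend_none (the default), mlab.detrend_mean or mlab.detrend_linear
ax.acorr(y, detrend=mlab.detrend_linear, normed=True, usevlines=True, maxlags=None)
plt.show()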

If we set the detrend processor to mlab.detrend_mean we get the following:

And with detrend set to mlab.detrend_linear we get:

In each of these latter two cases, we see evidence of the 52 week correlation (i.e. annual periodicity).

FWIW, here’s the gist for the modified code.

Written by Tony Hirst

January 6, 2011 at 10:51 am

Posted in Analytics, Data

Identifying Periodic Google Trends, Part 1: Autocorrelation

One of the many things we’re all pretty good at, partly because of the way we’re wired, is spotting visual patterns. Take the following image, for example, which is taken from Google Trends and shows relative search volume for the term “flowers” over the last few years:

The trend shows annual periodic behaviour (the same thing happens every year), with a couple of significant peaks showing heavy search volumes around the term on two separate occasions, a lesser blip between them and a small peak just before Christmas; can you guess what these occasions relate to?;-) The data itself can be downloaded in a tatty csv file from the link at the bottom left of the page (tatty because several distinct CSV data sets are contained in the CSV file, separated by blank lines.) The sampling frequency is once per week.

The flowers trace actually holds a wealth of secrets – behaviours vary across UK and the US, for example – but for now I’m going to ignore that detail (I’ll return to it in a later post). Instead, I’m just going to (start) asking a very simple question – can we automatically detect the periodicity in the trend data?

Way back when, my first degree was electronics. Many of the courses I studied related to describing in mathematical terms the structure of “systems” and the analysis of the structure of signals; ideal grounding for looking at time series data such as the Google Trends data, and web analytics data.

Though I’ve since forgotten much of what I studied then, I can remember the names of many of the techniques and methods, if not how to apply them. So one thing I intend to do over the next quarter is something of a refresher in signal processing/time series analysis (which is to say, I would appreciate comments on at least three counts: firstly, if I make a mistake, please feel free (you are obliged, even) to point it out; secondly, if I’m missing a trick, or there’s an alternative/better way of achieving a similar or better end, please point it out; thirdly, the approach I take will be rediscovering the electronics/engineering take on this sort of analysis. Time series analysis is also widely used in biology, economics etc etc, though the approach or interpretation taken in different disciplines may be different* – if you can help bridge my (lack of) engineering understanding with a biological or economic perspective/interpretation, please do so;-)

(*I discovered this in my PhD, when I noticed that the equations used to describe evolution in genetic populations in discrete and continuous models were the same as equations used to describe different sorts of low pass filters in electronics; which means that under the electronics inspired interpretation of the biological models, we could by inspection say populations track low frequency components (components with a periodicity over 10s of generations) and ignore high frequency components. The biologists weren’t interested…)

To start with, let’s consider the autocorrelation of the trend data. Autocorrelation measures the extent to which a signal is correlated with (i.e. similar to) itself over time. Essentially, it is calculated by summing, for each timeshift, the products of the signal’s sample values with those of a timeshifted copy of itself. (Wikipedia is as good as anywhere to look up the formal definition of autocorrelation.)
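
For the record, the calculation boils down to something like the following minimal numpy sketch (the toy signal here just stands in for a weekly search volume series):

import numpy as np

#toy signal standing in for weekly search volume values exported from Google Trends
f = np.array([1.0, 1.4, 0.9, 0.6, 1.1, 1.5, 0.8, 0.7, 1.0, 1.3, 0.9, 0.6])

#autocorrelation: for each timeshift (lag), sum the products of f with a shifted copy of itself
ac = np.correlate(f, f, mode='full')

#normalise so the zero-lag value is 1; lags run from -(N-1) to +(N-1)
ac = ac / ac[len(f) - 1]
lags = np.arange(-(len(f) - 1), len(f))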

I used Python’s matplotlib library to calculate the autocorrelation, using this gist. The numbers in the array are the search volume values exported from Google Trends.

The top trace shows the original time series data – in this case the search volume (arbitrary units) of the term “flowers” over the last few years, with a sample frequency of once per week.

The second trace is the autocorrelation, over all timeshifts. Whilst there appear to be a couple of peaks in the data, it’s quite hard to read, because the variance of the original signal is not so great. Most of the time the signal value is close to 1, with occasional excursions away from that value. However, if we subtract the average signal value from the original signal (finding g(t)=f(t)-MEAN(f)) and then run the autocorrelation function, we get a much more striking view of the autocorrelation of the data:

(if I’ve been really, really naughty doing this, please let me know; I also experimented with subtracting the minimum value to set the floor of the signal to 0;-)

A couple of things are worth noticing: firstly, the autocorrelation is symmetrical about the origin; secondly, the autocorrelation pattern repeats every 52 weeks (52 timeshifted steps)… Let’s zoom in a bit by setting the maxlags value in the script to 53, so we can focus on the autocorrelation values over a 52 week period:

So – what does the autocorrelogram(?) tell us? Firstly, there is a periodicity over the course of the year. Secondly, there appear to be a couple of features 12 weeks or so apart (subject to a bit of jitter…). That is, there is a correlation between f(t) and f(t+12), as well as f(t) and f(t-40) (where 40=52-12…)

Here’s another trend – turkey:

Again, the annual periodicity is detected, as well as a couple of features that are four weeks apart…

How about a more regular trend – full moon perhaps?

This time, we see peaks 4 weeks apart across the year – the monthly periodicity has been detected.

Okay – that’s enough for now… there are three next steps I have in mind: 1) to have a tinker with the Google Analytics data export API and plug samples of Googalytics time series data into an autocorrelation function to see what sorts of periodic behaviour I can detect; 2) find how to drive some Fourier Transform code so I can do some rather more structured harmonic analysis on the time series data; 3) blog a bit about linear systems, and show how things like the “flowers” trend data is actually made up of several separate, well-defined signals.

But first… marking:-(

PS here’s a great review of looking at time series data for a search on “ebooks” using Google Insights for Search data using R: eBooks in Education – Looking at Trends

Written by Tony Hirst

January 5, 2011 at 5:32 pm
