OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Anything you want’ Category

From Front Running Algorithms to Bot Fraud… Or How We’ve Lost Control of the Bits…

I’ve just finished reading Michael Lewis’ Flash Boys, a cracking read about algorithmic high frequency trading and how the code and communication systems that contribute to the way stock exchanges operate can be gamed by front-running bots. (For an earlier take, see also Scott Patterson’s Dark Pools; for more “official” takes, see things like the SEC’s regulatory ideas response to the flash crash of May 6, 2010, an SEC literature review on high frequency trading, or this Congressional Research Service report on High-Frequency Trading: Background, Concerns, and Regulatory Developments).

As the book describes, some of the strategies pursued by the HFT traders were made possible because of the way the code underlying the system was constructed. As Lessig pointed out way back in Code and Other Laws of Cyberspace, and revisited in Codev2:

There is regulation of behavior on the Internet and in cyberspace, but that regulation is imposed primarily through code. The differences in the regulations effected through code distinguish different parts of the Internet and cyberspace. In some places, life is fairly free; in other places, it is more controlled. And the difference between these spaces is simply a difference in the architectures of control — that is, a difference in code.

The regulation imposed on the interconnected markets by code was gameable. Indeed, it seems that it could be argued that it was even designed to be gameable…

Another area in which the bots are gaming code structures is digital advertising. A highly amusing situation is described in the following graphic, taken from The Bot Baseline: Fraud in Digital Advertising (via http://www.ana.net/content/show/id/botfraud):


A phantom layer of “ad laundering” fake websites whose traffic comes largely from bots is used to generate ad-impression revenue. (Compare this with networks of bots on social media networks that connect to each other, send each other messages, and so on, to build up “authentic” profiles of themselves, at least in terms of traffic usage dynamics. Examples: MIT Technology Review on Fake Persuaders; or this preprint on The Rise of Social Bots.)

As the world becomes more connected and more and more markets become exercises simply in bit exchange, I suspect we’ll be seeing more and more of these phantom layer/bot audience combinations on the one hand, and high-speed, market stealing, front running algorithms on the other.

PS Not quite related, but anyway: how you’re being auctioned in realtime whenever you visit a website that carries ads – The Curse of Our Time – Tracking, Tracking Everywhere.

Written by Tony Hirst

April 8, 2015 at 10:05 am

Posted in Anything you want

Data Journalism in Practice

For the last few years, I’ve been skulking round the edges of the whole “data journalism” thing, pondering it, dabbling with related tools, technologies and ideas, but never really trying to find out what the actual practice might be. After a couple of twitter chats and a phone conversation with Mark Woodward (Johnston Press), one of the participants at the BBC College of Journalism data training day held earlier this year, I spent a couple of days last week in the Harrogate Advertiser newsroom, pitching questions to investigations reporter and resident data journalist Ruby Kitchen, and listening in on the development of an investigations feature into food inspection ratings in and around the Harrogate area.

Here’s a quick debrief-to-self of some of the things that came to mind…

There’s not a lot of time available and there’s still “traditional” work to be done
One of Ruby’s takes on the story was to name low ranking locations, and contact each one that was going to be named to give them a right to response. Contacting a couple of dozen locations takes time and diplomacy (which even then seemed to provoke a variety of responses!), as does then writing those responses into the story in a fair and consistent way.

Even simple facts can take the lead in a small story
…for example, x% of schools attained the level 5 rating, something that can then also be contextualised and qualified by comparing it to other categories of establishment or national, regional or neighbouring locale averages. As a data junkie, it can be easy to count things by group, perhaps overlooking a journalistic take that many of these counts could be used as the basis of a quick filler story or space-filling, info-snack glanceable breakout box in a larger story.
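
The "x% of this category got the top rating" sort of fact can be pulled out with a simple count by group. Here is a minimal sketch of that, using only the Python standard library; the records below are made up for illustration (they are not the Harrogate data), and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical records: (establishment category, hygiene rating).
# These rows are invented for illustration only.
records = [
    ("School", 5), ("School", 5), ("School", 4), ("School", 5),
    ("Takeaway", 3), ("Takeaway", 5), ("Takeaway", 1), ("Takeaway", 2),
    ("Pub", 5), ("Pub", 4),
]

def share_with_rating(rows, rating):
    """Return {category: percentage of its establishments with the given rating}."""
    totals = Counter(category for category, _ in rows)
    hits = Counter(category for category, r in rows if r == rating)
    return {category: 100.0 * hits[category] / totals[category] for category in totals}

def overall_share(rows, rating):
    """Percentage of all establishments, whatever the category, with the given rating."""
    return 100.0 * sum(1 for _, r in rows if r == rating) / len(rows)

shares = share_with_rating(records, 5)
for category, pct in sorted(shares.items()):
    print(f"{pct:.0f}% of {category} establishments attained a rating of 5")
print(f"Across all categories: {overall_share(records, 5):.0f}%")
```

Each printed line is a candidate breakout-box fact, and the all-categories figure gives the comparison that qualifies it.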

Is the story tellable?
Looking at data, you can find all sorts of things that are perhaps interesting in their subtlety or detail, but if you can’t communicate a headline or what’s interesting in a few words, it maybe doesn’t fit… (Which is not to say that data reporting needs to be dumbed down or simplistic…) Related to this is the “so what?” question. (I guess for news, if you wouldn’t share it in the pub or over dinner having read it – that is, if you wouldn’t remark on it – you’d have to ask: is it really that interesting? Hmm… is “Liking” the same as remarking on something? I get the feeling it’s less engaged…)

There’s a huge difference between the tinkering I do and production warez

I have all manner of pseudo-workflows that allow me to generate quick sketches in an exploratory data analysis sort of way, but things that work for the individual “researcher” are not the same as things that can work in a production environment. For example, I knocked up a quick interactive map using the folium library in an IPython notebook, but there are several problems with this:

  1. to regenerate the map requires someone having an IPython notebook environment set up and appropriate libraries installed
  2. there is a certain “distance” between producing a map as a single HTML file and getting the map actually published. For example, the HTML page pulls in all manner of third party files (javascript, css, image tiles, marker-icon/css-sprite image files and so on), and someone has to work out whether (and if so, where) to host these various resources on a local production server so as not to inappropriately draw them down from a third party server.
  3. there isn’t much time available… so you need to think about what to offer. For example:
    • the map I produced was a simple one – just markers and popups. At the time, I hadn’t worked out how to colour the markers or add new icons to them (and I still don’t have a route for putting numbers into the markers…), so the look is quite simple (and clunky)
    • there is no faceted navigation – so you can’t for example search for particular sorts of establishment or locations with a particular rating.

    Given more time, it would have been possible to consider richer, faceted navigation, for example, but for a one off, what’s reasonable? If a publisher starts to use more and more maps, one possible workflow may be to iterate on previous precedents. (To an extent, I tried to do this with things I’ve posted on the OU OpenLearn site over the years. For example, first step was to get a marker displaying map embedded, which required certain technical things being put in place the first time but could then be reused for future maps. Next up was a map with user submitted marker locations – this represented an extension of the original solution, but again resulted in a new precedent that could be reused and in turn extended or iterated on again.)

    This suggests an interesting development process in which ever richer components can perhaps be developed iteratively over an extended period of time or set of either related or independent stories, as the components are used in more and more stories. Where a media group has different independent publications, other ways of iterating are possible…

    The whole tech angle also suggests that a great stumbling block to folk getting (interactive) data elements up on a story page is not just the discovery of the data, the processing and cleaning of it, and the generation of the initial sketch to check it could be something that could add to the telling of a story (each of which may require a particular set of skills), but also the whole raft of production related issues that then result (which could require a whole raft of other technical skills, which are, for example, skills I know that I don’t really have, even given my quirky skillset…). And if the corporate IT folk take ownership of the publication element, there is then a cascade back upstream of constraints relating to how the data needs to be treated so it can fit in with the IT production system workflow.
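
One small step towards closing that “distance” is simply listing which third party hosts a generated map page draws on, so someone can decide what needs hosting locally. The following is a minimal sketch using only the Python standard library. The sample page and its hostnames are hypothetical stand-ins, not the output of any real map library, and map tiles requested by javascript at run time will not show up in a static scan like this.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ResourceLister(HTMLParser):
    """Collect the URLs of scripts, images and linked files referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])

def third_party_hosts(html):
    """Return the sorted set of hostnames the page pulls resources from."""
    parser = ResourceLister()
    parser.feed(html)
    hosts = {urlparse(url).netloc for url in parser.resources}
    hosts.discard("")  # relative URLs have no host: they are served locally
    return sorted(hosts)

# Hypothetical page standing in for a generated single-file map.
sample_page = """
<html><head>
<link rel="stylesheet" href="https://cdn.example.org/map/map.css"/>
<script src="https://cdn.example.org/map/map.js"></script>
<script src="js/local-setup.js"></script>
</head><body>
<img src="//static.example.net/icons/marker-icon.png">
</body></html>
"""

hosts = third_party_hosts(sample_page)
print(hosts)
```

Anything in the resulting list is a dependency on someone else’s server that the production folk would need to know about.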

I tend to use ggplot a lot in R for exploring datasets graphically, rather than for producing presentation graphics to support the telling of a particular story. Add to that, I’m still not totally up to speed on charting in the python context, and the result is that I didn’t really (think to) explore how graphical, chart based representations might be used to support the story. One thing that charts can do – like photographs – is take up arbitrary amounts of space, which can be a Good Thing (if you need to fill the space) or a Bad Thing (if space is at a premium, or page (print or web) layout is a constraint, perhaps due to page templating allowances, for example).

Some things I didn’t consider but that now come to mind are:

  1. how are charts practically handed over? (As Excel charts? as image files?)
  2. does a sub-editor or web-developer then process the charts somehow?
  3. for print, are there limitations on use of colour, line thickness, font-size and style?

Print vs Web
I didn’t really consider this, but in terms of workflow and output, are different styles of presentation required for:

  • text
  • data tables
  • charts
  • maps

Many code based workflows now allow you to “style” outputs in the same way you can style web pages (eg the CSS Zen Garden sites are all visually distinct but have exactly the same content – just the style is changed; thinks: data zen garden.. hmmm… (and related: chart redesigns…)). For example, in the python environment ggplot or Seaborn style charts can be styled visually using themes to generate charts that can be saved as image files, for example, or converted to interactive web charts (using eg mpld3, which converts base matplotlib charts (which ggplot and seaborn generate) to d3js interactive charts); alternatively, libraries such as pandas highcharts (or in the R context, rCharts) let you generate interactive charts using well-developed javascript chart libraries.

If you want data tables, there are various libraries or tools for styling tables, but again the question of workflow and the actual form in which items are handed over for print or web publication needs to be considered.

Being right/being wrong
Every cell in a data table is a “fact”. If your code is wrong and one column, or row, or cell is wrong, that can cause trouble. When you’re tinkering in private, that doesn’t matter so much – every cell can be used as the basis for another question that can be used to test, probe or check that fact further. If you publish that cell, and it’s wrong, you’ve made a false claim… Academics are cautious and don’t usually like to commit to anything without qualifying it further (sic;-). I trust most data, metadata and my own stats skills little enough that I see stats’n’data as a source that needs corroborating, which means showing it to someone else with my conclusions and a question along the lines of “it seems to me that this data suggests that – would you agree?”. This perhaps contrasts with relaying a fact (eg a particular food hygiene score) and taking it as-is as a trusted fact, given it was published from a trusted authoritative source, obtained directly from that source, and not processed locally, but then asking the manager of that establishment for a comment about how that score came about or what action they have taken as a result of getting it.

I’m also thinking it’d be interesting to compare the similarities and differences between journalists and academics in terms of their relative fears of being wrong…!

Human Factors
One of the things I kept pondering – and have been pondering for months – is the extent to which templated analyses can be used to create local “press release” story packs around national datasets that can be customised for local or regional use. That’s a far more substantial topic for another day, but it was put into relief last week by my reading of Nick Carr’s The Glass Cage which got me thinking about the consequences of “robot” written stories… (More about that in a forthcoming post.)

Lots of skills issues, lots of process and workflow issues, lots of story discovery, story creation, story telling and story checking issues, lots of production constraints, lots of time constraints. Fascinating. Got me really excited again about the challenges of, and opportunities for, putting data to work in a news context…:-)

Thanks to all at the Harrogate Advertiser, in particular Ruby Kitchen for putting up with my questions and distractions, and Mark Woodward for setting it all up.

Written by Tony Hirst

March 25, 2015 at 2:48 pm

Posted in Anything you want

Software Apps As Independent, Free Running, Self-Contained Services

with 4 comments

The buzz phrase for elements of this (I think?) is microservices or microservice architecture (“a particular way of designing software applications as suites of independently deployable services”, [ref.]) but the idea of being able to run apps anywhere (yes, really, again…!;-) seems to have been revitalised by the recent excitement around, and rapid pace of development of, docker containers.

Essentially, docker containers are isolated/independent containers that can be run in a single virtual machine. Containers can also be linked together within a VM so that they can talk to each other and yet remain isolated from other containers in the same VM. Containers can also expose services to the outside world.

In my head, this is what I think various bits and pieces of it look like…


A couple of recent announcements from docker suggest to me at least one direction of travel that could be interesting for delivering distance education and remote and face-to-face training. They include:

  • docker compose (fig, as was) – “with Compose, you define your application’s components – their containers, their configuration, links, volumes, and so on – in a single file, then you can spin everything up with a single command that does everything that needs to be done to get your application running.”
  • docker machine – “a tool that makes it really easy to go from ‘zero to Docker’. Machine creates Docker Engines on your computer, on cloud providers, and/or in your data center, and then configures the Docker client to securely talk to them.” [Like boot2docker, but supports cloud as well?]
  • Kitematic UI – “Kitematic completely automates the Docker installation and setup process and provides an intuitive graphical user interface (GUI) for running Docker containers on the Mac.” [Windows version coming soon]
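
To give a feel for the “single file” idea behind Compose, here is a rough sketch of the sort of structure such a file might describe: two linked containers, one of which exposes a port and mounts a volume. Compose files are normally written in YAML. Since JSON is, near enough, a subset of YAML, the structure is sketched here as a Python dict and dumped as JSON. The service names, image names, port and paths are all hypothetical, and the layout (service names as top-level keys) is an assumption based on the original fig-style format.

```python
import json

# Hypothetical two-container application: a notebook server linked to a database.
compose = {
    "notebook": {
        "image": "example/notebook",             # hypothetical image name
        "ports": ["8888:8888"],                  # host:container port exposed to the outside world
        "links": ["db"],                         # lets this container talk to the db container
        "volumes": ["./notebooks:/notebooks"],   # share a host directory with the container
    },
    "db": {
        "image": "example/database",             # hypothetical image name
    },
}

def linked_services(config):
    """Map each service name to the list of services it links to."""
    return {name: spec.get("links", []) for name, spec in config.items()}

print(json.dumps(compose, indent=2))
print(linked_services(compose))
```

A per-course composition of this kind is the sort of thing that could, in principle, be handed to students alongside the individual app containers.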

I don’t think there is GUI support for configuration management provided out of docker directly, but presumably if they don’t buy up something like panamax they’ll be releasing their own version of something similar at some point soon?!

(With the data course currently in meltdown, I’m tempted to add a bit more to the confusion by suggesting we drop the monolithic VM approach and instead go for a containerised approach, which feels far more elegant to me… It seems to me that with a little bit of imagination, we could come up with a whole new way of supporting software delivery to students. eg an OU docker hub with an app container for each app we make available to students, container compositions for individual courses, a ‘starter kit’ DVD (like the old OLA CD-ROM) with a local docker hub to get folk up and running without big downloads etc etc. ..) It’s unlikely to happen of course – innovation seems to be too risky nowadays, despite the rhetoric…:-(

As well as being able to run docker containers locally or in the cloud, I also wonder about ‘plug and play’ free running containers that run on a wifi enabled Raspberry Pi that you can grab off the shelf, switch on, and immediately connect to? So for example, a couple of weeks ago Wolfram and Raspberry announced the Wolfram Language and Mathematica on Raspberry Pi, for free [Wolfram’s Raspberry Pi pages]. There are also crib sheets for how to run docker on a Raspberry Pi (the downside of this being that you need ARM based images rather than x86 ones), which could be interesting?

So pushing the thought a bit further, for the mythical submariner student who isn’t allowed to install software onto their work computer, could we give them a Raspberry Pi running their OU course software as a service they could remotely connect to?!

PS by the by, at the Cabinet Office Code Club I help run for Open Knowledge last week, we had an issue with folk not being able to run OpenRefine properly on their machines. Fortunately, I’d fired up a couple of OpenRefine containers on a cloud host so we could still run the session as planned…

Written by Tony Hirst

March 25, 2015 at 10:38 am

Posted in Anything you want

Tech Company Inspired Driverless Cars Don’t Cause Accidents (Apparently), But Their Mobile Phones Do, Presumably..?

A few weeks ago, the UK Department for Transport (DfT) published a summary report and action plan entitled The Pathway to Driverless Cars that sets out the UK response to car manufacturers and tech companies pushing to develop, test and produce autonomous vehicles on public roads. This was followed by an announcement in the recent (March, 2015) budget where the Chancellor announced that “we are going to back our brilliant automotive industry by investing £100 million to stay ahead in the race to driverless technology” (Hansard), and a new code of practice around the testing of driverless vehicles to appear sometime this Spring (2015).

One of the claims widely made in favour of autonomous vehicles/driverless cars by the automotive lobby is that in testing they have a better safety record than human drivers. (Human error plays a role in many accidents.) Whilst the story that a human driver crashes Google’s self-driving car is regularly wheeled out to illustrate how Google’s cars are much safer than humans, we don’t tend to see so many stories about when the human test driver had to take control of the vehicles, whether to avoid an accident, or because road and/or traffic conditions were “inappropriate” for “driverless” operation.

Nor do we hear much about the technology firms making road transport safer by doing something to mitigate the role that their technology plays in causing accidents.

As Nick Carr writes in his history of automation, The Glass Cage:

It’s worth noting Silicon Valley’s concern with highway safety, though no doubt sincere, has been selective. The distractions caused by cellphones and smartphones have in recent years become a major factor in car crashes. An analysis by the National Safety Council [in the US] implicated phone use in one-fourth of all accidents on US roads in 2012. Yet Google and other top tech firms have made little or no effort to develop software to prevent people from calling, texting or using apps while driving — surely a modest undertaking compared with building a car that can drive itself.

To see what proportion of road traffic incidents involved distractions caused by mobile phones, I thought I’d check the STATS19 dataset. This openly published dataset records details of UK road accidents involving casualties reported to the police. The form used to capture the data includes information relating to up to six contributory factors, including “Driver using mobile phone”.


Unfortunately, the data collected on this part of the form is deemed to be RESTRICTED as opposed to UNCLASSIFIED (the latter classification applying to those elements released in the STATS19 dataset), which means we can’t do any stats around this from the raw data. (I think the reason the data is not released is that it may be used to help de-anonymise incident data by triangulating information contained in the dataset with information gleaned from local news reports about an incident, for example?)

The closest it seems we can get are the DfT’s annual reported road casualties reports (eg 2013) and an old DfT mobile phone usage survey.

The release of the 2013 annual report is supported by a set of statistical tables that break down accidents in all sorts of ways including two tables (ras50012.xls and ras50016.xls) that summarise accident counts on a local authority basis that include mobile phones as a contributory factor in reported incidents. So for example, in England in 2013 there were 384 such incidents. (It is not clear how many of the 2,669 incidents that included a “distraction in vehicle” might also have related to distractions caused by mobile phones particularly… Nor is it clear what the severity or impact of incidents with mobile phones recorded as a contributory factor actually were…)

In terms of autonomous vehicle safety, and how the lobbying groups pitch their case, it would also be interesting to know how autonomous vehicles are likely to cope in the context of other contributory factors, such as vision affected by external factors (10,272 (11%) in England in 2013), pedestrian factors only (11,877 (12%)), vehicle defects (1,757, (2%)), or road environment contributed (12,436 (13%)). For cases where there was an “impairment or distraction” in general (12,162 (13%)), it would be interesting to know what would be likely to happen in an autonomous vehicle where the vehicle tried to hand control back to a supervising human driver… (Note that percentages across contributory factors do not sum to 100% – incidents may have had several contributory factors.)
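
The note about percentages not summing to 100% is easy to see with a toy example. The sketch below uses four made-up incidents (not STATS19 data), each tagged with one or more contributory factors, and computes the share of incidents in which each factor appears.

```python
from collections import Counter

# Hypothetical incidents, each with one or more contributory factors.
# Invented purely to illustrate the arithmetic.
incidents = [
    {"vision affected", "road environment"},
    {"impairment or distraction"},
    {"pedestrian only"},
    {"impairment or distraction", "road environment"},
]

def factor_percentages(rows):
    """Percentage of incidents in which each contributory factor was recorded."""
    counts = Counter(factor for factors in rows for factor in factors)
    return {factor: 100.0 * n / len(rows) for factor, n in counts.items()}

pcts = factor_percentages(incidents)
for factor, pct in sorted(pcts.items()):
    print(f"{factor}: {pct:.0f}%")
print(f"Sum of percentages: {sum(pcts.values()):.0f}%")
```

Because two of the four incidents carry two factors each, the per-factor shares add up to well over 100% even though every share is calculated against the same total number of incidents.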

As technology continues to offer ever more “solutions” to claimed problems, I’m really mindful that we need to start being more critical of it and the claims made in pushing particular solutions. In particular, three things concern me: 1) that if we look at the causes of problems that technology claims to fix, maybe technology is contributing to the problem (and the answer is not to apply more technology to treat problems caused by other technology); 2) that we don’t tend to look at the systemic consequences of applying a particular technology; 3) that we don’t tend to recognise how adopting a particular technology can lock us in to a particular set of (inflexible) technology mediated practices, nor how we change our behaviour to suit that technological solution.

On balance, I’m probably negative on the whole tech thing, even though I guess I work within it…

Written by Tony Hirst

March 24, 2015 at 1:19 pm

Posted in Anything you want

Geographical Rights Management, Mesh based Surveillance, Trickle-Down and Over-Reach

with 2 comments

Every so often there’s a flurry of hype around the “internet of things”, but in many respects it’s already here – and has been for several decades. I remember as a kid being intrigued by some technical documents describing some telemetry system or other that remote water treatment plants used to transmit status information back to base. And I vaguely remember from a Maplin magazine around the time an article or two about what equipment you needed to listen in on, and decode, the radio chatter of all manner of telemetry systems.

Perhaps the difference now is a matter of scale – it’s easier to connect to the network, comms are bidirectional (you can receive as well as transmit information), and with code you can effect change on receipt of a message. The tight linkage between software and hardware – bits controlling atoms – also means that we can start to treat more and more things as “plant” whose behaviour we can remotely monitor, and govern.

A good example of how physical, consumer devices can already be controlled – or at least, disabled – by a remote operator is described in a New York Times article that crossed my wires last week, Miss a Payment? Good Luck Moving That Car, which describes how “many subprime borrowers [… in the US] must have their car outfitted with a so-called starter interrupt device, which allows lenders to remotely disable the ignition. Using the GPS technology on the devices, the lenders can also track the cars’ location and movements.” As the loan payment due date looms, it seems that some devices also emit helpful beeps to remind you…. And if your car loan agreement stipulates you’ll only drive within a particular area, I imagine that you could find it’s been geofenced. (A geofence is a geographical boundary line that can be used to detect whether a GPS tracked device has passed into, or exited from, a particular region. When used to disable a device that leaves – or enters – a particular area, as for example drones flying into downtown Washington, we might consider it a form of “location based management” (or “geographical rights management (GRM)”?!) that can disable activity in a particular location where someone who claims to control use of that device in that space actually exerts their control. (Think: DRM for location…))
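
The mechanics of a geofence check are simple enough to sketch. The example below assumes a circular fence (a centre point plus a radius), which is only one possible shape and is not taken from any vendor’s documentation. It uses the haversine formula to get the great-circle distance between GPS fixes, then reports each time a tracked device crosses the boundary. The coordinates and function names are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def inside_fence(position, centre, radius_km):
    """True if the position lies within radius_km of the fence centre."""
    return haversine_km(*position, *centre) <= radius_km

def crossings(track, centre, radius_km):
    """Return ('enter' | 'exit', position) for each fix where the device changes side."""
    events = []
    previous = None
    for position in track:
        current = inside_fence(position, centre, radius_km)
        if previous is not None and current != previous:
            events.append(("enter" if current else "exit", position))
        previous = current
    return events

# Hypothetical fence and GPS track (latitude, longitude in degrees).
centre = (54.0, -1.5)
track = [(54.0, -1.5), (54.05, -1.5), (54.2, -1.5), (54.01, -1.5)]

events = crossings(track, centre, radius_km=10)
print(events)
```

What happens on an “exit” event (a beep, a notification to the lender, a disabled ignition) is then a policy decision written into code, which is exactly the “DRM for location” worry.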

One of the major providers of “starter interrupt devices” is a company called PassTime (product list). Their products include:

  • PassTime Plus, the core of their “automated collection technology”.
  • Trax: “PassTime TRAX is the entry level GPS tracking product”. Includes: Pin point GPS location service, Up to Six (6) simultaneous Geo-fences.
  • PassTime GPS: “provides asset protection at an economical price while utilizing the same hardware and software platform of PassTime’s Elite Pro line of products. GPS tracking and remote vehicle disable features offer customers tools for a swift recovery if needed.” Includes: Pin point GPS location service, Remote vehicle disable option, Tow-Detect Notification, Device Tamper Notification, Up to Six (6) simultaneous Geo-fences, 24-Hour Tracking, Automatic Location Heartbeat
  • Elite-Pro: “the ultimate combination of GPS functionality and Automated Collection Technology”. Includes the PassTime GPS features but also mentions “Wireless Command Delivery”.

PassTime seem to like the idea of geofences so much they have patents in related technologies: PassTime Awarded Patent for Geo-Fence and Tamper Notice (US Patent: 8018329). You can find other related patents by looking up other patents held by the inventors (for example…).

You’ll be glad to know that PassTime have UK partners… in the form of The Car Finance Company, who are apparently “the world’s largest user and first company in the UK to start fitting Payment Reminder Technology to your new car”. Largest user?! That’s according to a recent [March 12, 2015] press release announcing an extension to their agreement that “will bring 70,000 payment assurance and telematics devices to the United Kingdom”.

Here’s how The Car Finance Company spin it: The Passtime system helps remind you when your repayments are due so you can ensure you stay on track with your loan and help repair and rebuild your credit. The device is only there to help you keep your repayments up to date, it doesn’t affect your car nor does it monitor the way you drive. From the recent press release, “PassTime has been supplying Payment Assurance and GPS devices to The Car Finance Company since 2009” (my emphasis). I’m not sure if that means the PassTime GPS (with the starter interrupt) or the Trax device? If I was a journalist, rather than a blogger, I’d probably phone them to try to clarify that…

In passing, whilst searching for providers of automotive GPS trackers in the UK (and there are lots of them – search on something like GPS fleet management, for example…) I came across this rather intrusive piece of technology, The TRACKER Mesh Network, which “uses vehicles fitted with TRACKER Locate and TRACKER Plant to pick up reply codes from stolen vehicles with an activated TRACKER unit making them even easier to locate and recover”. Which is to say, this company has an ad hoc, mobile, distributed network of sensors spread across the UK road network that listen out for each other and opportunistically track each other. It’s all good, though:

“The TRACKER Mesh Network will enable the police to extend the network of ‘eyes and ears’ to identify and locate stolen vehicles more effectively using advanced technology and allow us to stay one step ahead of criminals who are becoming more and more adept at stealing cars. This is a real opportunity for the motoring public to help us clamp down on car thieves and raises public confidence in our ability to recover their possessions and bring the offenders to justice.”

(By the by, previous notes on ANPR – Automatic Number Plate Recognition. Also, note the EU eCall accident alerting system that automatically calls for help if you have a car accident [about, UK DfT eCall cost/benefit analysis].)

This conflation of commercial and police surveillance is… to be expected. But the data’s being collected, and it won’t go away. Snowden revelations revealed the scope of security service data collection activities, and chunks of that data won’t be going away either. The scale of the data collection is such that it’s highly unlikely that we’re all being actively tracked or that this data will ever meaningfully contribute to the detection of conspiracies, but it can and will be used post hoc to create paranoid data driven fantasies about who could have met whom, when, discussed what, and so on.

I guess where we can practically start to get concerned is in considering the ‘trickle down’ way in which access to this data will increasingly be opened up, and/or sold, to increasing numbers of agencies and organisations, both public and private. As Ed Snowden apparently commented in a session at SXSW (Snowden at SXSW: Be very concerned about the trickle down of NSA surveillance to local police), “[t]hey’ve got everything. The question becomes, Now they’re empowered. They can leak [this stuff]. It does happen at the local level. These capabilities are created. High tech. Super secret. But they inevitably bleed over to law enforcement. When they’re brand new they’re only used in the extremes. But as that transition happens, more and more people get access, they use it in newer and more and more expansive and more abusive ways.”

(Trickle down – or over-reach – applies to legislation too. For example, from a story widely reported in April, 2008: Half of councils use anti-terror laws to spy on ‘bin crimes’, although the legality of such practices was challenged: Councils warned over unlawful spying using anti-terror legislation and guidance brought in in November 2012 that required local authorities to obtain judicial approval prior to using covert techniques. (I realise I’m in danger here of conflating things not specifically related to over-reach on laws “intended” to be limited to anti-terrorism associated activities (whatever they are) with over-reach…) Other reviews: Lords Constitution Committee – Second Report – Surveillance: Citizens and the State (Jan 2009), Big Brother Watch on How RIPA has been used by local authorities and public bodies and Cataloguing the ways in which local authorities have abused their covert surveillance powers. I’m guessing a good all round starting point would be the reports of the Independent Reviewer of Terrorism Legislation.)

When it comes to processing large amounts of data, finding meaningful, rather than spurious, connections between things can be hard… (Correlation is not causation, right?, as Spurious Correlations wittily points out…;-)

What is more manageable is dumping people onto lists and counting things… Or querying specifics. A major problem with the extended and extensive data collection activities going on at the moment is that access to the data to allow particular queries to be made will be extended. The problem is not that all your data is being collected now; the issue is that post hoc searches over it could be made by increasing numbers of people in the future. Like bad tempered council officers having a bad day, or loan company algorithms with dodgy parameters.

PS Schneier on connecting the dots.. Why Mass Surveillance Can’t, Won’t, And Never Has Stopped A Terrorist

Written by Tony Hirst

March 23, 2015 at 11:54 am

Posted in Anything you want

So What Can Text Analysis Do for You?

with 4 comments

Despite believing we can treat anything we can represent in digital form as “data”, I’m still pretty flakey on understanding what sorts of analysis we can easily do with different sorts of data. Time series analysis is one area – the pandas Python library has all manner of handy tools for working with that sort of data that I have no idea how to drive – and text analysis is another.

So prompted by Sheila MacNeill’s post about textexture, which I guessed might be something to do with topic modeling (I should have read the about, h/t @mhawksey), here’s a quick round up of handy things the text analysts seem to be able to do pretty easily…

Taking the lazy approach, I had a quick look at the CRAN natural language processing task view to get an idea of what sort of tool support for text analysis there is in R, and a peek through the NLTK documentation to see what sort of thing we might be readily able to do in Python. Note that this take is a personal one, identifying the sorts of things that I can see I might personally have a recurring use for…

First up – extracting text from different document formats. I’ve already posted about Apache Tika, which can pull text from a wide range of documents (PDFs, Word docs, images), and which seems to be a handy, general purpose tool. (Other tools are available, but I only have so much time, and for now Tika seems to do what I need…)

Second up, concordance views. The NLTK docs describe concordance views as follows: “A concordance view shows us every occurrence of a given word, together with some context.” So for example:


This can be handy for skimming through multiple references to a particular item, rather than having to do a lot of clicking, scrolling or page turning.
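To give a feel for what a concordance view involves, here is a minimal sketch in plain Python. It is not the NLTK implementation; the `concordance` function and its `width` parameter are made up for illustration, and it simply lines up each match with a fixed number of characters of context either side.

```python
import re

def concordance(text, word, width=30):
    """Return one line per occurrence of `word`, with `width` characters
    of context on either side (hypothetical helper, not the NLTK API)."""
    # Collapse line breaks and repeated spaces so the context reads cleanly
    text = " ".join(text.split())
    lines = []
    pattern = r"\b%s\b" % re.escape(word)
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad the left context so the matched word lines up in a column
        lines.append("%s [%s] %s" % (left.rjust(width), m.group(0), right))
    return lines

if __name__ == "__main__":
    sample = "The bots trade. Then the bots click. Nobody watches the bots."
    for line in concordance(sample, "bots", width=15):
        print(line)
```

Matching is case insensitive and on whole words only, which is an assumption about what you would want rather than anything the NLTK docs prescribe.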

How about if we want to compare the near co-occurrence of words or phrases in a document? One way to do this is graphically, plotting the “distance” through the text on the x-axis, and then for categorical terms on y marking out where those terms appear in the text. In NLTK, this is referred to as a lexical dispersion plot:
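The data behind that sort of plot is just a list of word offsets for each term. A rough sketch, assuming a very naive tokeniser and using a made-up `dispersion` function (this is not how NLTK draws its plot), might look like this, with a crude text strip standing in for the chart:

```python
import re

def dispersion(text, terms, bins=40):
    """For each term, return its word offsets in the text and a crude
    strip chart with a '|' in each bin where the term occurs."""
    words = re.findall(r"[a-z']+", text.lower())
    offsets = {t: [i for i, w in enumerate(words) if w == t.lower()]
               for t in terms}
    rows = {}
    for term, positions in offsets.items():
        row = ["."] * bins
        for p in positions:
            # Scale the word offset onto the available number of bins
            row[p * bins // max(len(words), 1)] = "|"
        rows[term] = "".join(row)
    return offsets, rows

if __name__ == "__main__":
    sample = "alice met bob then alice left and bob stayed"
    offsets, rows = dispersion(sample, ["alice", "bob"], bins=9)
    for term, row in rows.items():
        print("%-6s %s" % (term, row))
```

The offsets could equally be passed to a plotting library to get something closer to the NLTK view.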


I guess we could then scan across the distance axis using a windowing function to find terms that appear within a particular distance of each other? Or use co-occurrence matrices for example (eg Co-occurrence matrices of time series applied to literary works), perhaps with overlapping “time” bins? (This could work really well as a graph model – eg for 20 pages, set up page nodes 1-2, 2-3, 3-4, …, 18-19, 19-20, then an actor node for each actor, connecting actors to page nodes for page bins on which they occur; then project the bipartite graph onto just the actor nodes, connecting actors who were originally connected to the same page bin nodes.)
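The graph model idea can be sketched without a graph library. The following is an illustration only: the `actor_graph` function and its input format (a dict mapping consecutive page numbers to the set of actors on that page) are assumptions for the example. It builds the overlapping page bins, then projects onto actor pairs, counting how many bins each pair shares.

```python
from collections import defaultdict
from itertools import combinations

def actor_graph(pages):
    """pages: dict of page number -> set of actors mentioned on that page.
    Returns a dict of (actor, actor) -> number of shared page bins."""
    page_numbers = sorted(pages)
    # Overlapping page-bin nodes: (1, 2), (2, 3), (3, 4), ...
    bins = {}
    for a, b in zip(page_numbers, page_numbers[1:]):
        bins[(a, b)] = pages[a] | pages[b]
    # Project the actor/page-bin bipartite graph onto the actors:
    # two actors are linked if they connect to the same page bin
    edges = defaultdict(int)
    for actors in bins.values():
        for pair in combinations(sorted(actors), 2):
            edges[pair] += 1
    return dict(edges)

if __name__ == "__main__":
    pages = {1: {"Alice"}, 2: {"Bob"}, 3: {"Carol"}, 4: set()}
    print(actor_graph(pages))
```

The edge weight gives a rough measure of how often two actors appear near each other; a proper graph library would make the projection step a one-liner, but the logic is the same.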

Something that could be differently useful is spotting common sentences that appear in different documents (for example, quotations). There are surely tools out there that do this, though offhand I can’t find any..? My gut reaction would be to generate a sentence list for each document (eg using something like the handy looking textblob python library), strip quotation marks and whitespace, etc, sort each list, then run a diff on them and pull out the matched lines. (So a “reverse differ”, I think it’s called?) I’m not sure if you could easily also pull out the near misses? (If you can help me out on how to easily find matching or near matching sentences across documents via a comment or link, it’d be appreciated…:-)
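For what it's worth, here is a rough sketch of that gut reaction using only the Python standard library. It is not a recommendation of a particular tool: the sentence splitter is a naive regular expression rather than textblob, the function names are invented for the example, and the near misses are found with `difflib.SequenceMatcher` and an arbitrary similarity threshold.

```python
import re
from difflib import SequenceMatcher

def sentences(text):
    """Very naive sentence splitter: break after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def normalise(sentence):
    """Lower-case, strip quotation marks and collapse whitespace."""
    s = re.sub(r"[\"'“”‘’]", "", sentence.lower())
    return " ".join(s.split())

def shared_sentences(doc_a, doc_b, threshold=0.9):
    """Return (exact matches, near misses) between two documents."""
    a = {normalise(s) for s in sentences(doc_a)}
    b = {normalise(s) for s in sentences(doc_b)}
    exact = a & b
    # Compare the leftovers pairwise; ratio() is 1.0 for identical strings
    near = [(x, y)
            for x in sorted(a - exact)
            for y in sorted(b - exact)
            if SequenceMatcher(None, x, y).ratio() >= threshold]
    return exact, near

if __name__ == "__main__":
    doc_a = "Code is law. The markets were gamed by fast bots."
    doc_b = 'It was a long day. "Code" is  law. The markets were gamed by fast bot.'
    exact, near = shared_sentences(doc_a, doc_b)
    print(exact)
    print(near)
```

The set intersection does the job of the sort-and-diff step, and the pairwise comparison of leftovers is the expensive bit if the documents are long.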

The more general approach is to just measure document similarity – TF-IDF (Term Frequency – Inverse Document Frequency) and cosine similarity are key phrases here. I guess this approach could also be applied to sentences to find common ones across documents, (eg SO: Similarity between two text documents), though I guess it would require comparing quite a large number of sentences (for ~N sentences in each doc, it’d require N^2 comparisons)? I suppose you could optimise by ignoring comparisons between sentences of radically different lengths? Again, presumably there are tools that do this already?
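As a sketch of the sentence comparison idea, the following computes cosine similarity over raw word counts (so plain term frequency rather than TF-IDF weighting, to keep it short) and skips pairs of radically different lengths before doing the comparison. The function names, the threshold and the length ratio are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def vectorise(sentence):
    """Bag-of-words term counts for a sentence."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def similar_pairs(sents_a, sents_b, threshold=0.8, max_length_ratio=2.0):
    """Return (sentence_a, sentence_b, score) for pairs above threshold,
    ignoring pairs whose word counts differ by more than max_length_ratio."""
    vecs_b = [(s, vectorise(s)) for s in sents_b]
    pairs = []
    for sa in sents_a:
        va = vectorise(sa)
        la = sum(va.values())
        for sb, vb in vecs_b:
            lb = sum(vb.values())
            # Cheap length filter before the more expensive comparison
            if not la or not lb or max(la, lb) / min(la, lb) > max_length_ratio:
                continue
            score = cosine(va, vb)
            if score >= threshold:
                pairs.append((sa, sb, score))
    return pairs

if __name__ == "__main__":
    print(similar_pairs(["the cat sat on the mat"],
                        ["the cat sat on a mat", "dog"]))
```

The length filter only trims the constant; the number of candidate pairs still grows with the product of the two sentence counts.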

Unlike simply counting common words that aren’t stop words in a document to find the most popular words in a doc, TF-IDF moderates the simple count (the term frequency) with the inverse document frequency. If a word is popular in every document, the term frequency is large and the document frequency is large, so the inverse document frequency (one divided by the document frequency) is small – which in turn gives a reduced TF-IDF value. If a term is popular in one document but not any other, the document frequency is small and so the inverse document frequency is large, giving a large TF-IDF for the term in the rare document in which it appears. TF-IDF helps you spot words that are rare across documents or uncommonly frequent within documents.
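A minimal sketch of the calculation, under a couple of stated assumptions: the term frequency is normalised by document length, and the inverse document frequency uses the common logarithmic form log(N / document frequency) rather than the plain reciprocal described above. The `tf_idf` function name is made up for the example.

```python
import math
import re
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf score} dict per document."""
    tokenised = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))
    n = len(docs)
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores.append({t: (c / len(tokens)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

if __name__ == "__main__":
    docs = ["the bots trade", "the bots click", "the council spies"]
    for doc, score in zip(docs, tf_idf(docs)):
        print(doc, sorted(score.items(), key=lambda kv: -kv[1]))
```

With the log form, a word that appears in every document scores zero, while a word confined to a single document scores highest, which matches the intuition in the paragraph above.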

Topic models: I thought I’d played with these quite a bit before, but if I did the doodles didn’t make it as far as the blog… The idea behind topic modeling is to generate a set of key terms – topics – that provide an indication of the topic of a particular document. (It’s a bit more sophisticated than using a count of common words that aren’t stopwords to characterise a document, which is the approach that tends to be used when generating wordclouds…) There are some pointers in the comments to A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets about topic modeling in R using the R topicmodels package; this ROpenSci post on Topic Modeling in R has code for a nice interactive topic explorer; this notebook on Topic Modeling 101 looks like a handy intro to topic modeling using the gensim Python package.

Automatic summarisation/text summary generation: again, I thought I dabbled with this but there’s no sign of it on this blog:-( There are several tools and recipes out there that will generate text summaries of long documents, but I guess they could be hit and miss and I’d need to play with a few of them to see how easy they are to use and how well they seem to work/how useful they appear to be. The python sumy package looks quite interesting in this respect (example usage) and is probably where I’d start. A simple description of a basic text summariser can be found here: Text summarization with NLTK.
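To give a flavour of how a very basic extractive summariser can work, here is a crude sketch that scores each sentence by the average frequency of its non-stopword terms and keeps the top few in their original order. It is not what sumy does, nor necessarily the approach in the linked NLTK write-up; the tiny stopword list, the naive sentence splitter and the `summarise` function are all assumptions for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, just for the example
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "it", "that"}

def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]

def summarise(text, n_sentences=2):
    """Return the n highest scoring sentences, in document order."""
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average term frequency, so long sentences are not favoured
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sents)), key=lambda i: score(sents[i]),
                    reverse=True)
    return [sents[i] for i in sorted(ranked[:n_sentences])]

if __name__ == "__main__":
    sample = "Bots trade stocks. Bots click adverts. The weather was nice."
    print(summarise(sample, n_sentences=2))
```

Anything this simple is likely to be hit and miss on real documents, which is rather the point of wanting to try out a few of the existing packages.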

So – what have I missed?

PS In passing, see this JISC review from 2012 on the Value and Benefits of Text Mining.

Written by Tony Hirst

March 2, 2015 at 2:06 pm

Posted in Anything you want, Rstats

Open Practice and My Academic Philosophy, Sort Of… Erm, Maybe… Perhaps..?!

with 2 comments

Having got my promotion case through the sub-Faculty level committee (with support and encouragement from senior departmental colleagues), it’s time for another complete rewrite to try to get it through the Faculty committee. Guidance suggests that it is not inappropriate – and may even be encouraged – for a candidate to include something about their academic philosophy, so here are some scribbled thoughts on mine…

One of the declared Charter objects (sic) of the Open University is "to promote the educational well-being of the community generally", as well as "the advancement and dissemination of learning and knowledge". Both as a full-time PhD student with the OU (1993-1997), and then as an academic (1999-), I have pursued a model of open practice, driven by the idea of learning in public, with the aim of communicating academic knowledge into, and as part of, wider communities of practice, modeling learning behaviour through demonstrating my own learning processes, and originating new ideas in a challengeable and open way as part of my own learning journey.

My interest in open educational resources is in part a subterfuge, driven by a desire that educators be more open in demonstrating their own learning and critical practices, including the confusion and misconceptions they grapple with along the way, rather than being seen simply as professors of some sort of inalienable academic truth.

My interest in short course development is based on the belief that for the University to contribute effectively to continued lifelong education and professional development, we need to have offerings that are at an appropriate level of granularity as well as academic level. Degrees represent only one part, an early one, of that journey. Learners are unlikely to take more than one undergraduate degree in their lifetime, but there is no reason why they should not continue to engage in learning throughout their life. Evidence from the first wave of MOOCs suggests that many participants in those courses were already graduates, with an appreciation of the values of learning and the skills to enable them to engage with those offerings. The characterisation of MOOCs as xMOOCs (traditional course style offerings) or the looser, network-modelled "connectivist MOOCs", cMOOCs [H/T @r3becca in the comments;-)], represents different educational philosophies: the former may cruelly be described as being based on a model in which the learner expects to be taught (and the instructors expect to profess), whereas the latter requires that participants are engaged in a more personal, yet still collaborative, learning journey, where it is up to each participant to make sense of the world in an open and public way, informed and aided, but also challenged, by other participants. That's how I work every day. I try to make sense of the world to myself, often for a purpose, in public.

Much of my own learning is the direct result of applied problem solving. I try to learn something every day, often as the result of trying to do something each day that I haven't been able to do before. The OUseful.info blog is my own learning diary and a place I can look to refer to things I have previously learned. The posts are written in a way that reinforces my own learning, as a learning resource. The posts often take longer to write than the time taken to discover or originate the thing learned, because in them I try to represent a reflection and retelling of the rationale for the learning event and the context in which it arose: a problem to be solved, my state of knowledge at the time, the means by which I came to make sense of the situation in order to proceed, and the learning nugget that resulted. The thing I can see or do now but couldn't before. Capturing the "I couldn't do X because of Y but now I can, by doing Z" supports a similar form of discovery as the one supported by question and answer sites: the content is auto-optimised to include both naive and expert information, which aids discovery. (It often amused me that course descriptions would often be phrased in the terms and language you might expect to know having completed the course. Which doesn't help the novice discover it a priori, before they have learned those keywords, concepts or phrases that the course will introduce them to...). The posts also try to model my own learning process, demonstrating the confusion, showing where I had a misapprehension or just plain got it wrong. The blog also represents a telling of my own learning journey over an extended period of time, and as such may be thought of as an uncourse, something that could perhaps be looked at post hoc as a course but that was originated as my own personal learning journey unfolded.

Hmmm… 1500 words for the whole begging letter, so I need to cut the above down to a sentence…

Written by Tony Hirst

February 19, 2015 at 3:18 pm

Posted in Anything you want

