Time, Yet, for Twitter Captions on BBC iPlayer Content?

A couple of days ago, the Guardian reported a quote from Dimblebobs about Question Time being bigger than X-Factor on Twitter (How Question Time got as big as The X Factor on Twitter); so when are we going to see optional Twitter captions made available, either in real time, or on catch-up services such as iPlayer? (If you haven’t been keeping up: Twitter captions/subtitles are captions generated as an overlay for a video, based on tweets from members of a particular Twitter list, or tweets using a particular hashtag. In the future, it might also be worth considering the capture of tweets based on location? Martin Hawksey has been developing several tools in this area: Twitter subtitling. His most recent demonstration, iTitle: Full circle with Twitter subtitle playback in YouTube (ALT-C 2010 Keynotes), describes how videos of the ALT-C 2010 keynotes have recently been republished along with searchable Twitter captions.)

As Martin hinted at in What they were saying: Leaders debate on BBC iPlayer with twitter subtitles from parliamentary candidates, and in the comments to that post, the volume and rate of production of tweets for a popular live event may be too great to display them all via the caption feed and still give the viewer time to read them. Which means that, for high volume backchannels, tweets need filtering or sampling (ideally in a way that avoids undue bias?) in order to limit the number (and quality?) of tweets that are actually displayed as captions. So what are the options?

First of all, we should distinguish whether we intend to work on a live feed or an archive feed. An archive feed means that samples or filters can be in part tuned according to a post hoc analysis of all the tweets; whereas a live feed may either work in a stateless way, judging whether or not to show any individual tweet based solely on its own merits (for example, showing any particular tweet with given, fixed probability p), or work based at least in part on the history of tweets already observed.

I think we should also distinguish between sampling tweets and filtering them. By sampling, I mean selecting each individual tweet according to probability p independently of any other information; by filtering, I mean selecting a tweet based on whether it, or its metadata, contains a particular term (for example: only selecting tweets from certain individuals, blocking tweets starting RT, and so on). Note that both sampling and filtering may feature in the selection of tweets for display, in either order (sample then filter, or filter then sample), or in more elaborate combinations (sample, filter, sample, for example).

So what strategies are there..? Note that this isn’t a very principled list (been a long day!), and it is likely to be far from complete, but it’s a start, and something to mull over at least…

Sampling
– display every nth tweet;
– display the most recently received tweet in the last x seconds every y seconds;
– display any given tweet with fixed probability, p.
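
The three sampling strategies above might be sketched in Python along the following lines. This is only a rough illustration, not a tested captioning tool: the function names are made up for the example, and tweets are assumed to arrive as (timestamp in seconds, text) pairs already sorted by time.

```python
import random


def every_nth(tweets, n):
    """Keep the n-th, 2n-th, 3n-th ... tweet."""
    return [t for i, t in enumerate(tweets, start=1) if i % n == 0]


def latest_every_y(tweets, x, y, duration):
    """Every y seconds, show the most recent tweet from the last x seconds.

    tweets: list of (timestamp_seconds, text), sorted by time (assumed).
    duration: length of the programme in seconds (assumed known).
    Returns a list of (display_time, text) pairs.
    """
    shown = []
    for tick in range(y, duration + 1, y):
        window = [t for t in tweets if tick - x < t[0] <= tick]
        if window:
            shown.append((tick, window[-1][1]))  # last one is the most recent
    return shown


def sample_with_probability(tweets, p, rng=None):
    """Keep each tweet independently with fixed probability p."""
    rng = rng or random.Random()
    return [t for t in tweets if rng.random() < p]
```

Note that if x is smaller than y, some tweets can never be shown, and quiet periods produce no caption at all; whether that is desirable is a design choice rather than something the sketch settles.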

Historyless Filtering
– filter out retweets (items starting RT);
– filter out tweets sent to a person (tweets starting @). (Note that this does mean we limit the extent to which conversations might be displayed);
– filter tweets based on some function of the number of friends and/or followers a sender has.
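
As a minimal sketch of how these stateless checks could be applied to each tweet on its own merits: the dictionary keys ("text", "followers") and the simple follower threshold are assumptions made for the example, not fields taken from any particular Twitter API response.

```python
def is_retweet(text):
    """Old-style retweets start with 'RT ' (assumed convention)."""
    return text.lstrip().startswith("RT ")


def is_directed(text):
    """Tweets sent to a person start with '@'."""
    return text.lstrip().startswith("@")


def passes_stateless_filter(tweet, min_followers=0):
    """Decide whether to show a tweet using only the tweet itself.

    tweet: dict with hypothetical keys 'text' and 'followers'.
    min_followers: one possible 'function of follower count', a plain threshold.
    """
    text = tweet["text"]
    if is_retweet(text) or is_directed(text):
        return False
    return tweet.get("followers", 0) >= min_followers
```

For example, `passes_stateless_filter({"text": "Good point on tuition fees", "followers": 120}, min_followers=50)` would let the tweet through, whereas the same text prefixed with "RT " or "@someone " would be dropped.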

History-based Filtering
– filter based on the number of tweets the user has already sent;
– filter based on properties of the hashtag network (for example, the number of hashtaggers following an individual). I have classed this as a history-based filter because we need some knowledge of the hashtag community, generated from a history of tweets, in order to calculate hashtag network metrics;
– filter based on the extent to which tweets are apparently part of a conversation thread (e.g. construct a conversation graph in which @a mentions @b and @b mentions @a, and select all conversations greater than a particular length). Note that we might combine this condition with other conditions, such as “where @a and @b share more than m common followers”.
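
Two of the history-based ideas above, limiting how many tweets any one user gets shown and spotting pairs of users who mention each other, could be sketched roughly as follows. The class and function names are invented for the illustration, tweets are assumed to be (sender, text) pairs, and the conversation detection only finds reciprocal mentions rather than measuring conversation length.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


class PerUserLimit:
    """Stateful filter: allow at most max_per_user tweets from each sender."""

    def __init__(self, max_per_user):
        self.max_per_user = max_per_user
        self.seen = Counter()  # tweets observed so far, per sender

    def allow(self, sender):
        self.seen[sender] += 1
        return self.seen[sender] <= self.max_per_user


def reciprocal_pairs(tweets):
    """Return the set of user pairs where each has mentioned the other.

    tweets: iterable of (sender, text) pairs (assumed shape).
    """
    mentions = set()
    for sender, text in tweets:
        for target in MENTION.findall(text):
            mentions.add((sender.lower(), target.lower()))
    return {
        tuple(sorted((a, b)))
        for a, b in mentions
        if a != b and (b, a) in mentions
    }
```

Because both pieces only look at tweets seen so far, they could run against a live feed as well as over a complete archive.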

Note that the filtering approach may be used either to filter out tweets and prevent them from being displayed, or to select tweets according to a particular set of criteria that means they should be displayed. In addition, filtering may be deterministic or combined with a probabilistic sampling mechanism. For example, we may choose to display a tweet with probability p, where p is a function of some ranking factor with value f. An alternative approach might be to generate a score for each tweet based on one or more ranking factors, as described in the filter considerations above, rank the tweets by score, and then display the one with the highest score at any given time.
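
To make the last two ideas a little more concrete, here is one possible sketch. The saturating mapping from f to p is just one arbitrary choice of function (any mapping into the range 0 to 1 would do), and the `scale` parameter, function names and tweet fields are all assumptions for the sake of the example.

```python
import math
import random


def display_probability(f, scale=1.0):
    """Map a non-negative ranking factor f to a probability in [0, 1).

    Larger f gives a higher chance of display; scale is a hypothetical
    tuning knob controlling how quickly the probability saturates.
    """
    return 1.0 - math.exp(-f / scale)


def maybe_display(f, rng, scale=1.0):
    """Probabilistic filter: show the tweet with probability p = p(f)."""
    return rng.random() < display_probability(f, scale)


def best_tweet(candidates, score):
    """Deterministic alternative: pick the highest-scoring candidate, if any."""
    return max(candidates, key=score, default=None)
```

For instance, `best_tweet(pending, lambda t: t["retweets"])` would choose the most retweeted of the tweets currently waiting to be shown, assuming each tweet carries a hypothetical "retweets" count.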

The history-based approach may be used in real time, making selections based on the tweets observed (and/or maybe just the tweets displayed) so far (until-now history), or, in cases where a Twitter caption file is being generated after the fact, through analysis of the whole hashtag archive corpus (total archive). So for example, it might be that the caption file is generated after the event for use only by catch-up viewers, with the expectation that live viewers would be able to entertain themselves directly from a live Twitter feed in their own client.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

3 thoughts on “Time, Yet, for Twitter Captions on BBC iPlayer Content?”

FWIW, since you prompted me (some time back!), here’s some quick thoughts on it.

You are looking for a way to choose tweets that would interest the viewer, among the deluge of too-many.

This is an issue of the age (Shirky, Weinberger etc), is it not? There’s too much (miscellaneous) information out there*, so who will and how will they filter?

(*Actually I’m not sure it is that simple. I think there is an argument to be made as to whether there’s really more information out there, but I’ll gloss over that for now.)

Generally, stuff on the web gets my attention through (in no particular order):

1 Authority/reputation. Eg I look at the BBC and the Guardian websites, and follow tweets from The London Review of Books.
2 Lots of people rate it – trending topics on twitter
3 Friends say ‘look at this’.
4 I predefine what I’m interested in (searching for stuff – inc. rss feeds from a customised search using pipes)
5 A bit (whole lot) of serendipity

Which of these can be reproduced in filtering your twitter captions?

I think #3 & #4 are excluded because you are considering a broadcast system.

Sampling would be all #5

#2 could be done by rating tweets according to how often they’ve been retweeted?

Your ideas about following a conversation also sort of links to #2 because if there is a conversation developing, it suggests some people think it is about something important.

Not sure who would decide who counts as an authority for #1, though of course it could easily be done.

Coming at it from another direction, how are the phone-calls to ‘Any answers’ (http://www.bbc.co.uk/programmes/b006qmmy) filtered? I don’t know, but I would imagine they:

i take representative calls (ie if lots of people are calling about the same thing, they’ll take at least one of them)
ii look for balance
iii look for novel viewpoints

#i is like #2, but I can’t see any way of automating #ii or #iii.

Tony Hirst says:

December 8, 2010 at 3:19 pm

Thanks for those comments; filtering out RTs, but using the fact that a tweet has been retweeted as a signal of something worth displaying, is a sensible approach for the archival processing I think?

As to strategies used on eg Any Questions – hmmm, that’s an interesting thought…