A couple of days ago, the Guardian reported a quote from Dimblebobs about Question Time being bigger than X-Factor on Twitter (How Question Time got as big as The X Factor on Twitter); so when are we going to see optional Twitter captions made available, either in real time, or on catch-up services such as iPlayer? (If you haven’t been keeping up: Twitter captions/subtitles are captions generated as an overlay for a video, based on tweets from members of a particular Twitter list or tweets using a particular hashtag. (In the future, it might also be worth considering the capture of tweets based on location?) Martin Hawksey has been developing several tools in this area: Twitter subtitling. His most recent demonstration – iTitle: Full circle with Twitter subtitle playback in YouTube (ALT-C 2010 Keynotes) – describes how videos of the ALT-C 2010 keynotes have recently been republished along with searchable Twitter captions.)
As Martin hinted in What they were saying: Leaders debate on BBC iPlayer with twitter subtitles from parliamentary candidates, and in the comments to that post, the volume and rate of production of tweets for a popular live event may be too great to display them all via the caption feed and still give the viewer time to read them. This means that, for high volume backchannels, tweets need filtering or sampling (ideally in a way that avoids undue bias?) in order to limit the number (and quality?) of tweets that are actually displayed as captions. So what are the options?
First of all, we should distinguish whether we intend to work on a live feed or an archive feed. An archive feed means that samples or filters can be tuned, at least in part, according to a post hoc analysis of all the tweets. A live feed, on the other hand, may either work in a stateless way, judging whether or not to show any individual tweet based solely on its own merits (for example, showing any particular tweet with a given, fixed probability p), or make the decision based at least in part on the history of tweets already observed.
I think we should also distinguish between sampling tweets and filtering them. By sampling, I mean selecting each individual tweet according to probability p, independently of any other information; by filtering, I mean selecting a tweet based on some property of the tweet or its metadata (for example: only selecting tweets from certain individuals, blocking tweets that start with RT, and so on). Note that both sampling and filtering may feature in the selection of tweets for display, in either order (sample then filter, or filter then sample), or in more elaborate combinations (sample, filter, sample, for example).
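To make the idea of chaining stages concrete, here is a minimal sketch in Python. It assumes tweets are plain text strings, and the helper names (sample, drop_if, pipeline) are made up for illustration; they don't come from any Twitter library.

```python
import random

def sample(p, rng):
    """Stage that keeps each tweet independently with probability p."""
    def stage(tweets):
        return (t for t in tweets if rng.random() < p)
    return stage

def drop_if(predicate):
    """Stage that filters out every tweet for which predicate(tweet) is true."""
    def stage(tweets):
        return (t for t in tweets if not predicate(t))
    return stage

def pipeline(*stages):
    """Chain stages in the order given, e.g. filter then sample."""
    def run(tweets):
        for stage in stages:
            tweets = stage(tweets)
        return list(tweets)
    return run

# Hypothetical usage: filter out retweets, then show about half of what is left.
rng = random.Random(42)  # seeded only so that runs are repeatable
filter_then_sample = pipeline(drop_if(lambda t: t.startswith("RT ")),
                              sample(0.5, rng))
```

Swapping the order of the arguments to pipeline gives sample then filter, and adding a third stage gives the sample, filter, sample style of combination.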
So what strategies are there..? Note that this isn’t a very principled list (been a long day!), and it is likely to be far from complete, but it’s a start, and something to mull over at least…
Sampling
– display every n’th tweet;
– display the most recently received tweet in the last x seconds every y seconds;
– display any given tweet with fixed probability p;
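The three sampling strategies above might be sketched as follows. This is only a rough illustration: the function names are my own, and for the time-based one I'm assuming each tweet arrives as a (timestamp in seconds, text) pair, sorted by time.

```python
import random

def every_nth(tweets, n):
    """Keep the n'th, 2n'th, 3n'th, ... tweet."""
    return [t for i, t in enumerate(tweets, start=1) if i % n == 0]

def latest_per_tick(tweets, x, y):
    """Every y seconds, show the most recent tweet received in the last x seconds.

    Returns (tick_time, text) pairs; ticks with no recent tweet show nothing.
    """
    shown = []
    if not tweets:
        return shown
    tick, end = tweets[0][0], tweets[-1][0]
    while tick < end + y:
        recent = [t for t in tweets if tick - x < t[0] <= tick]
        if recent:
            shown.append((tick, recent[-1][1]))
        tick += y
    return shown

def with_probability(tweets, p, rng):
    """Keep each tweet independently with fixed probability p."""
    return [t for t in tweets if rng.random() < p]
```

Note that if x is larger than y, the same tweet can be shown on more than one tick; whether that is desirable for captions is a design choice rather than something the sketch settles.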
Historyless Filtering
– filter out retweets (items starting RT);
– filter out tweets sent to a person (tweets starting @). (Note that this does mean we limit the extent to which conversations might be displayed);
– filter tweets based on some function of the number of friends and/or followers a sender has;
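As a sketch of what these historyless filters could look like in Python: each one judges a tweet purely on its own content and metadata. The dictionary fields used here ("text", "followers") are an assumed, simplified shape rather than the actual Twitter API format, and the follower threshold is just one arbitrary example of "some function" of a sender's audience.

```python
def is_retweet(tweet):
    # Old-style retweets start with "RT"; the trailing space avoids
    # catching words that merely begin with those letters.
    return tweet["text"].startswith("RT ")

def is_directed(tweet):
    # Tweets sent to a person start with an @mention.
    return tweet["text"].startswith("@")

def has_audience(tweet, min_followers):
    # One possible function of follower count: a simple threshold.
    return tweet["followers"] >= min_followers

def keep(tweet, min_followers=50):
    """Decide whether to display a tweet using only the tweet itself."""
    return (not is_retweet(tweet)
            and not is_directed(tweet)
            and has_audience(tweet, min_followers))
```

Because keep needs no memory of earlier tweets, it can be applied to a live feed one tweet at a time.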
History-based Filtering
– filter based on the number of tweets the user has already sent;
– filter based on properties of the hashtag network (for example, the number of hashtaggers following an individual). I have classed this as a history-based filter because we need some knowledge of the hashtag community, generated from a history of tweets, in order to calculate hashtag network metrics;
– filter based on the extent to which tweets are apparently part of a conversation thread (e.g. construct a conversation graph in which @a mentions @b and @b mentions @a, and select all conversations greater than a particular length). Note that we might combine this condition with other conditions, such as “where @a and @b share more than m common followers”.
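Here is a rough Python sketch of two of these history-based filters: a per-user cap that remembers how many tweets each sender has already posted, and a conversation selector that looks for reciprocal mentions. The tweet shape (a dict with "user" and "text") and all of the names are assumptions made for illustration, and "conversation length" is taken to mean the number of tweets exchanged between a pair, which is only one possible reading.

```python
import re
from collections import Counter, defaultdict

MENTION = re.compile(r"@(\w+)")

class PerUserLimit:
    """Allow at most max_per_user tweets from any one sender."""
    def __init__(self, max_per_user):
        self.max_per_user = max_per_user
        self.seen = Counter()

    def allow(self, tweet):
        self.seen[tweet["user"]] += 1
        return self.seen[tweet["user"]] <= self.max_per_user

def conversation_tweets(tweets, longer_than):
    """Select tweets from pairs who mention each other, where the pair has
    exchanged more than `longer_than` tweets in total."""
    exchanged = defaultdict(list)  # pair of users -> indexes of their tweets
    senders = defaultdict(set)     # pair of users -> who has done the mentioning
    for i, tweet in enumerate(tweets):
        sender = tweet["user"].lower()
        for name in MENTION.findall(tweet["text"]):
            other = name.lower()
            if other != sender:
                pair = frozenset((sender, other))
                exchanged[pair].append(i)
                senders[pair].add(sender)
    keep = set()
    for pair, indexes in exchanged.items():
        if len(senders[pair]) == 2 and len(indexes) > longer_than:
            keep.update(indexes)
    return [tweets[i] for i in sorted(keep)]
```

PerUserLimit only ever looks backwards, so it suits a live feed; conversation_tweets as written scans the whole list, so it is closer to the archive case, although it could be rerun on the tweets seen so far.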
Note that the filtering approach may be used either to filter out tweets and prevent them from being displayed, or to select tweets that meet a particular set of criteria and so should be displayed. In addition, filtering may be deterministic or combined with a probabilistic sampling mechanism. For example, we may choose to display a tweet with probability p, where p is a function of some ranking factor with value f. An alternative approach might be to generate a score for each tweet based on one or more ranking factors as described in the filter considerations above, rank the tweets by score, and then display the one with the highest score at any given time.
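Both ideas might be sketched in Python along the following lines. The linear mapping from f to p is just one arbitrary choice, "at any given time" is interpreted here as "within each fixed time slot", and the tweet fields ("time", "followers") and function names are assumptions for the sake of the example.

```python
import random

def display_probability(f, f_max):
    """Map a ranking factor f onto a probability in [0, 1] (linear, clamped)."""
    if f_max <= 0:
        return 0.0
    return min(max(f / f_max, 0.0), 1.0)

def sample_by_rank(tweets, score, f_max, rng):
    """Display each tweet with probability p, where p depends on its score."""
    return [t for t in tweets
            if rng.random() < display_probability(score(t), f_max)]

def best_per_slot(tweets, score, slot_seconds):
    """Within each time slot, display only the highest scoring tweet."""
    best = {}
    for t in tweets:
        slot = int(t["time"] // slot_seconds)
        if slot not in best or score(t) > score(best[slot]):
            best[slot] = t
    return [best[s] for s in sorted(best)]

# Hypothetical usage, scoring tweets by the sender's follower count.
by_followers = lambda t: t["followers"]
```

The first function keeps some randomness, so low scoring tweets still get an occasional showing; the second is fully deterministic, which makes it easier to reproduce a caption file but gives the ranking factor the final say.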
The history-based approach may be used in real time, making selections based on the tweets observed (and/or maybe just the tweets displayed) so far (an “until now” history), or, in cases where a Twitter caption file is being generated after the fact, through analysis of the whole hashtag archive corpus (the total archive). So for example, it might be that the caption file is generated after the event for use only by catch-up viewers, with the expectation that live viewers would be able to entertain themselves directly from a live Twitter feed in their own client.