From Elsewhere: Archiving Twitter

Via an Inkdroid post on The Ferguson Principles, this handy suite of tools for archiving and normalising Twitter streams:

  • twarc – a command line tool for collecting tweets from Twitter’s search and streaming APIs; it can also collect threaded conversations and user profile information, and comes with a kitchen sink of utilities contributed by members of the community.
  • Catalog – a clearinghouse of Twitter identifier datasets that live in institutional repositories around the web. These have been collected by folks like the University of North Texas, George Washington University, UC Riverside, University of Maryland, York University, Society of Catalan Archivists, University of Virginia, University of Puerto Rico, North Carolina State University, University of Alberta, Library and Archives Canada, and more.
  • Hydrator – A desktop utility for turning tweet identifier datasets (from the Catalog) back into structured JSON and CSV for analysis. It was designed to be able to run for weeks on your laptop, slowly reassembling a tweet dataset while respecting Twitter’s Terms of Service and users’ right to be forgotten.
  • unshrtn – A microservice that makes it possible to bulk normalize and extract metadata from a large number of URLs.
  • DiffEngine – a utility that tracks changes on a website using its RSS feed, and publishes these changes to Twitter and Mastodon. As an example see whitehouse_diff, which announces changes to the Executive orders made on the White House blog.
  • DocNow – An application (still under development) that allows archivists to observe Twitter activity, do data collection, analyze referenced web content, and optionally send it off to the Internet Archive to be archived.
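By way of illustration, the Hydrator’s slow reassembly relies on Twitter’s statuses/lookup endpoint, which accepts at most 100 tweet ids per call, so hydration is essentially a batching loop over the identifier dataset. A minimal sketch of that batching (illustrative names only; no API calls are made here):

```python
def batch_ids(ids, size=100):
    """Split a list of tweet ids into lookup-sized batches (max 100 per call)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# 250 mock identifiers -> three lookup calls
ids = [str(n) for n in range(250)]
batches = batch_ids(ids)
print(len(batches), [len(b) for b in batches])  # 3 [100, 100, 50]
```

Each batch would then be passed to the lookup endpoint, with the loop throttled to respect rate limits.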

The post further remarks:

These tools emerged as part of doing work with social media archives. Rather than building one tool that attempts to solve some of the many problems of archiving social media, we wanted to create small tools that fit particular problems, and could be composed into other people’s projects and workflows.

Handy…

And of the principles mentioned in the original post title?

  1. Archivists must engage and work with the communities they wish to document on the web. Archives are often powerful institutions. Attention to the positionality of the archive vis-à-vis content creators, particularly in the case of protest, is a primary consideration that can guide efforts at preservation and access.
  2. Documentation efforts must go beyond what can be collected without permission from the web and social media. Social media collected with the consent of content creators can form a part of richer documentation efforts that include the collection of oral histories, photographs, correspondence, and more. Simply telling the story of what happens in social media is not enough, but it can be a useful start.
  3. Archivists should follow social media platforms’ terms of service only where they are congruent with the values of the communities they are attempting to document. What is legal is not always ethical, and what is ethical is not always legal. Context, agency and (again) positionality matter.
  4. When possible, archivists should apply traditional archival practices such as appraisal, collection development, and donor relations to social media and web materials. It is hard work adapting these concepts to the collection of social media content, but they matter now, more than ever.

These arise from trying to address several challenges associated with [p]reserving web and social media content in ethical ways that protect already marginalized people (Documenting the Now Ethics White Paper):

  1. User awareness (or informed consent) of how social media platforms use their data or how it can be collected and accessed by third parties.
  2. Potential for fraudulent use and manipulation of social media content.
  3. Heightened potential of harm for members of marginalized communities when those individuals participate in activities such as protests and other forms of civil disobedience that are traditionally heavily monitored by law enforcement.
  4. Difficulty of applying traditional archival practices to social media content given the sheer volume of data and complicated logistics of interacting with content creators.

The white paper can be found here: Documenting The Now White Paper — Ethical Considerations for Archiving Social Media Content Generated by Contemporary Social Movements: Challenges, Opportunities, and Recommendations [PDF].

Revisiting My Twitter Harvesting Code

Despite having suffered a catastrophic, unrecoverable hard-disk failure on the (unbacked-up) machine that held my Twitter harvesting notebooks (and cached data database), I did manage to find a reasonably current version of the code (via Github gists and Dropbox) and have spent a few evening hours tinkering with it over the last ten days or so.

So as a quick note-to-self, here’s a list of the functions I currently have to hand:

  • search for users using a recent search term: get a list of users recently using a particular term or phrase;
  • search for users using a recent hashtag: get a list of users recently using a particular hashtag;
  • generate maps of folk commonly followed by users of the search term/tag: from the term or tag user list, find the folk commonly followed by those users and generate a network edge list;
  • get members of a list: get a list of the members of a particular list;
  • get lists a person is a member of: get a list of the lists a user is a member of; optionally limit to lists with more than a certain number of followers;
  • triangulate lists: find lists that several specified users are a member of, thresholded (so e.g. lists where at least 3 of 5 people mentioned are on the list); also limit by minimum number of subscribers to list (so we can ignore lists with no subscribers etc). List triangulation can be applied to lists of users e.g. folk using a particular hashtag; so we have a route to finding lists that may be topically related to a particular tag;
  • download members of lists a specified user is a member of: for the lists a particular user is a member of, grab details of all the members of those lists;
  • get all friends/followers of a user: this can be limited to a maximum number of friends/followers (eg 5000);
  • get common friends of (sampled) followers of a user: for a particular user, get their followers, sample N of them, then find folk commonly followed by that sample; output as a graph edge list;
  • find common followers of a set of specified users: for a list of users (e.g. recent users of a particular hashtag), find folk who follow a minimum number of them, or who are followed by a minimum number of them;
  • tag user biographies using Thomson Reuters OpenCalais and IBM Alchemy APIs: this tagging can be easily applied to all users in a list, tagging their biographies one at a time.
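As a quick illustration of the list triangulation step, here’s a sketch over mocked list-membership data (in practice the memberships and subscriber counts would come from the Twitter API; all names here are made up):

```python
from collections import Counter

def triangulate_lists(memberships, min_members=3, min_subscribers=1,
                      subscriber_counts=None):
    """Find lists that at least `min_members` of the given users belong to.

    memberships: dict mapping user -> set of list ids they are a member of.
    subscriber_counts: optional dict mapping list id -> subscriber count.
    """
    counts = Counter()
    for lists in memberships.values():
        counts.update(lists)
    hits = {l: n for l, n in counts.items() if n >= min_members}
    if subscriber_counts is not None:
        # Ignore lists with too few subscribers (eg no subscribers at all)
        hits = {l: n for l, n in hits.items()
                if subscriber_counts.get(l, 0) >= min_subscribers}
    return hits

# Mock data: five users of a hashtag and the lists they sit on
memberships = {
    "userA": {"uk-politics", "data-journos"},
    "userB": {"uk-politics", "foodies"},
    "userC": {"uk-politics", "data-journos"},
    "userD": {"data-journos"},
    "userE": {"uk-politics"},
}
subs = {"uk-politics": 120, "data-journos": 0, "foodies": 15}
print(triangulate_lists(memberships, min_members=3, min_subscribers=1,
                        subscriber_counts=subs))
# {'uk-politics': 4} — data-journos passes the membership threshold
# but has no subscribers, so it is dropped
```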

I’ve also started looking again at generating topic models around Twitter data, starting with user biographies (which, so far, is not very interesting!).

With these various functions, it’s easy enough to generate various combinations of emergent social positioning map. I’ve started exploring various Python libraries for clustering and laying out maps automatically, but tend to fall back to handcrafting the displays using Gephi. On the to-do list is to try to automate the Gephi side, at least for a first pass, using the Gephi Toolkit, though at the moment that looks like it will require me to get my head round a bit of Java. Ideally, I’d like to be able to set up a Gephi endpoint (perhaps from a Gephi headless server running in a Docker container…?:-), give it a graph file and a config file, and get a PDF, SVG or PNG layout back…

I also need to do a couple of proof-of-concept one-off printed outputs for myself, like getting an ESP map printed as an A0 poster or folded map.

Estimated Follower Accession Charts for Twitter

Just over a year ago, Mat Morrison/@mediaczar introduced me to a visualisation he’d been working on (How should Page Admins deal with Flame Wars?) that I started to refer to as an accession chart (Visualising Activity Around a Twitter Hashtag or Search Term Using R). The idea is that we give each entrant into a conversation or group an accession number: the first person has accession number 1, the second person accession number 2, and so on. The accession number is plotted in rank order on the vertical y-axis, with ranked/time-ordered “events” along the horizontal x-axis: utterances in a conversation, for example, or posts to a forum.
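In code terms, accession numbering just means handing out a sequence number to each participant the first time they appear in the time-ordered event stream; a minimal sketch:

```python
def accession_numbers(events):
    """Map each participant to the order in which they first appear.

    events: time-ordered iterable of participant identifiers.
    Returns (accession numbering dict, per-event accession numbers).
    """
    numbering = {}
    per_event = []
    for who in events:
        if who not in numbering:
            numbering[who] = len(numbering) + 1  # first entrant gets 1
        per_event.append(numbering[who])
    return numbering, per_event

# Time-ordered utterances in a mock conversation
stream = ["alice", "bob", "alice", "carol", "bob"]
numbering, ys = accession_numbers(stream)
print(numbering)  # {'alice': 1, 'bob': 2, 'carol': 3}
print(ys)         # [1, 2, 1, 3, 2]
```

Plotting `ys` against event rank order gives the accession chart described above.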

A couple of months ago, I wondered whether this approach might also be used to estimate when folk started following an individual on Twitter. My reasoning went something like this:

One of the things I think is true of the Twitter API call for the followers of an account is that it returns the list of followers in reverse accession order. So the person who followed the account most recently will be at the top of the list (the first to be returned) and the person who followed first will be at the end of the list. Unfortunately, we don’t know when followers joined, so it’s hard to spot bursty growth in the number of followers of an account. However, it struck me that we may be able to get a bound on this by looking at the dates on which followers joined Twitter, along with their ‘accession order’ as followers of the account. If we get the list of followers and reverse it, and assume that this gives an ordered list of followers (with the follower that started following the longest time ago first), we can then work through this list and keep track of the most recent ‘created_at’ date seen so far. This gives us an upper bound (most recent date) for when followers that far through the list started following. (You can’t start following until you join Twitter…)

So for example, if followers A, B, C, D in that accession order (ie started following target in that order) have user account creation dates 31/12/09, 1/1/09, 15/6/12, 5/5/10 then:
– A started following no earlier than 31/12/09 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
– B started following no earlier than 31/12/09 (because they started following after A)
– C started following no earlier than 15/6/12 (because that’s when they joined Twitter and it’s the most recent creation date we’ve seen so far)
– D started following no earlier than 15/6/12 (because they started following after C, which gave us the most recent creation date seen so far)
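The bound can be computed as a running maximum over account creation dates taken in follower-accession order (the R code later in the post does the equivalent with cummin over days-ago values); here’s the worked example above in a few lines of Python:

```python
from datetime import date
from itertools import accumulate

# Account creation dates for followers A, B, C, D, in accession order
created = [date(2009, 12, 31), date(2009, 1, 1),
           date(2012, 6, 15), date(2010, 5, 5)]

# Upper bound on when each follower could have started following:
# the most recent creation date seen so far
bounds = list(accumulate(created, max))
print([d.isoformat() for d in bounds])
# ['2009-12-31', '2009-12-31', '2012-06-15', '2012-06-15']
```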

That’s probably confused you enough, so here’s a chart – accession number is along the bottom (i.e. the x-axis), joining date (in days ago) is on the y-axis:

recencyVacc

NOTE: this diverges from the accession graph described above, where accession number goes on the y-axis and rank ordered event along the x-axis.

What the chart shows is an estimate (the red line) of how many days ago a follower with a particular accession number started to follow a particular Twitter account.

As described in Sketches Around Twitter Followers, we see a clear break at 1500 days ago when Twitter started to get popular. This approach also suggests a technique for creating “follower probes” that we can use to date a follower record: if you know which day a particular user followed a target account, you can use that follower to put a datestamp into the follower record (assuming the Twitter API returned followers in reverse accession order).

Here’s an example of the code I used based on Twitter follower data grabbed for @ChrisPincher (whose follower profile appeared to be out of sorts from the analysis sketched in Visualising Activity Around a Twitter Hashtag or Search Term Using R). I’ve corrected the x/y axis ordering so follower accession number is now the vertical, y-component.

require(ggplot2)

processUserData = function(data) {
    # Parse the account creation timestamp
    data$tz = as.POSIXct(data$created_at)
    # Days since each follower's account was created
    data$days = as.integer(difftime(Sys.time(), data$tz, units = "days"))
    # Reverse the rows so the earliest follower comes first (accession order)
    data = data[rev(rownames(data)), ]
    # Assign accession numbers 1..N
    data$acc = 1:length(data$days)
    # Running minimum of days-ago, i.e. the most recent creation date seen
    # so far: an upper bound on when each follower started following
    data$recency = cummin(data$days)

    data
}

mp_cp <- read.csv("~/code/MPs/ChrisPincher_fo_0__2013-02-16-01-29-28.csv", row.names = NULL)

ggplot(processUserData(mp_cp)) + geom_point(aes(x = -days, y = acc), size = 0.4) + geom_point(aes(x = -recency, y = acc), col = "red", size = 1) + xlim(-2000, 0)

Here’s @ChrisPincher’s chart:

cp_demo

The black dots reveal how many days ago a particular follower joined Twitter. The red line is the estimate of when a particular follower started following the account, estimated based on the most recently created account seen to date amongst the previously acceded followers.

We see steady growth in follower numbers to start with, and then the account appears to have been spam followed? (Can you spot when?!;-) The clumping of creation dates of accounts during the attack also suggests they were created programmatically.

[In the “next” in this series of posts [What Happened Then? Using Approximated Twitter Follower Accession to Identify Political Events], I’ll show how spikes in follower acquisition on a particular day can often be used to “detect” historical news events.]

PS after coming up with this recipe, I did a little bit of “scholarly research” and I learned that a similar approach for estimating Twitter follower acquisition times had already been described at least once, at the opening of this paper: We Know Who You Followed Last Summer: Inferring Social Link Creation Times In Twitter – “We estimate the edge creation time for any follower of a celebrity by positing that it is equal to the greatest lower bound that can be deduced from the edge orderings and follower creation times for that celebrity”.

Sketches Around Twitter Followers

I’ve been doodling… Following a query about the possible purchase of Twitter followers for various public figure accounts (I need to get my head round what the problem is with that exactly?!), I thought I’d have a quick look at some stats around follower groupings…

I started off with a data grab, pulling down the IDs of accounts on a particular Twitter list and then looking up the user details for each follower. This gives summary data such as the number of friends, followers and status updates; a timestamp for when the account was created; whether the account is private or not; the “location”, as well as a possibly more informative timezone field (you may tell fibs about the location setting but I suspect the casual user is more likely to set a timezone appropriate to their locale).

So what can we do with that data? Simple scatter plots, for one thing – here’s how friends vs. followers distribute for MPs on the Tweetminster UKMPs list:

ukMPS_frfo_scatter

We can also see how follower numbers are distributed across those MPs, for example, which looks reasonable and helps us get our eye in…:

ukMPS_fo_dist

We can also calculate ratios and then plot them – for example, followers per day (the number of followers divided by the number of days since the account was registered) vs followers per status update (to try to get a feeling for how the number of followers relates to the number of tweets):

ukMPs_foday_fost

This particular view shows a few outliers, and allows us to spot a couple of accounts that have recently had a ‘change of use’.
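The ratios themselves are simple to derive from the user summary data; a minimal sketch over mocked records (the field names loosely mirror the Twitter user object):

```python
# Mock user summary records (field names loosely follow the Twitter user object)
users = [
    {"screen_name": "mpA", "followers_count": 12000,
     "statuses_count": 4000, "days_on_twitter": 1500},
    {"screen_name": "mpB", "followers_count": 300,
     "statuses_count": 50, "days_on_twitter": 100},
]

for u in users:
    # Followers gained per day since the account was registered
    u["fo_per_day"] = u["followers_count"] / u["days_on_twitter"]
    # Followers per status update
    u["fo_per_status"] = u["followers_count"] / u["statuses_count"]

print([(u["screen_name"], u["fo_per_day"], u["fo_per_status"]) for u in users])
# [('mpA', 8.0, 3.0), ('mpB', 3.0, 6.0)]
```

Scatterplotting the two derived columns against each other gives the view above.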

As well as looking at the stats across the set of MPs, we can pull down the list of followers of a particular account (or sample thereof – I grabbed the lesser of all followers or 10,000 randomly sampled followers from a target account) and then look at the summary stats (number of followers, friends, date they joined Twitter, etc) over those followers.

So for example, things like this – a scatterplot of friends/follower counts similar to the one above:

friendsfollowers

…sort of. There’s something obviously odd about that graph, isn’t there? The “step up” at a friends count of 2000. This is because Twitter imposes, in most cases, a limit of 2000 friends on an account.

How about the followers per day for an account versus the number of days that account has been on Twitter, with outliers highlighted?

foperday_days

Alternatively, we can do counts by number of days the followers have been on Twitter:

Rplot

The bump around 1500 days ago corresponds to Twitter getting suddenly popular around then, as this chart from Google Trends shows:

gtrends

Sometimes, you get a distribution that is very, very wrong… Suppose we build a histogram with bins along the x-axis for followers with 0-100 followers of their own, 100-200 followers, and so on. For all the followers of a particular account, we pop each one into the bin corresponding to their own follower count, then count the number of people in each bin once we have allocated them all. We might normally expect to see something like this:

normally log followers

However, if an account is followed by lots of followers that have zero or very few followers of their own, we get a skewed distribution like this:

a dodgy follower distribution

There’s obviously something not quite, erm, normal(?!) about this account (at least, at the time I grabbed the data, there was something not quite normal etc etc…).
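The binning procedure described above amounts to a fixed-width histogram over the followers’ own follower counts; a sketch with mock data (a “dodgy” account would pile up in the 0-99 bin):

```python
from collections import Counter

def bin_follower_counts(counts, width=100):
    """Count how many followers fall into each [n*width, (n+1)*width) bin."""
    bins = Counter(c // width for c in counts)
    return {f"{b * width}-{(b + 1) * width - 1}": n
            for b, n in sorted(bins.items())}

# Mock follower-of-follower counts
counts = [0, 3, 7, 12, 150, 180, 520, 560, 590, 1200]
print(bin_follower_counts(counts))
# {'0-99': 4, '100-199': 2, '500-599': 3, '1200-1299': 1}
```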

When we get stats from the followers of a set of folk, such as the members of a list, we can generate summary statistics over the sets of followers of each person on the list – for example, the median number of followers, or different ratios (eg mean of the friend/follower ratios for each follower). Lots of possible stats – but which ones does it make sense to look at?

Here’s one… a plot of the median followers per status ratio versus the median friend/follower ratio:

fostvfrfo

Spot the outlier ;-)

So that’s a quick review of some of the views we can get from data grabs of the user details from the followers of a particular account. A useful complement to the social positioning maps I’ve also been doing for some time:

davidevennett

It’s just a shame that my whitelisted Twitter API key is probably going to die in a few weeks:-(

[In the next post in this series I’ll describe a plot that estimates when folk started following a particular account, and demonstrate how it can be used to identify notable “events” surrounding the person being followed…]

Interest Differencing: Folk Commonly Followed by Tweeting MPs of Different Parties

Earlier this year I doodled a recipe for comparing the folk commonly followed by users of a couple of BBC programme hashtags (Social Media Interest Maps of Newsnight and BBCQT Twitterers). Prompted in part by a tweet from Michael Smethurst/@fantasticlife about generating an ESP map for UK politicians (something I’ve also doodled before – Sketching the Structure of the UK Political Media Twittersphere) I drew on the @tweetminster Twitter lists of MPs by party to generate lists of folk commonly followed by the MPs of each party.

Using the R wordcloud library commonality and comparison clouds, we can get a visual impression of folk commonly followed in significant numbers by all the MPs of the three main parties, as well as the folk the MPs of each party follow significantly and differentially to the other parties:

There’s still a fair bit to do to make the methodology robust. For example, it needs to be able to cope with comparing folk commonly followed by sets of users of significantly different sizes (there is a large difference between the number of tweeting Conservative and LibDem MPs, for instance). I’ve also noticed that repeatedly running the comparison.cloud code turns up different clouds, so there’s some element of randomness in there. I guess this just adds to the “sketchy” nature of the visualisation; or maybe it hints at a technique akin to the way a photographer will take multiple shots of a subject before picking one or two to illustrate something in particular. Which is to say: the “truthiness” of the image reflects the message that you are trying to communicate. The visualisation in this case exposes a partial truth (which is to say, no absolute truth), or a particular perspective, about the way different groups differentially follow folk on Twitter.

A couple of other quirks I’ve noticed about the comparison.cloud as currently defined: firstly, very highly represented friends are sized too large to appear in the cloud (which is why very commonly followed folk across all sets – the people that appear in the commonality cloud – tend not to appear); there must be a better way of handling this? Secondly, if one person is represented so highly in one group that they don’t appear in the cloud for that group, they may appear elsewhere in the cloud. So for example, I tried plotting clouds for folk commonly followed by a sample of the followers of @davegorman, as well as the people commonly followed by the friends of @davegorman – and @davegorman appeared as a small label in the friends part of the comparison.cloud (notwithstanding the fact that all the followers of @davegorman follow @davegorman, but not all his friends do). What might make more sense would be to suppress the display of a label in the colour of a particular group if that label has a higher representation in any of the other groups (and isn’t displayed there because it would be too large).

That said, as a quick sketch, I think there’s some information being revealed there (the coloured comparison.cloud seems to pull out some names that make sense as commonly followed folk peculiar to each party…). One way forward is to start picking apart the comparison.cloud code; another is to explore a few more comparison sets. Suggestions welcome as to what they might be…:-)
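Underneath the commonality and comparison clouds is just a count of how often each account is followed by members of each group; a language-agnostic sketch of those differential-follow counts (mock data, all names hypothetical):

```python
from collections import Counter

def follow_counts(group):
    """Count, per friend account, how many members of the group follow them."""
    c = Counter()
    for friends in group.values():
        c.update(friends)
    return c

# Mock friend lists for two small groups of MPs
party_a = {"mp1": {"news1", "pundit1"}, "mp2": {"news1", "wonk1"}}
party_b = {"mp3": {"news1", "pundit2"}, "mp4": {"news1", "pundit2"}}

ca, cb = follow_counts(party_a), follow_counts(party_b)
common = set(ca) & set(cb)                                  # followed by both
diff_a = {f: n for f, n in ca.items() if n > cb.get(f, 0)}  # peculiar to A
diff_b = {f: n for f, n in cb.items() if n > ca.get(f, 0)}  # peculiar to B
print(common)           # {'news1'}
print(sorted(diff_a))   # ['pundit1', 'wonk1']
print(diff_b)           # {'pundit2': 2}
```

The commonality cloud sizes labels by the `common` counts; the comparison cloud sizes them by how much more one group follows an account than the others do.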

PS by the by, I notice via the Guardian datablog (Church vs beer: using Twitter to map regional differences in US culture) another Twitter based comparison project – Church or Beer? Americans on Twitter – which looked at geo-coded Tweets over a particular time period on a US state-wide basis and counted the relative occurrence of Tweets mentioning “church” or “beer”…

Twitter Volume Controls

With a steady stream of tweets coming out today containing local election results, @GuardianData (as @datastore was recently renamed) asked whether or not regular, stream-swamping updates were in order:

A similar problem can occur when folk are livetweeting an event – for a short period, one or two users can dominate a stream with a steady outpouring of live tweets.

Whilst I’m happy to see the stream, I did wonder about how we could easily wrangle a volume control, so here are a handful of possible approaches:

  • Tweets starting @USER ... are only seen in the stream of people following both the sender of the tweet and @USER. So if @GuardianData set up another, non-tweeting account, @GuardianDataBlitz, and sent election results to that account (“@GuardianDataBlitz Mayor referendum results summary: Bradford NO (55.13% on ), Manchester NO (53.24%), Coventry NO (63.58%), Nottingham NO (57.49%) #vote2012”, for example), only @GuardianData followers who also follow @GuardianDataBlitz would see the result. There are a couple of problems with this approach, of course: for one, @GuardianDataBlitz takes up too many characters (although that can easily be addressed), but more significantly it means that most followers of @GuardianData will miss out on the data stream. (They can’t be expected to know about the full-fat feed switch.)
  • For Twitter users using a Twitter client that supports global filtering of tweets across all streams within a client, we may be able to set up a filter to exclude tweets of the form (@GuardianData AND #vote2012). This is a high maintenance approach, though, and will lead to the global filter getting cluttered over time, or at least requiring maintenance.
  • The third approach – again targeted at folk who can set up global filters – is for @GuardianData to include a volume control in their tweets, eg Mayor referendum results summary: Bradford NO (55.13% on ), Manchester NO (53.24%), Coventry NO (63.58%), Nottingham NO (57.49%) #vote2012 #blitz. Now users can set a volume control by filtering out tweets tagged #blitz. To remind people that they have a volume filter in place, @GuardianData could occasionally post blitz items with #inblitz to show folk who have the filter turned on what they’re missing? Downsides to this approach are that it pollutes the tweets with more confusing metadata and maybe confuses folk about what hashtag is being used.
  • A more generic approach might be to use a loudness indicator or channel that can be filtered against – for example, channel 11: ^11 or ^loud (reusing the ^ convention that is used to identify individuals tweeting on a team account). Reminders to folk who may have a volume filter set could take the form ^on11 or ^onloud on some of the tweets. Semantic channels might also be defined: ^ER (Election Results), ^LT (Live Tweets) etc, again with occasional reminders to folk who’ve set filters (^onLT, etc, or “We’re tweeting local election results on the LT ^channel today”). Again, this is a bit of a hack that’s only likely to appeal to “advanced” users and does require them to take some action; I guess it depends whether the extra clutter is worth it?
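The filter-based approaches above boil down to a predicate applied client-side to each incoming tweet; a rough sketch (the tags and ^channel convention are the hypothetical ones from this post, not any real client’s syntax):

```python
import re

def volume_filter(text, sender, muted_sender=None, muted_tags=(),
                  muted_channels=()):
    """Return True if a tweet should be hidden by the volume filter."""
    # Mute a (sender AND tag) combination, eg (@GuardianData AND #vote2012)
    if muted_sender and sender == muted_sender and any(t in text for t in muted_tags):
        return True
    # Mute a ^channel marker, eg ^loud or ^11
    if any(re.search(r"\^" + re.escape(ch) + r"\b", text) for ch in muted_channels):
        return True
    return False

print(volume_filter("Bradford NO ... #vote2012", "GuardianData",
                    muted_sender="GuardianData", muted_tags=("#vote2012",)))  # True
print(volume_filter("Coventry NO ^loud", "GuardianData",
                    muted_channels=("loud",)))                                # True
print(volume_filter("Election liveblog is up", "GuardianData",
                    muted_sender="GuardianData", muted_tags=("#vote2012",)))  # False
```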

So – any other volume control approaches I’ve missed?

PS by the by, here’s a search query (just for @daveyp;-) that I’ve been using to try to track results as folk tweet them:

-RT (#atthecount OR #vote2012 OR #le2012) AND (gain OR held OR los OR hold) AND (con OR lib OR lab OR ukip)

I did wonder about trying to parse out ward names to try to automate the detection of possible results as they appeared in the stream, but opted to go to bed instead! It’s something I could imagine trying to work up on Datasift, though…
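The search query can also be mirrored as a local predicate over tweet text, which would be a first step towards that sort of automated result detection; a deliberately naive sketch:

```python
RESULT_STEMS = ("gain", "held", "los", "hold")   # "los" matches lost/loses etc.
PARTY_STEMS = ("con", "lib", "lab", "ukip")
TAGS = ("#atthecount", "#vote2012", "#le2012")

def looks_like_result(tweet):
    """Naive local mirror of the Twitter search query used above."""
    t = tweet.lower()
    if t.startswith("rt "):   # the -RT term: skip retweets
        return False
    return (any(tag in t for tag in TAGS)
            and any(s in t for s in RESULT_STEMS)
            and any(s in t for s in PARTY_STEMS))

print(looks_like_result("Lab gain Ipswich from Con #vote2012"))      # True
print(looks_like_result("RT @someone: Lab gain Ipswich #vote2012"))  # False
print(looks_like_result("Counting underway in Ipswich #vote2012"))   # False
```

Naive substring matching like this will throw up false positives, of course, which is where parsing out ward names would come in.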

Doodling With a Conversation, or Retweet, Data Sketch Around LAK12

How can we represent conversations between a small sample of users, such as the email or SMS conversations between James Murdoch’s political lobbyist and a Government minister’s special adviser (Leveson inquiry evidence), or the pattern of retweet activity around a couple of heavily retweeted individuals using a particular hashtag?

I spent a bit of time on-and-off today mulling over ways of representing this sort of interaction, in search of something like a UML call sequence diagram but not, and here’s what I came up with in the context of the retweet activity:

The chart looks a bit complicated at first, but there’s a lot of information in there. The small grey dots on their own are tweets from a body of tweets obtained via a Twitter search around a particular hashtag that aren’t identified as RTs (that is, they don’t start with a pattern something like RT @[^:]*:). The x-axis represents the time a tweet was sent and the y-axis who sent it. Paired dots connected by a vertical line segment show two people, one of whom (light grey point) retweeted the other (dark grey point). RTs of two notable individuals are highlighted using different colours. The small black dots highlight original tweets sent by the individuals whose retweets we highlight. Whilst we can’t tell which tweet was retweeted, we may get an idea of how the pattern of RT behaviour related to the individuals of interest plays out relative to when they actually tweeted.
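The RT-detection pattern mentioned above (RT @[^:]*:) can be used directly to pull out who was retweeted; a minimal sketch:

```python
import re

RT_PATTERN = re.compile(r"^RT @([^:]*):")

def rt_of(text):
    """Return the screen name of the retweeted user, or None for non-RTs."""
    m = RT_PATTERN.match(text)
    return m.group(1) if m else None

print(rt_of("RT @gsiemens: New post on learning analytics"))  # gsiemens
print(rt_of("Enjoying the #lak12 keynote"))                   # None
```

Applying this to each tweet gives the rtof column the R code below relies on.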

Here’s the R-code used to build up the chart. Note that the order in which the layers are constructed is important (for example, we need the small black dots to be in the top layer).

##RT chart, constructed in R using ggplot2
require(ggplot2)
#the base data set - exclude tweets that aren't RTs
g = ggplot(subset(tw.df.rt,subset=(!is.na(rtof))))
#Add in vertical grey lines connecting who RT'd whom
g = g + geom_linerange(aes(x=created,ymin=screenName,ymax=rtof),colour='lightgrey')
#Use a more complete dataset to mark *all* tweets with a lightgrey point
g = g + geom_point(data=(tw.df),aes(x=created,y=screenName),colour='lightgrey')
#Use points at either end of the RT line segment to distinguish who RTd whom
g = g + geom_point(aes(x=created,y=screenName),colour='lightgrey') + geom_point(aes(x=created,y=rtof),colour='grey') + theme(axis.text.y=element_text(size=5))
#We're going to highlight RTs of two particular individuals
#Define a couple of functions to subset the data
subdata.rtof=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & rtof==u)))
subdata.user=function(u) return(subset(tw.df.rt,subset=(!is.na(rtof) & screenName==u)))
#Grab user 1
s1='gsiemens'
ss1=subdata.rtof(s1)
ss1x=subdata.user(s1)
sc1='aquamarine3'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss1,aes(x=created,ymin=screenName,ymax=rtof),colour=sc1)
#Grab user 2
s2='busynessgirl'
ss2=subdata.rtof(s2)
ss2x=subdata.user(s2)
sc2='orange'
#Highlight the RT lines associated with RTs of this user
g = g + geom_linerange(data=ss2,aes(x=created,ymin=screenName,ymax=rtof),colour=sc2)
#Now we add another layer to colour the nodes associated with RTs of the two selected users
g = g + geom_point(data=ss1,aes(x=created,y=rtof),colour=sc1) + geom_point(data=ss1,aes(x=created,y=screenName),colour=sc1)
g = g + geom_point(data=ss2,aes(x=created,y=rtof),colour=sc2) + geom_point(data=ss2,aes(x=created,y=screenName),colour=sc2)
#Finally, add a highlight to mark when the RTd folk we are highlighting actually tweet
g = g + geom_point(data=(ss1x),aes(x=created,y=screenName),colour='black',size=1)
g = g + geom_point(data=(ss2x),aes(x=created,y=screenName),colour='black',size=1)
#Print the chart
print(g)

One thing I’m not sure about is the order of names on the y-axis. That said, one advantage of the conversational, exploratory data visualisation approach that I favour is that if you let your eyes seek out patterns, you may pick up clues for some sort of model around the data that really emphasises those patterns. So for example, looking at the chart, I wonder if there would be any merit in organising the y-axis so that folk who RTd orange but not aquamarine were in the bottom third of the chart, folk who RTd aqua but not orange were in the top third, folk who RTd both orange and aqua were between the two users of interest, and folk who RTd neither were arranged closer to the edges, with folk who RTd each other close to each other (invoking an ink minimisation principle)?

Something else that it would be nice to do would be to use the time an original tweet was sent as the x-axis value for the tweet marker for the original sender of a tweet that is RTd. We would then get a visual indication of how quickly a tweet was RTd.

PS I also created a script that generated a wealth of other charts around the lak12 hashtag [PDF]. The code used to generate the report can be found as the file exampleSearchReport.Rnw in this code repository.