Finding Common Terms around a Twitter Hashtag

@aendrew sent me a link to a StackExchange question he’s just raised, in a tweet asking: “Anyone know how to find what terms surround a Twitter trend/hashtag?”

I’ve dabbled in this area before, though not addressing this question exactly, using Yahoo Pipes to find what hashtags are being used around a particular search term (Searching for Twitter Hashtags and Finding Hashtag Communities) or by members of a particular list (What’s Happening Now: Hashtags on Twitter Lists; that post also links to a pipe that identifies names of people tweeting around a particular search term).

So what would a pipe that finds terms surrounding a Twitter hashtag need to do?

Firstly, we need to search on the tag to pull back a list of tweets containing that tag. Then we need to split the tweets into atomic elements (i.e. separate words). At this point, it might be useful to count how many times each one occurs, and display the most popular. We might also need to generate a “stop list” containing common words we aren’t really interested in (for example, “the” or “and”).

So here’s a quick hack at a pipe that does just that (Popular words round a hashtag).

For a start, I’m going to construct a string tokeniser that just searches for 100 tweets containing a particular search term, and then splits each tweet up into separate words, where words are things that are separated by white space. The pipe output is just a list of all the words from all the tweets that the search returned:

Twitter string tokeniser
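
For anyone who prefers code to pipes, the tokenising step can be sketched in a few lines of Python. This is only an illustration of the idea, not what the pipe does internally: the function name tokenise_tweets and the sample tweets are made up, and the Twitter search itself is left out.

```python
def tokenise_tweets(tweets):
    """Split each tweet on white space and return one flat list of words."""
    words = []
    for tweet in tweets:
        # str.split() with no argument splits on any run of white space
        words.extend(tweet.split())
    return words


# Hypothetical sample data standing in for the search results
sample_tweets = [
    "Trying out #pipes for a quick hack",
    "Another  #pipes   demo",
]
print(tokenise_tweets(sample_tweets))
```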

You might notice the pipe also allows us to choose which page of results we want…

We can now use the helper pipe in another pipe. Firstly, let’s grab the words from a search that returns 200 tweets on the same search term. The helper pipe is called twice, once for the first page of results, once for the second page of results. The wordlists from each search query are then merged by the union block. The Rename block relabels the .content attribute as the .title attribute of each feed item.

Grab 200 tweets and check we have set the title element

The next thing we’re going to do is identify and count the unique words in the combined wordlist using the Unique block, and then sort the list according to the number of times each word occurs.

Preliminary parsing of a wordlist
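
The count-and-sort step has a close analogue in Python’s standard library. The sketch below assumes the wordlist is already a plain list of strings; count_words is a made-up helper name, and the matching here is exact (case-sensitive), which may or may not be how the Unique block behaves.

```python
from collections import Counter


def count_words(words):
    """Return (word, count) pairs, most frequent first."""
    return Counter(words).most_common()


# Hypothetical wordlist
words = ["pipes", "hack", "pipes", "twitter", "hack", "pipes"]
for word, count in count_words(words):
    print(word, count)
```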

The above pipe fragment also filters the wordlist so that only words that contain alphabetic characters and are four or more characters long are allowed through. (The regular expression .{4,} reads: allow any string of four or more ({4,}) characters of any type (.). An expression .{5,7} would say: allow words through with a length of 5 to 7 characters.)
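
Here is a rough Python equivalent of those two filters. The length pattern .{4,} is the one quoted above; the pattern used to test for alphabetic characters is my assumption (“contains at least one letter”), and keep_word is a made-up name.

```python
import re

HAS_LETTER = re.compile(r"[A-Za-z]")   # assumption: at least one letter
FOUR_OR_MORE = re.compile(r".{4,}")    # four or more characters of any type


def keep_word(word):
    """True if the word contains a letter and is at least four characters long."""
    return bool(HAS_LETTER.search(word)) and bool(FOUR_OR_MORE.fullmatch(word))


# Hypothetical wordlist
words = ["the", "pipes", "1234", "#tag", "RT", "hashtag"]
print([w for w in words if keep_word(w)])
```

Swapping the second pattern for .{5,7} would let through only words of five to seven characters.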

I’ve also added a short routine that implements a stop list. The regular expression pattern (?i)\b(word1|word2|word3)\b says: ignoring case ((?i)), try to match any of the words word1, word2, word3. (\b denotes a word boundary.) Note that in the filter below, some of the words in my stop list are redundant: the ones with three or fewer characters. (Remember, we have already filtered the word list to show only words of four or more characters.)

Stop list

I also added a user input that allows additional stop terms to be added (they should be pipe (|) separated, with no spaces between them). You can find the pipe here.
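
As a sketch of how the stop list and the extra user-supplied terms fit together, here is the same pattern built in Python. The function name filter_stop_words and the sample words are invented for illustration; only the shape of the regular expression comes from the pipe.

```python
import re


def filter_stop_words(words, stop_words, extra=""):
    """Drop any word matching the stop list; extra is a pipe-separated string."""
    terms = [t for t in list(stop_words) + extra.split("|") if t]
    if not terms:
        return list(words)
    # re.escape guards against characters that are special in a regex
    alternation = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"(?i)\b(" + alternation + r")\b")
    return [w for w in words if not pattern.search(w)]


# Hypothetical wordlist and stop terms
words = ["This", "pipes", "with", "hashtag", "from", "them"]
print(filter_stop_words(words, ["this", "that", "with", "the"], extra="from|have"))
```

Note that “them” survives even though “the” is in the stop list, because \b only matches at a word boundary.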

Searching for Twitter Hashtags and Finding Hashtag Communities

Over the last few weeks I’ve been messing around more than I should with Twitter, and in particular trying to get a feel for how we might use hashtag communities as a way of identifying and growing community structures in a particular topic area (see posts all over for more details).

A couple of days ago, @clarileia raised the question of how you find new hashtags, so I had a little tinker today putting together a couple of hacks (Twitter hashtag search pipe and Twitter my network hashtags) that let you identify recently used Twitter hashtags associated with a particular search term, or with a specified user’s recent friends or followers.

Twitter hashtag search

[Note: at the time of writing, Pipes appears to be running a little slow… if the Pipe appears to stall, it does work, honest… try it again later ;-)]

At the core of all the hacks is a clunky hashtag tokeniser pipe that takes a Twitter status update and pulls out the hashtags:

This utility pipe works by taking the status update, extracting the hashtags using a hacked together regular expression, splitting the separate hashtags into separate feed items, and then filtering them to emit only legitimate hashtags.
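
The regular expression used in the pipe isn’t reproduced here, so the Python sketch below uses a stand-in pattern of my own: a # that doesn’t follow a word character, followed by one or more word characters. Treat that as an assumption about what counts as a “legitimate” hashtag; extract_hashtags is a made-up name too.

```python
import re

# Assumed pattern: '#' not preceded by a word character, then word characters
HASHTAG = re.compile(r"(?<!\w)#\w+")


def extract_hashtags(status):
    """Return the hashtags found in a single status update, in order."""
    return HASHTAG.findall(status)


# Hypothetical status update
status = "Playing with #pipes and #twitter, see http://example.com/page#anchor"
print(extract_hashtags(status))
```

The lookbehind is there so that a # in the middle of a URL or a word is not picked up as a tag.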

The utility pipe is then used in a search powered pipe, which searches Twitter for the 100 most recent tweets containing the search terms and then scans those for hashtags; and a ‘personal network hashtags’ pipe that takes a Twitter username, pulls back the tweets from their one hundred most recent friends and their one hundred most recent followers, and then scans those tweets for hashtags.

For example, here’s the search pipe:

Both pipes have a common output routine – the list of hashtags is filtered through the Unique block, which also returns a count of how many times each hashtag has appeared. The hashtags are then ordered and filtered according to the minimum number of required occurrences in the sample. A regular expression adds the number of occurrences of each hashtag.
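
A Python sketch of that output routine might look like the following. The label format (“#tag (n)”), the default threshold and the name summarise_hashtags are all my own choices for illustration, not something taken from the pipes.

```python
from collections import Counter


def summarise_hashtags(hashtags, min_count=2):
    """Count each hashtag, keep those seen at least min_count times,
    and return labels of the form '#tag (n)', most frequent first."""
    counts = Counter(hashtags)
    return [f"{tag} ({n})" for tag, n in counts.most_common() if n >= min_count]


# Hypothetical list of hashtags pulled out of a batch of tweets
tags = ["#pipes", "#twitter", "#pipes", "#hack", "#twitter", "#pipes"]
print(summarise_hashtags(tags))
```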

The pipes could be extended to pull in more search results, or more followers/friends (maybe the first hundred friends/followers as well as the most recent hundred?) but that’s left as an exercise for the reader. As for the use case – I dunno? Maybe integration with the OUseful TwitterMyHashtag apps? Or perhaps @clarileia had a use case in mind?!;-)

PS thanks to PJ on the Yahoo Pipes team for getting back to me earlier today when I was struggling with a slow running pipes editor… I’m now totally reliant on Pipes for many apps, and especially for the majority of my rapid prototyping, so when Pipes is slow, I feel as happy as if I’ve lost an unbacked-up server… Brian Kelly would probably tell me I need to do a risk assessment… I’ve already done one: What Happens If Yahoo! Pipes Dies? – but I haven’t made a start on the contingency stuff that was considered there…