Fragments… Obtaining Individual Photo Descriptions from flickr Sets

I think I’ve probably left it too late to think up some sort of hack for the UK Discovery developer competition, but here’s a fragment that might provide a starting point for someone else… How to use a Yahoo pipe to grab a list of photos in a particular flickr set (such as one of the sets posted by the UK National Archive to the flickr commons)

The recipe makes use of two calls to the flickr api: one to get the a list of photos in a particular set, the second, repeatedly made call, to grab details down for each photo in the set.

In pseudo-code, we would write the algorithm along the lines of:

get list of photos in a given flickr set
for each photo in the set:
  get the details for the photo

Here’s the pipe:

Example of calling flickr api to obtain descriptions of photos in a flickr set

The first step is construct a call to the flickr API to pull down the photos in a given set. The API is called via a URI of the form:

The API returns a JSON object containing separate items identifying each photo in the set.

The rename block constructs a new attribute for each photo item (detailURI) containing the corresponding photo ID. The RegEx block applies a regular expression to each item’s detailURI attribute to transform it to a URI that calls the flickr API for details of a particular phot, by photo id. The call this time is of the form:

Finally, the Loop block runs through each item in the original set, calls the flickr API using the detailURI to get the details for the corresponding photo, and replaces each item with the full details of each photo.

flickr api - photo details

You can find the pipe here: grabbing photo details for photos in a flickr set

An obvious next step might be to enrich the phot decriptions with semantic tags using something like the Reuters OpenCalais service. On a quick demo, this didn’t seem to work in the pipes context (I wonder if there is Calais API throttling going on, or maybe a timeout?) but I’ve previously posted a recipe using Python that shows how to call the Open Calais service in a Python context: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags.

Again in pseudo code, we might do something like:

get JSON feed out of Yahoo pipe
for each item:
  call the Calais API against the description element

We could then index the description, title, and semantic tags for each photo and use them to support search and linking between items in the set.

Other pivots potentially arise from identifying whether photos are members of other sets, and using the metadata associated with those other sets to support discovery, as well as information contained within comments associate with each photo.

Finding Common Terms around a Twitter Hashtag

@aendrew sent me a link to a StackExchange question he’s just raised, in a tweet asking: “Anyone know how to find what terms surround a Twitter trend/hashtag?”

I’ve dabbled in this area before, though not addressing this question exactly, using Yahoo Pipes to find what hashtags are being used around a particular search term (Searching for Twitter Hashtags and Finding Hashtag Communities) or by members of a particular list (What’s Happening Now: Hashtags on Twitter Lists; that post also links to a pipe that identifies names of people tweeting around a particular search term.).

So what would we need a pipe to do that finds terms surrounding a twitter hashtag?

Firstly, we need to search on the tag to pull back a list of tweets containing that tag. Then we need to split the tweets into atomic elements (i.e. separate words). At this point, it might be useful to count how many times each one occurs, and display the most popular. We might also need to generate a “stop list” containing common words we aren’t really interested in (for example, the or and.

So here’s a quick hack at a pipe that does just that (Popular words round a hashtag).

For a start, I’m going to construct a string tokeniser that just searches for 100 tweets containing a particular search term, and then splits each tweet up in separate words, where words are things that are separated by white space. The pipe output is just a list of all the words from all the tweets that the search returned:

Twitter string tokeniser

You might notice the pipe also allows us to choose which page of results we want…

We can now use the helper pipe in another pipe. Firstly, let’s grab the words from a search that returns 200 tweets on the same search term. The helper pipe is called twice, once for the first page of results, once for the second page of results. The wordlists from each search query are then merged by the union block. The Rename block relabels the .content attribute as the .title attribute of each feed item.

Grab 200 tweets and check we have set the title element

The next thing we’re going to do is identify and count the unique words in the combined wordlist using the Unique block, and then sort the list accord to the number of times each word occurs.

Preliminary parsing of a wordlist

The above pipe fragment also filters the wordlist so that only words containing alphabetic characters are allowed through, as well as words with four or more characters. (The regular expression .{4,} reads: allow any string of four or more ({4,}) characters of any type (.). An expression .{5,7} would say – allow words through with length 5 to 7 characters.)

I’ve also added a short routine that implements a stop list. The regular expression pattern (?i)\b(word1|word2|word3)\b says: ignoring case ((?i)),try to match any of the words word1, word2, word3. (\b denotes word boundary.) Note that in the filter below, some of the words in my stop list are redundant (the ones with three or fewer characters. Remember, we have already filtered the word list to show only words of length four or more characters.)

Stop list

I also added a user input that allows additional stop terms to be added (they should be pipe (|) separated, with no spaces between them). You can find the pipe here.