In a couple of days, I’ll be at RepoFringe 2010, where the emphasis this year will be on “OPEN: Open Data; Open Access; Open Learning; Open Knowledge; Open Content; etc…” I’m writing this post (a scheduled post, written over the weekend) in advance of putting my presentation together – so I’m not sure what it’ll be on yet! – but to get myself into the swing of things I started looking at what some of the repository bloggers have been thinking about lately, with a view to maybe doing a quick hack inspired by one or more of their posts…
…and it didn’t take long to find an itch to scratch… In a couple of recent posts looking at the extent to which personal document and metadata collections using apps like Mendeley might be seen as a figure:ground complement to a repository, (Comparing Social Sharing of Bibliographic Information with Institutional Repositories, More on Mendeley and Repositories), Les Carr started to explore “the extent of Mendeley’s penetration into a University. What is visible is the public profiles that Mendeley users have created. Although the Mendeley API doesn’t allow searching for users, I have been able to identify 53 public profiles from the University of Cambridge through Google (and a lot of manual verification!)” [my emphasis].
Hmmm… Sounds like that was a bit of a chore… can we finesse an API for that, I wonder?
To see how I put this Pipe together, let’s see what Google gives us (I’m limiting the search to mendeley.com because that’s where I want to find the profiles):
Note that there are several useful things we can spot simply from inspection of the Google search results:
– user profile information on Mendeley is located on URLs of the form http://www.mendeley.com/profiles/, so we can refine the site: search limit to take that into account (i.e. by using the limit site:www.mendeley.com/profiles/);
– the institution name, if appropriately declared, appears in the page title, which gives the headline search result in Google results listings; so we can use search limits of the form intitle:"cambridge university", or the more general intitle:cambridge;
– sometimes (not shown in the image above), our search term appears in the title, but it’s the wrong one… So for example, if we have researchers in “Cambridge Massachusetts”, we may want to exclude results with Massachusetts in the title by negating an intitle limit: -intitle:massachusetts
Putting those techniques together, and to test things out, we should be able to search for members of our institution using something like: site:www.mendeley.com/profiles/ intitle:"cambridge" -intitle:massachusetts
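If you wanted to generate that sort of query string programmatically rather than typing it in by hand, a minimal Python sketch might look something like the following. The function name and its parameters are my own invention for illustration; the only things taken from the search above are the site:, intitle: and -intitle: limits.

```python
def build_profile_query(institution, exclude=(), site="www.mendeley.com/profiles/"):
    """Build a Google query limited to profile pages for one institution.

    institution: phrase that should appear in the page title
    exclude:     words that must NOT appear in the page title
    site:        URL prefix used for the site: limit
    """
    parts = [f"site:{site}", f'intitle:"{institution}"']
    # Each excluded word becomes a negated intitle: limit
    parts.extend(f"-intitle:{term}" for term in exclude)
    return " ".join(parts)


query = build_profile_query("cambridge", exclude=["massachusetts"])
print(query)
```

Changing the site argument should let the same helper cover other sites that put affiliation details in the page title.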
What else can we learn just by looking at the search results?
– if somebody’s surname matches the institution name, that may be returned as a result (e.g. Darren Cambridge). If we inspect the title, we see it has a regular structure: Name – Institution. Having got the results from Google, if we strip the name out of the title to leave just the affiliation, and then filter the results again to check that the search term appears in the affiliation, we can remove these false positives. (I have used this “double dip” search-then-filter approach in other contexts. For example, Paragraph Level Search Results on WordPress Using Digress.it and Yahoo Pipes.)
We’re now in a position to build a Yahoo pipe to create some sort of API to provide a Mendeley status search. A good way of getting Google search results into the Pipes environment is to use Google’s AJAX search API. The Google AJAX search API returns either four or eight results at a time, along with an indication of how many other “pages” of results there are, and the index of the first result on each page. (So for example, on the first page, the index of the first result is 0. For pages with four results, the index of the first result on the second page is 4, on the third page it is 8, and so on.) The first results page is complete – we actually get the search results listed. For the other results pages, the API only provides a list of the pages available and the index of the first result on each one. To call results from the later pages, given the index of the first result, we use the additional URL argument &start=index.
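The paging arithmetic is simple enough to sketch in a few lines of Python. This is only an illustration of the indexing scheme described above: the helper names are made up, and the example.org address is a placeholder standing in for whatever first-page search URL you are actually calling, not the real API endpoint.

```python
def page_start_indices(page_size, page_count):
    """Index of the first result on each page, counting from 0."""
    return [page * page_size for page in range(page_count)]


def later_page_url(first_page_url, start):
    """Add the start argument to a URL that already carries the query."""
    return f"{first_page_url}&start={start}"


# Placeholder URL (an assumption for illustration only)
first_page_url = "https://example.org/search?q=test"

for start in page_start_indices(page_size=4, page_count=3):
    print(start, later_page_url(first_page_url, start))
```

With four results per page the start values come out as 0, 4, 8; with eight results per page they would be 0, 8, 16 and so on.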
The first part of the Pipe constructs the AJAX API URL. The user inputs are a bit of a fudge (and the result of a bit of trial and error!) to try to support as clean a results set as possible by virtue of how we phrase the search query…
Here’s where we construct the URL, and then fetch the data:
Just by the by, we can use another (unsaved) pipe to act as a previewer for AJAX search results:
If we want our pipe to display the results from all the pages, we need to grab the list of responseData.cursor.pages and then generate the “more results” page for each one. So, grab the list of pages and first result index data:
and then create a URL for each of these, before grabbing the results from each results page:
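Outside of Pipes, the same two steps (pull out the page list, then build a URL for each later page) could be sketched in Python roughly as follows. Only the responseData.cursor.pages path comes from the pipe; the start and label field names, the sample response and the example.org URL are assumptions made up for the sake of a runnable example.

```python
# Hand-made stand-in for a parsed JSON response (structure partly assumed)
sample_response = {
    "responseData": {
        "results": [],
        "cursor": {
            "pages": [
                {"start": "0", "label": 1},
                {"start": "8", "label": 2},
                {"start": "16", "label": 3},
            ]
        },
    }
}


def more_results_urls(response, first_page_url):
    """Build one URL per results page after the first."""
    pages = response["responseData"]["cursor"]["pages"]
    return [
        f"{first_page_url}&start={page['start']}"
        for page in pages
        if int(page["start"]) > 0  # the first page has already been fetched
    ]


for url in more_results_urls(sample_response, "https://example.org/search?q=test"):
    print(url)
```

Each of those URLs would then be fetched in turn and the results merged with the ones from the first page.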
Note that we are using the same query string that we used in the original search. (Also note that we only seem to get at most 64 responses; maybe the page lists returned with pages later than the first provide indices for more results? That is, maybe each results page lists indices for at most 8 pages of results?)
Having got the search results, we rename the results attributes to generate valid RSS elements (title, description, link) and do a spot of post filtering.
Remember the case where the search result appeared because the institution name was actually someone’s surname? The Regular Expression block strips out the Mendeley user’s name and allows us to filter the results on just the affiliation, to remove those false positives (the filter lets through results where the institution search term appears in the title’s affiliation part):
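Here’s a rough Python equivalent of that strip-then-filter step, assuming titles follow the Name – Institution pattern. The regular expression is my own guess at a reasonable pattern rather than the one used in the pipe, and the sample titles are invented.

```python
import re


def affiliation_of(title):
    """Return the part of a 'Name – Institution' title after the separator."""
    # Split once on a dash (en dash or hyphen) that has whitespace on both
    # sides, so hyphenated names such as Jean-Paul are left alone.
    parts = re.split(r"\s+[–-]\s+", title, maxsplit=1)
    return parts[1] if len(parts) == 2 else ""


def filter_by_affiliation(titles, term):
    """Keep only the titles whose affiliation part mentions the term."""
    term = term.lower()
    return [t for t in titles if term in affiliation_of(t).lower()]


# Invented titles, for illustration only
titles = [
    "Alice Example – University of Cambridge",
    "A. Cambridge – Example Institute",
]
print(filter_by_affiliation(titles, "cambridge"))
```

The second title mentions Cambridge only in the name part, so it gets dropped as a false positive.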
The resulting pipe allows us to search for Mendeley users by institution:
(Having built the pipe, I think that an even more robust approach might be to tokenise the search terms required in the title and then add them as separate intitle: limits. So for example intitle:cambridge intitle:university would find pages where both Cambridge University and University of Cambridge appear in the page title.)
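The tokenising idea is easy to sketch in Python. The function name is hypothetical, and splitting on whitespace is just the simplest tokenisation I could think of.

```python
def intitle_limits(phrase):
    """Turn each word of a phrase into its own intitle: limit."""
    return " ".join(f"intitle:{token}" for token in phrase.lower().split())


print(intitle_limits("Cambridge University"))
```

Because each word is matched independently, word order in the page title should no longer matter.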
So that’s the pipe…
In many ways, it implements some sort of “stalker pattern” based on profile information that is released via title elements on personal profile pages. I’ve demonstrated a similar approach previously in A Couple of Twitter Search Tricks, which shows (courtesy of an update I added after a tweet from @daveyp) how to do a similar sort of search to find folk twittering with a university allegiance. In fact, here’s a pipe to do something that approximates to just that – Twitter profile search (via Google and Yahoo Pipes):
A quick scout round other social networks shows that this is a trick we can use widely:
- site:uk.linkedin.com/in intitle:smith (try using different country codes for the subdomain to search different countries)
- site:facebook.com intitle:smith
- site:myspace.com "milton keynes" intitle:"25- Female" (the original demo, hence “the stalker pattern” epithet!)
Unfortunately, it’s not obvious how to search for anything other than name on Slideshare or Scribd (that is, there is no obviously easy way of searching for members of an institution on Slideshare). This in turn suggests to me that if you are developing a site with a social element, and you want people to be able to use things like Google search to finesse additional, structured search functionality over the site (as in the Mendeley user profile search pipe), you should design title elements with all due consideration…
PS In his original post, Les Carr went on: “Incredibly, only TWO of those 53 researchers have any existing deposits in Cambridge’s institutional repository.” So maybe the next step would be to build some pipework to run Mendeley-discovered users against corresponding institutional repositories? ;-)