Prompted by a request at the end of last week for some Twitter data I no longer have to hand, I revisited an old notebook script to try to tidy up some of my Twitter data grabbin’n’mungin’ scripts and have a quick play with some new toys, such as the pyLDAvis [demo] tool for analysing topic models in a notebook setting. (I generated my test models using gensimPy3, a Python 3 port of gensim, which all seemed to work okay…More on that in a later post, maybe…)
I also plugged in some entity extraction API calls for IBM’s Alchemy API and Thomson Reuters’ OpenCalais API. Both of these services provide a JSON response, and both are rate limited, so I’m caching responses for a while in a couple of MongoDB collections (one collection per service).
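The caching idea can be sketched along the following lines. This is a minimal illustration rather than the code I actually ran: the cached_entities function, its fetch argument (a stand-in for whatever calls the entity API) and the field default are all made up for the example, and it makes no attempt to expire old entries.

```python
def cached_entities(collection, user_id, description, fetch,
                    field='reuters_entities'):
    """Return the entities for a bio, calling the API only on a cache miss.

    collection -- a pymongo Collection (anything offering find_one and
                  replace_one with the same signatures will do)
    fetch      -- hypothetical callable: takes the bio text, calls the
                  rate-limited API, returns a list of entity dicts
    """
    doc = collection.find_one({'_id': user_id})
    if doc is not None and field in doc:
        # Cache hit: no API call, so nothing counted against the rate limit
        return doc[field]

    entities = fetch(description)
    # Upsert so that a record is created if this user has not been seen yet
    collection.replace_one(
        {'_id': user_id},
        {'_id': user_id, 'description': description, field: entities},
        upsert=True,
    )
    return entities
```

Keying each record on the Twitter user ID as _id means a repeat lookup for the same account is a single indexed read.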
Here’s an example of the sort of response we can get from a call to Open Calais:
[{'_id': 213808950, 'description': 'Daily Telegraph sub-editor, Newcastle United follower and future England cricketer', 'reuters_entities': [{'text': 'follower and future England cricketer', 'type': 'Position'}, {'text': 'Daily Telegraph', 'type': 'Company'}, {'text': 'sub-editor', 'type': 'Position'}, {'text': 'Daily Telegraph', 'type': 'PublishedMedium'}], 'screen_name': 'laurieallsopp'}]
Looking at that data, which I retrieved from my MongoDB (the reuters_entities are the bits I got back from the OpenCalais API), I wondered how I could query the database to pull back just the Position info, or just bios associated with a PublishedMedium.
It turns out that the $elemMatch operator was the one I needed: it tests the fields of each item in an array of subdocuments, which is essentially a wildcarded search into the list (it can be nested if you need to search deeper…):
load_from_mongo('twitter','calaisdata', criteria={'reuters_entities':{'$elemMatch':{'type':'PublishedMedium'}}}, projection={'screen_name':1, 'description':1,'reuters_entities.text':1, 'reuters_entities':{'$elemMatch':{'type':'PublishedMedium'}}})
In that example, the criteria definition limits the returned records to those that contain at least one entity of type PublishedMedium, and the $elemMatch in the projection returns just the first such matching element from each record.
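To make the difference between the two uses of $elemMatch concrete, here is a rough plain-Python imitation of what the query does. No MongoDB is involved, it only handles simple equality conditions, and the helper names and the second record are invented for the illustration.

```python
docs = [
    {'_id': 213808950, 'screen_name': 'laurieallsopp',
     'reuters_entities': [
         {'text': 'follower and future England cricketer', 'type': 'Position'},
         {'text': 'Daily Telegraph', 'type': 'Company'},
         {'text': 'sub-editor', 'type': 'Position'},
         {'text': 'Daily Telegraph', 'type': 'PublishedMedium'}]},
    # Made-up record with no Position or PublishedMedium entity
    {'_id': 1, 'screen_name': 'example_user',
     'reuters_entities': [{'text': 'Acme', 'type': 'Company'}]},
]


def elem_match(elements, conditions):
    """Return the array elements whose fields equal every condition."""
    return [e for e in elements
            if all(e.get(k) == v for k, v in conditions.items())]


def find_first_match(docs, field, conditions):
    """Imitate $elemMatch used in both the criteria and the projection."""
    results = []
    for doc in docs:
        hits = elem_match(doc.get(field, []), conditions)
        if hits:  # criteria: keep the record if ANY element matches
            results.append({'_id': doc['_id'],
                            'screen_name': doc['screen_name'],
                            field: hits[:1]})  # projection: FIRST match only
    return results


positions = find_first_match(docs, 'reuters_entities', {'type': 'Position'})
```

Note that the sample record has two Position entities, but only the first one comes back, which is why a record can match the criteria and still appear to be missing entities in the result.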
I can also run queries on job titles, or roles, as this example using grabbed AlchemyAPI data shows:
load_from_mongo('twitter','alchemydata', criteria={'ibm_entities':{'$elemMatch':{'type':'JobTitle'}}}, projection={'screen_name':1, 'description':1,'ibm_entities.text':1, 'ibm_entities':{'$elemMatch':{'type':'JobTitle'}}})
And so it follows that I could try to search for folk tagged as an editor (or variant thereof: editorial director, for example), modifying the query to additionally perform a case-insensitive search (I’m using pymongo to query the database):
criteria={'ibm_entities':{'$elemMatch':{'type':'JobTitle', 'text':{ '$regex' : 'editor', '$options':'i' }}}}
For a case-insensitive but otherwise exact search, use an expression of the form "^editor$" to force the search on the tag to match at the start (^) and end ($) of the string…
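The effect of the anchors is easy to check outside the database. The sketch below uses Python's re module as a stand-in for MongoDB's regex engine (the two differ in places, but not for patterns this simple), with a made-up list of job titles; the final dict is just the earlier criteria with the anchored pattern swapped in.

```python
import re

# Made-up job titles of the sort an entity extractor might return
titles = ['Editor', 'Sub-editor', 'Editorial Director', 'editor', 'Reporter']

# Equivalent of {'$regex': 'editor', '$options': 'i'}: matches anywhere
contains = re.compile('editor', re.IGNORECASE)
# Equivalent of {'$regex': '^editor$', '$options': 'i'}: whole string only
exact = re.compile('^editor$', re.IGNORECASE)

loose = [t for t in titles if contains.search(t)]
strict = [t for t in titles if exact.search(t)]

# The same anchored pattern dropped into the query criteria
criteria = {'ibm_entities': {'$elemMatch': {
    'type': 'JobTitle',
    'text': {'$regex': '^editor$', '$options': 'i'}}}}
```

Here loose picks up every title apart from Reporter, while strict keeps only Editor and editor.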
I’m not sure if such use of the entity data complies with the license terms though!
And of course, it would probably be much easier to just look up whether the description contains a particular word or phrase!