OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘grefine’

Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi

How can we get a quick snapshot of who’s talking to whom on Twitter in the context of a particular hashtag?

Here’s a quick recipe that shows how…

First we need to grab some search data. The Twitter API documentation provides us with some clues about how to construct a web address/URL that will grab results back from a particular search on Twitter in a machine readable way (that is, as data):

  • http://search.twitter.com/search.format is the base URL, and the format we require is json, which gives us http://search.twitter.com/search.json
  • the query we want is presented using the q= parameter: http://search.twitter.com/search.json?q=searchterm
  • if we want multiple search terms (for example, library skills), they need encoding in a particular way. The easiest way is just to construct your URL, enter it into the location/URL bar of your browser and hit enter, or use a service such as this string encoder. The browser should encode the URL for you. (If the only punctuation in your search phrase is spaces, you can encode them yourself: just change each space to %20, to give something like library%20skills. If you want to encode the # in a hashtag, use %23.)
  • We want to get back as many results as are allowed at any one time (which happens to be 100), so set rpp=100, that is: http://search.twitter.com/search.json?q=library%20skills&rpp=100
  • results are paged (in the sense of different pages of Google search results, for example), which means we can ask for the first 100 results, the second 100 results and so on, as far back as the most recent 1500 tweets (page 15 for rpp=100, or page 30 if we were using rpp=50, since 15*100 = 30*50 = 1500): http://search.twitter.com/search.json?q=library%20skills&rpp=100&page=1
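
The URL-building steps above can be sketched in Python. This is a minimal sketch that only constructs the URL strings and makes no network requests; the function name search_urls and its max_results parameter are illustrative assumptions, not part of the Twitter API.

```python
from urllib.parse import quote, urlencode

# Base URL for JSON results, as described in the list above
BASE_URL = "http://search.twitter.com/search.json"

def search_urls(query, rpp=100, max_results=1500):
    """Return one URL per results page, covering up to max_results tweets."""
    pages = max_results // rpp
    return [
        # quote_via=quote encodes spaces as %20 and # as %23
        BASE_URL + "?" + urlencode({"q": query, "rpp": rpp, "page": page}, quote_via=quote)
        for page in range(1, pages + 1)
    ]

urls = search_urls("library skills")
print(urls[0])
print(len(urls))
```

Each URL in the list corresponds to one page of results, so rpp=100 gives 15 URLs and rpp=50 gives 30.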

Clicking on Next provides us with a dialogue that will allow us to load the data from the URLs into Google Refine:

Clicking “Configure Parsing Options” loads the data and provides us with a preview of it:

If you inspect the data that is returned, you should see it has a repeating pattern. Hovering over the various elements allows you to identify what repeating part of the result we want to import. For example, we could just import each tweet:

Or we could import all the data fields – let’s grab them all:

If you click the highlighted text, or click “Update Preview View”, you can get a preview of how the data will appear. To return to the selection view, click “Pick Record Nodes”:

“Create Project” actually generates the project and pulls all the data in… The column names are a little messy, but we can tidy those:

Look for the from_user and to_user columns and rename them source and target respectively… (hovering over a column name pops up a tooltip that shows the full column name):

For the example I’m going to describe, we don’t actually need to rename the columns, but it’s handy to know how to do it;-)

We can now filter out all the rows with a “null” value in the target column. It seems a bit fiddly at first, but you soon get used to the procedure… Select the text facet to pop up a window that shows the unique elements in the target column and how often they occur. Sort the list by count, and click on the “null” element – it should be highlighted and its setting should appear as “exclude”. The table will now show only the rows that have the null value in the target column:

Click on the “Invert” option and the column will now filter out all the “null” elements and only show the elements that have a non-null value – that is, tweets that have a “to_user” value (which is to say, those tweets were sent to a particular user). Here’s what we get:

Let’s now export the source and target data so we can get it into Gephi:

Deselect all the columns, and then select source and target columns; also deselect the ‘output column headers’ – we don’t need headers where this file is going…

Export the custom layout as CSV data:

We can now import this data into another application – Gephi. Gephi is a cross platform package for visualising networks. In the simplest case, it can import two column data files where each row represents two things that are connected to each other. In our case, we have connections between “source” and “target” Twitter names – that is, connections that show when one Twitter user in our search sample has sent a message to another.
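
For comparison, here is roughly what the filter-and-export steps amount to in Python. It is a sketch under assumptions: the sample records are made up, and only the from_user and to_user field names come from the search results described above.

```python
import csv
import io

def edge_rows(tweets):
    """Keep (source, target) pairs for tweets that were sent to a particular user."""
    return [(t["from_user"], t["to_user"]) for t in tweets if t.get("to_user")]

# Made-up sample records, for illustration only
sample = [
    {"from_user": "alice", "to_user": "bob", "text": "@bob hello"},
    {"from_user": "carol", "to_user": None, "text": "no recipient"},
    {"from_user": "bob", "to_user": "alice", "text": "@alice hi"},
]

buffer = io.StringIO()
csv.writer(buffer).writerows(edge_rows(sample))  # two columns, no header row
print(buffer.getvalue())
```

The tweet with a null to_user is dropped, and each remaining row is one source,target connection.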

Launch Gephi and from the file menu, open the file you exported from Google Refine:

We’ve now got our data into Gephi, where we can start to visualise it…

…but that is a post for another day… (or if you’re impatient, you can find some examples of how to drive Gephi here).

Written by Tony Hirst

October 2, 2012 at 4:45 pm

Posted in Tinkering


Merging Data Sets Based on Partially Matched Data Elements

A tweet from @coneee yesterday about merging two datasets using columns of data that don’t quite match got me wondering about a possible R recipe for handling partial matching. The data in question related to country names in a datafile that needed fusing with country names in a listing of ISO country codes, although the recipe I’m describing here is intended to be a general purpose partial matcher. (See the comments for alternative methods for the special case of reconciling country names with country codes.)

The original data set had the form:

RANK,ECONOMY,PERCENTAGE OF INDIVIDUALS USING THE INTERNET 2011
1,Iceland,95
2,Norway,94
3,Netherlands,92.3
4,Sweden,91
5,Luxembourg,90.9
6,Denmark,90
7,Finland,89.4
8,Qatar,86.2
9,New Zealand,86
10,Switzerland,85.2
11,Liechtenstein,85
12,S. Korea,83.8
13,Germany,83
14,Canada,83
15,United Kingdom,82
16,Antigua & Barbuda,82

I’m not sure what country code listing was being used, but it probably looked something like this list of ISO Country Codes by Country:

ANDORRA;AD
ANGOLA;AO
ANGUILLA;AI
ANTARCTICA;AQ
ANTIGUA AND BARBUDA;AG
ARGENTINA;AR
ARMENIA;AM
ARUBA;AW
AUSTRALIA;AU
AUSTRIA;AT
AZERBAIJAN;AZ
BAHAMAS;BS
BAHRAIN;BH

It’s easy enough to reduce all the country names to lowercase characters so that we can try to match them exactly without worrying about any differences in capitalisation, but how do we match country names that don’t match exactly – ANTIGUA AND BARBUDA and Antigua & Barbuda, for example, or Central African Rep. and Central African Republic?

One trick is to use one of the well known partial string matching algorithms, such as the Levenshtein Distance. Here’s a recipe I hacked together that first tries to find an exact match on country names by attempting to merge the two country lists directly, and then tries to partially match any remaining unmatched names in the original list. A signature function is constructed to help out the partial matching attempt that reduces all words in the country name to lower case, sorts them alphabetically, and then concatenates them with no adjoining spaces.

(The signature idea was cribbed from the fingerprint that is available in Google Refine and that I employed in Merging Datasets with Common Columns in Google Refine.)
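
To make the signature idea concrete, here is a minimal Python sketch of the same steps (lower-case, split on spaces, sort, concatenate). It is an illustration of the idea only, not a translation of the recipe that follows.

```python
def signature(name):
    """Lower-case the name, split it on spaces, sort the words, join with no spaces."""
    return "".join(sorted(name.lower().split(" ")))

print(signature("United Kingdom"))
print(signature("ANTIGUA AND BARBUDA"))
print(signature("Antigua & Barbuda"))
```

Word order and capitalisation no longer matter, but "and" versus "&" still leaves the two Antigua signatures slightly different, which is where the partial matching step comes in.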

[@downes commented that the code wasn't really written with clarity in mind - so I've added some comments...]

#Load in the data from the URLs:
PercentageUsingTheNet=read.csv(url('http://s.telegraph.co.uk/graphics/conrad/PercentageUsingTheNet.csv', encoding='MACROMAN'))
ccode=read.csv(url('http://www.iso.org/iso/country_names_and_code_elements_txt'),sep=';')

##Here's where the algorithm starts...
##I'm going to generate a signature from country names to reduce some of the minor differences between strings
##In this case:
### convert all characters to lower case (tolower())
### split the string into a vector (unlist()) of separate words (strsplit())
### sort the words alphabetically (sort())
### and then concatenate them with no spaces (paste(y,collapse='')).
##So for example, United Kingdom would become kingdomunited
##To extend this function, we might also remove stopwords such as 'the' and 'of', for example (not shown).
signature=function(x){
  sig=paste(sort(unlist(strsplit(tolower(x)," "))),collapse='')
  return(sig)
}

#The partialMatch function takes two wordlists as vectors (x,y) and an optional distance threshold (levDist)
#The aim is to find words in the second list (y) that match or partially match words in the first (x)
partialMatch=function(x,y,levDist=0.1){
  #Create a data frame containing the signature for each word
  xx=data.frame(sig=sapply(x, signature),row.names=NULL)
  yy=data.frame(sig=sapply(y, signature),row.names=NULL)
  #Add the original words to the data frame too...
  xx$raw=x
  yy$raw=y
  #We only want words that have a signature...
  xx=subset(xx,subset=(sig!=''))

  #The first matching pass - are there any rows in the two lists that have exactly the same signature?
  xy=merge(xx,yy,by='sig',all=T)
  matched=subset(xy,subset=(!(is.na(raw.x)) & !(is.na(raw.y))))
  #?I think matched=xy[ complete.cases(raw.x,raw.y) ] might also work here?
  #Label the items with identical signatures as being 'Duplicate' matches
  matched$pass="Duplicate"

  #Grab the rows from the first list that were unmatched - that is, no matching item from the second list appears
  todo=subset(xy,subset=(is.na(raw.y)),select=c(sig,raw.x))
  #We've grabbed the signature and original raw text from the first list that haven't been matched up yet
  #Name the columns so we know what's what
  colnames(todo)=c('sig','raw')

  #This is the partial matching magic - agrep finds items in the second list that are within a 
  ## certain Levenshtein distance of items in the first list.
  ##Note that I'm finding the distance between signatures.
  todo$partials= as.character(sapply(todo$sig, agrep, yy$sig,max.distance = levDist,value=T))

  #Bring the original text into the partial match list based on the sig key.
  todo=merge(todo,yy,by.x='partials',by.y='sig')

  #Find the items that were actually partially matched, and pull out the columns relating to signatures and raw text
  partial.matched=subset(todo,subset=(!(is.na(raw.x)) & !(is.na(raw.y))),select=c("sig","raw.x","raw.y"))
  #Label these rows as partial match items
  partial.matched$pass="Partial"
  #Add the set of partially matched items to the set of duplicate matched items
  matched=rbind(matched,partial.matched)
  
  #Find the rows that still haven't been matched
  un.matched=subset(todo,subset=(is.na(raw.x)),select=c("sig","raw.x","raw.y"))

  #If there are any unmatched rows, add them to the list of matched rows, but labelled as such
  if (nrow(un.matched)>0){
    un.matched$pass="Unmatched"
    matched=rbind(matched,un.matched)
  }

  #Grab the columns of raw text from x and y from the matched list, along with how they were matched/are unmatched
  matched=subset(matched,select=c("raw.x","raw.y","pass"))
  #Ideally, the length of this should be the same as the length of valid rows in the original first list (x)

  return(matched)
}

#A rogue character in @coneee's data file borked things for me, so I originally needed to do a character code conversion first
#PercentageUsingTheNet$ECONOMY=iconv(PercentageUsingTheNet$ECONOMY)
#Loading the CSV with the encoding attribute set (h/t Kent Johnson) seemed to work properly though...

#Call the partial match function using two vectors
#The aim is to find items in the second vector that partially match items in the first
#The function returns the first list annotated with partial match items from the second and a match type
matches=partialMatch(PercentageUsingTheNet$ECONOMY,ccode$Country.Name)

As ever, this code was arrived at by tinkering and searching on Stack Overflow (using search terms along the lines of “partial match R” and “R levenshtein”). If you can improve on it, please do so and paste a link to the improved code, or a code fragment itself, in the comments below:-)

UPDATE: via the comments, the following suggestion that I don’t have time to check right now…
#Bring the original text into the partial match list based on the sig key.
-todo=merge(todo,yy,by.x='partials',by.y='sig')
+todo=merge(todo,yy,by.x='partials',by.y='sig',all.x=T)

#Label these rows as partial match items
-partial.matched$pass="Partial"
+if (nrow(partial.matched) > 0) partial.matched$pass="Partial"

Thanks:-)

When we run the script and look at the contents of the matches dataframe, this is an example of what we get:

This data frame can then be merged with the originally loaded data to give us the required country code annotations:

a=PercentageUsingTheNet
b=ccode
#Merge the original data set with the ISO country code country name keys
aa=merge(a,matches,by.x='ECONOMY',by.y='raw.x',all.x=T)
#Merge in the ISO country codes
aa=merge(aa,b,by.x='raw.y',by.y='Country.Name',all.x=T)
aa=subset(aa,select=c('ECONOMY','RANK','PERCENTAGE.OF.INDIVIDUALS.USING.THE.INTERNET.2011','ISO.3166.1.alpha.2.code'))

Something like this, for example:

Unfortunately, not every country in the original data set is matched:

In particular, here are the unmatched items and what they presumably should have been matched with:

Lao P.D.R. - LAO PEOPLE'S DEMOCRATIC REPUBLIC
Syria - SYRIAN ARAB REPUBLIC
S. Korea - KOREA, REPUBLIC OF
Bolivia - BOLIVIA, PLURINATIONAL STATE OF
Russia - RUSSIAN FEDERATION
Guinea Bissau - GUINEA-BISSAU
St. Vincent & Grenadines - SAINT VINCENT AND THE GRENADINES
S. Sudan - SOUTH SUDAN
Eq. Guinea - EQUATORIAL GUINEA
Congo (Dem. Rep.) - CONGO, THE DEMOCRATIC REPUBLIC OF THE
Congo (Kinshasa) - CONGO
Slovak Republic - SLOVAKIA
Iran - IRAN, ISLAMIC REPUBLIC OF
TFYR Macedonia - MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF

We could try to increase the Levenshtein Distance within which a partial match is suggested, but then we run the risk of introducing false positives.
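
The trade-off can be illustrated with a small Python sketch of the Levenshtein distance. It uses an absolute edit distance as the threshold, which is a simplifying assumption; the R code above passes levDist to agrep as max.distance instead.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]

def within(a, b, max_distance):
    """True if the two strings are no more than max_distance edits apart."""
    return levenshtein(a, b) <= max_distance

print(levenshtein("niger", "nigeria"))
print(within("niger", "nigeria", 1), within("niger", "nigeria", 2))
```

Here niger and nigeria are only two edits apart, so a threshold loose enough to absorb a couple of spelling differences would also pair up two different countries.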

Removing stopwords as part of the signature function may help, for example in the case of St. Vincent & Grenadines. (Also, I’m not sure what went wrong with Guinea Bissau?) We could presumably also make something of the abbreviations (S., Dem., Rep., Eq., P. D. R.), using these as NGrams or stubs in a regular expression? So for example, convert Lao P. D. R. to a regular expression invoking searches for a three word phrase in which consecutive words start with P, D, R. (Of course, this would require matching across the full country name rather than the signature.)
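
Here is a hedged Python sketch of a signature with stopword removal. The stopword list is a made-up example, and splitting on any non-alphanumeric character (rather than just on spaces) is my own assumption, not something the recipe above does.

```python
import re

# Hypothetical stopword list; extend as needed
STOPWORDS = {"and", "of", "the"}

def signature_nostop(name):
    """Signature built from alphanumeric words only, with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "".join(sorted(w for w in words if w not in STOPWORDS))

print(signature_nostop("Antigua & Barbuda"), signature_nostop("ANTIGUA AND BARBUDA"))
print(signature_nostop("Guinea Bissau"), signature_nostop("GUINEA-BISSAU"))
print(signature_nostop("St. Vincent & Grenadines"), signature_nostop("SAINT VINCENT AND THE GRENADINES"))
```

With these changes the first two pairs produce identical signatures. That hints at one possible reason for the Guinea Bissau miss: when splitting only on spaces, GUINEA-BISSAU is a single word, so sorting cannot reorder it to line up with bissauguinea. The St. Vincent pair still differs (st versus saint), so it would still need the partial matching pass.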

The Google Refine documentation page Clustering In Depth (Methods and theory behind the clustering functionality in Google Refine) also identifies several other fuzzy/partial matching algorithms that could be tried here. In which case, I guess I need to rewrite the above function so that it can accommodate different matching algorithms? If you manage to implement any of these other fuzzy matching approaches, please post a link in the comments.

PS If you use any other approaches for partial matching of strings, please feel free to share how you do it in the comments below…:-) For example:
theBioBucket: Merging Dataframes by Partly Matching String. This post also introduces adist(), an R function that returns the Levenshtein distance between two strings.

Written by Tony Hirst

September 26, 2012 at 9:00 pm

Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column

One of the things I’ve kept stumbling over in Google Refine is how to use it to reshape a data set, so I had a little play last week and worked out a couple of new (to me) recipes.

The first relates to reshaping data by creating new rows based on columns. For example, suppose we have a data set that has rows relating to Olympics events, and columns relating to Medals, with cell entries detailing the country that won each medal type:

However, suppose that you need to get the data into a different shape – maybe one line per country with an additional column specifying the medal type. Something like this, for example:

How can we generate that sort of view from the original data set? Here’s one way, that works when the columns you want to split into row values are contiguous (that is, next to each other). From the first column in the set of columns you want to be transformed, select Transpose > Transpose cells across columns into rows:

We now set the original selected column headers to be the cell value within a new column – MedalType – and the original cell values the value within a Country column:

(Note that we could also just transform the data into a single column. For example, suppose we had columns relating to courses currently taken by a particular student (Course 1, Course 2, Course 3), with a course code as cell value and one, two or three columns populated per student. If we wanted one row per student per course, we could just map the three columns onto a single column – CourseCode – and assign multiple rows to each student, then filter out rows with a blank value in the CourseCode column as required.)

Ticking the Fill down in other columns checkbox ensures that the appropriate Sport and Event values are copied in to the newly created rows:
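
The same reshaping can be sketched in Python, which may help make clear what the transpose operation is doing. The column names Gold, Silver and Bronze and the sample values are assumptions for illustration; MedalType, Country, Sport and Event are the names used above.

```python
def columns_to_rows(rows, id_columns, value_columns, key_name, value_name):
    """Turn each (row, value column) pair into a row of its own."""
    out = []
    for row in rows:
        for column in value_columns:
            if not row.get(column):
                continue  # skip blank cells
            new_row = {c: row[c] for c in id_columns}  # plays the role of "Fill down"
            new_row[key_name] = column                 # the old column header
            new_row[value_name] = row[column]          # the old cell value
            out.append(new_row)
    return out

# Made-up sample row, for illustration only
medals = [{"Sport": "Sport 1", "Event": "Event 1", "Gold": "AAA", "Silver": "BBB", "Bronze": "CCC"}]
long_rows = columns_to_rows(medals, ["Sport", "Event"], ["Gold", "Silver", "Bronze"], "MedalType", "Country")
for r in long_rows:
    print(r)
```

One input row with three medal columns becomes three rows, each carrying the Sport and Event values along with a MedalType and a Country.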

Having worked out how to do that oft-required bit of data reshaping, I thought I could probably have another go at something that has been troubling me for ages – how to generate multiple rows from a single row where one of the columns contains JSON data (maybe pulled from a web service/API) that contains multiple items. This is a “mate in three” sort of problem, so here’s how I started to try to work it back. Given that I now know how to map columns onto rows, can I work out how to map different results in a JSON response onto different columns?

For example, here’s a result from the Facebook API for a search on a particular OU course code and the word open in a Facebook group name:

{"data":[{"version":1,"name":"U101 (Open University) start date February 2012","id":"325165900838311"},{"version":1,"name":"Open university, u101- design thinking, October 2011","id":"250227311674865"},{"version":1,"name":"Feb 2011 Starters U101 Design Thinking - Open University","id":"121552081246861"},{"version":1,"name":"Open University - U101 Design Thinking, Feburary 2011","id":"167769429928476"}],"paging":{"next":…etc…}}

It returns a couple of results in the data element, in particular group name and group ID. Here’s one way I found of creating one row per group… Start off by creating a new column based on the JSON data column that parses the results in the data element into a list:

We can then iterate over the list items in this new column using the forEach grel command. The join command then joins the elements within each list item, specifically the group ID and name values in each result:

forEach(value.parseJson(),v,[v.id,v.name].join('||'))

You’ll notice that for multiple results, this produces a list of joined items, which we can also join together by extending the GREL expression:

forEach(value.parseJson(),v,[v.id,v.name].join('||')).join('::')

We now have a column that contains ‘||’ and ‘::’ separated items – :: separates individual group results from each other, || separates the id and name for each particular group.
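
As a cross-check on what the GREL expression produces, here is a rough Python analogue applied to the first two results from the example response. The function name group_pairs is my own.

```python
import json

# The first two items of the "data" element from the example response above
data_json = '''[{"version":1,"name":"U101 (Open University) start date February 2012","id":"325165900838311"},
{"version":1,"name":"Open university, u101- design thinking, October 2011","id":"250227311674865"}]'''

def group_pairs(cell):
    """Join id and name with '||' for each result, then join the results with '::'."""
    return "::".join("||".join([v["id"], v["name"]]) for v in json.loads(cell))

print(group_pairs(data_json))
```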

Given we know how to create rows from multiple columns, we could try to split this column into separate columns using Edit column > Split into separate columns. This would create one column per result, which we could then transform into rows, as we did above. Whilst I don’t recommend this route in this particular case, here’s how we could go about doing it…

A far better approach is to use the Edit cells > split multi-valued cells option to automatically create new rows based on splitting the elements in a single column:

Note, however, that this creates blanks in the other columns, so we need to Edit cells > Fill down to fill in the blanks in any other columns we want to refer to. After doing that, we end up with something like this:

We could now split the groupPairs column using the || separator to create two columns – Group ID and group name – giving us one row per group, and separate columns identifying the course, group name and group ID.
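
Taken together, the split, fill down and column split steps behave something like the following Python sketch. The course column name and its value are assumptions; groupPairs and the two separators are as described above.

```python
def split_to_rows(rows, column, row_sep="::", pair_sep="||"):
    """Create one row per group, with separate Group ID and group name columns."""
    out = []
    for row in rows:
        for pair in row[column].split(row_sep):
            group_id, group_name = pair.split(pair_sep, 1)
            # Copying the other columns into each new row plays the role of "Fill down"
            new_row = {k: v for k, v in row.items() if k != column}
            new_row["Group ID"] = group_id
            new_row["group name"] = group_name
            out.append(new_row)
    return out

rows = [{
    "course": "U101",
    "groupPairs": "325165900838311||U101 (Open University) start date February 2012"
                  "::250227311674865||Open university, u101- design thinking, October 2011",
}]
for r in split_to_rows(rows, "groupPairs"):
    print(r)
```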

If the above route seems a little complicated, fear not… Once you apply it, it starts to make sense!

Written by Tony Hirst

July 30, 2012 at 11:50 am

Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine

What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…

After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)

Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list (https://graph.facebook.com/me/friends?access_token=A_LONG_JUMBLE_OF_LETTERS) or the list of likes someone has made (https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS); replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.

(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)

Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:

Google Refine - import Facebook friends list

Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:

Google Refine - import Facebook friends

If you click the highlighted selection, you should see the data that will be used to create your project:

Google Refine - click to view the data

You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:

Google Refine - rename columns

We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:

Google Refine - new column from URL

The Likes URL has the form https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS which we’ll tinker with as follows:

Google Refine - crafting URLs for new column creation

The throttle control tells Refine how long to wait between calls. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in the data for my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what rate limit the Facebook API is happy with; if you hit it too fast (i.e. set the throttle delay too low), you may find the Facebook API stops returning data to you for a cooling down period…
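
Outside Refine, the same "one URL per friend, with a pause between calls" pattern might look like this in Python. It is a sketch under assumptions: the fetch argument is a stand-in for whatever actually downloads a URL, the friend IDs are made up, and the fake fetcher is there so the example runs without network access or a real access token.

```python
import time

# URL pattern from above, with "me" replaced by a friend's Facebook ID
LIKES_URL = "https://graph.facebook.com/{friend_id}/likes?access_token={token}"

def likes_url(friend_id, token):
    """Build the Likes URL for one friend."""
    return LIKES_URL.format(friend_id=friend_id, token=token)

def fetch_all(friend_ids, token, fetch, delay_seconds=0.5):
    """Call fetch once per friend, pausing delay_seconds between calls."""
    results = {}
    for n, friend_id in enumerate(friend_ids):
        if n > 0:
            time.sleep(delay_seconds)  # the throttle
        results[friend_id] = fetch(likes_url(friend_id, token))
    return results

def fake_fetch(url):
    """Stand-in fetcher that returns an empty result instead of calling the API."""
    return '{"data": []}'

print(fetch_all(["1001", "1002"], "A_LONG_JUMBLE_OF_LETTERS", fake_fetch, delay_seconds=0))
```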

Having imported the data, you should find a new column:

Google Refine - new data imported

At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not..). I found this caused Refine to hang though, so instead I exported the data using the default Templating… export format, which produces some sort of JSON output…

I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:

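import csv
import json

fn = 'my-fb-friends-likes.txt'

# Load the JSON produced by the Templating… export
with open(fn, 'r') as f:
    data = json.load(f)

with open('fbliketest.csv', 'w', newline='') as out:
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    # A running counter gives each friend a (new) unique identifier
    for friend_id, d in enumerate(data['rows'], start=1):
        # 'interests' is the column name containing the Likes data
        interests = json.loads(d['interests'])
        for i in interests['data']:
            print(friend_id, i['name'], i['category'])
            # Strip any non-ASCII characters from the name before writing it out
            name = i['name'].encode('ascii', 'ignore').decode('ascii')
            writer.writerow([str(friend_id), name])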

[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]

I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:

Sketching common likes amongst my facebook friends

Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.

[See also: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and how to visualise Google+ networks]

PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…
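
For what it’s worth, the pattern described in this PS is easy enough to sketch in Python. The function name, the sample row and the like column name are all made up for illustration; the interests column and its data list mirror the export described above.

```python
import json

def rows_from_column(rows, column, new_column, items_key="data"):
    """For each item in a row's JSON cell, emit a copy of the row with the item in a new column."""
    out = []
    for row in rows:
        items = json.loads(row[column])[items_key]
        for item in items:
            new_row = dict(row)         # 1) duplicate the original data row
            new_row[new_column] = item  # 2) and 3) add the new column and populate it
            out.append(new_row)
    return out

# Made-up sample row, for illustration only
friends = [{
    "name": "Friend A",
    "interests": '{"data": [{"name": "Gephi", "category": "Software"}, {"name": "R", "category": "Software"}]}',
}]
for r in rows_from_column(friends, "interests", "like"):
    print(r["name"], r["like"]["name"])
```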

PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)

PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.

PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior – by Kosinski et al. suggests it’s possible to profile people based on their Likes. It would be interesting to compare how robust that profiling is, compared to profiles based on the common Likes of a person’s followers, or the common likes of folk in the same Facebook groups as an individual?

Written by Tony Hirst

January 4, 2012 at 11:06 am
