Picking up on a query I raised in Citation Positioning, here’s a quick summary of an online discussion featuring variously @edsu, @epoz, @ostephens and myself (I’m the one who knows absolutely nothing…!)
The context is: can I use the OAI-PMH interface on Citeseer to grab record-level, machine-readable results from Citeseer? Note that I don’t really want to harvest all the Citeseer data, pop it into a database of my own, and then run queries on that; I just want a quick and dirty API that lets me make a handful of calls for particular queries, for a proof of concept hack;-)
Here’s what the Citeseer HTML page looks like:
It has a URL of the form: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.7284
The tabbed results pages have their own URLs:
– Active Bibliography, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=ab
– Co-Citation, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=cc
– Clustered Documents, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=sc
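Since the three tab URLs differ only in the type parameter, they are easy to generate for any document id. Here’s a minimal Python sketch; the tab_urls helper and the TAB_TYPES table are names I’ve made up for illustration, and the only things taken from Citeseer are the base URL and the ab/cc/sc codes visible in the URLs above.

```python
from urllib.parse import urlencode

# Base URL and "type" codes copied from the example tab URLs; nothing here is an official API.
SIMILAR_URL = "http://citeseer.ist.psu.edu/viewdoc/similar"
TAB_TYPES = {
    "Active Bibliography": "ab",
    "Co-Citation": "cc",
    "Clustered Documents": "sc",
}


def tab_urls(doi):
    """Return a {tab name: URL} dict for one Citeseer document id (hypothetical helper)."""
    return {
        name: SIMILAR_URL + "?" + urlencode({"doi": doi, "type": code})
        for name, code in TAB_TYPES.items()
    }


urls = tab_urls("10.1.1.122.7284")
for name, url in urls.items():
    print(name, url)
```

These are still HTML pages, of course, so anything pulled from them would need scraping.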
Here’s what I’m guessing:
– the ‘front page’ results are links to papers that reference/cite the target article, ordered by the number of times that they themselves have been cited; this is a subset of the total set of papers that cite the target article;
– the Active Bibliography is a subset of the articles that are referenced from/cited by the target article that have themselves been recently cited elsewhere (?! I’m guessing – the Citeseer site doesn’t seem to provide an explanation anywhere?)
– the co-citations are… I have no idea? Other papers that have been cited by papers that cite the target paper?
– Clustered Documents – these seem to be other Citeseer records relating to the same paper; do they all have the same citation info? I have no idea?????
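As far as the OAI interface goes, it seems we can grab an individual record using a query of the form (reconstructed here from the request details echoed back in the response):
http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.122.7284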
which returns a result of the form:
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2011-12-08T16:24:31+00:00</responseDate>
  <request identifier="oai:CiteSeerX.psu:10.1.1.122.7284" metadataPrefix="oai_dc" verb="GetRecord">http://citeseerx.ist.psu.edu/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:CiteSeerX.psu:10.1.1.122.7284</identifier>
        <datestamp>2009-05-28</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
          <dc:title>The structure and function of complex networks</dc:title>
          <dc:creator>M. E. J. Newman</dc:creator>
          <dc:description>Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.</dc:description>
          <dc:contributor>The Pennsylvania State University CiteSeerX Archives</dc:contributor>
          <dc:publisher/>
          <dc:date>2009-05-28</dc:date>
          <dc:date>2008-12-04</dc:date>
          <dc:date>2003</dc:date>
          <dc:format>application/pdf</dc:format>
          <dc:type>text</dc:type>
          <dc:identifier>http://citeseerx.ist.psu.edu/citeseerx/viewdoc/summary?doi=10.1.1.122.7284</dc:identifier>
          <dc:source> </dc:source>
          <dc:language>en</dc:language>
          <dc:relation>10.1.1.109.4049</dc:relation>
          <dc:relation>10.1.1.120.3875</dc:relation>
          <dc:relation>10.1.1.31.1768</dc:relation>
          <dc:relation>10.1.1.153.5943</dc:relation>
          <dc:relation>10.1.1.37.234</dc:relation>
          <dc:relation>10.1.1.18.2720</dc:relation>
          <dc:relation>10.1.1.30.6583</dc:relation>
          <dc:relation>10.1.1.25.5619</dc:relation>
          <dc:relation>10.1.1.104.3739</dc:relation>
          <dc:relation>10.1.1.56.6742</dc:relation>
          <dc:relation>10.1.1.117.7097</dc:relation>
          <dc:relation>10.1.1.15.8793</dc:relation>
          <dc:relation>10.1.1.33.1635</dc:relation>
          <dc:relation>10.1.1.139.1580</dc:relation>
          <dc:relation>10.1.1.30.9552</dc:relation>
          <dc:relation>10.1.1.184.8874</dc:relation>
          <dc:relation>10.1.1.24.6195</dc:relation>
          <dc:relation>10.1.1.16.478</dc:relation>
          <dc:relation>10.1.1.31.3763</dc:relation>
          <dc:relation>10.1.1.25.7011</dc:relation>
          <dc:relation>10.1.1.37.5917</dc:relation>
          <dc:relation>10.1.1.84.9512</dc:relation>
          <dc:relation>10.1.1.7.1950</dc:relation>
          <dc:relation>10.1.1.129.6877</dc:relation>
          <dc:relation>10.1.1.25.1360</dc:relation>
          <dc:relation>10.1.1.16.1168</dc:relation>
          <dc:relation>10.1.1.115.8316</dc:relation>
          <dc:relation>10.1.1.143.1502</dc:relation>
          <dc:relation>10.1.1.130.1956</dc:relation>
          <dc:relation>10.1.1.20.814</dc:relation>
          <dc:relation>10.1.1.21.838</dc:relation>
          <dc:relation>10.1.1.16.2407</dc:relation>
          <dc:relation>10.1.1.23.9684</dc:relation>
          <dc:relation>10.1.1.62.7557</dc:relation>
          <dc:relation>10.1.1.16.6906</dc:relation>
          <dc:relation>10.1.1.2.4033</dc:relation>
          <dc:relation>10.1.1.43.7796</dc:relation>
          <dc:relation>10.1.1.25.1174</dc:relation>
          <dc:relation>10.1.1.10.4509</dc:relation>
          <dc:relation>10.1.1.27.3417</dc:relation>
          <dc:relation>10.1.1.120.9902</dc:relation>
          <dc:relation>10.1.1.20.5323</dc:relation>
          <dc:relation>10.1.1.86.8584</dc:relation>
          <dc:relation>10.1.1.3.3888</dc:relation>
          <dc:relation>10.1.1.1.9569</dc:relation>
          <dc:relation>10.1.1.78.4413</dc:relation>
          <dc:relation>10.1.1.142.7059</dc:relation>
          <dc:relation>10.1.1.161.114</dc:relation>
          <dc:relation>10.1.1.143.1242</dc:relation>
          <dc:relation>10.1.1.58.2706</dc:relation>
          <dc:relation>10.1.1.35.8293</dc:relation>
          <dc:relation>10.1.1.85.7061</dc:relation>
          <dc:relation>10.1.1.129.709</dc:relation>
          <dc:relation>10.1.1.16.5260</dc:relation>
          <dc:relation>10.1.1.7.4603</dc:relation>
          <dc:relation>10.1.1.37.2417</dc:relation>
          <dc:relation>10.1.1.37.2641</dc:relation>
          <dc:relation>10.1.1.117.3665</dc:relation>
          <dc:relation>10.1.1.122.6034</dc:relation>
          <dc:relation>10.1.1.11.7594</dc:relation>
          <dc:relation>10.1.1.20.9298</dc:relation>
          <dc:relation>10.1.1.27.4715</dc:relation>
          <dc:relation>10.1.1.94.2340</dc:relation>
          <dc:relation>10.1.1.196.2257</dc:relation>
          <dc:relation>10.1.1.1.2728</dc:relation>
          <dc:relation>10.1.1.58.3869</dc:relation>
          <dc:relation>10.1.1.33.6972</dc:relation>
          <dc:relation>10.1.1.35.4242</dc:relation>
          <dc:relation>10.1.1.28.9399</dc:relation>
          <dc:relation>10.1.1.12.2717</dc:relation>
          <dc:relation>10.1.1.6.61</dc:relation>
          <dc:relation>10.1.1.7.6756</dc:relation>
          <dc:relation>10.1.1.15.4857</dc:relation>
          <dc:relation>10.1.1.58.2087</dc:relation>
          <dc:relation>10.1.1.10.352</dc:relation>
          <dc:relation>10.1.1.110.6845</dc:relation>
          <dc:rights>Metadata may be used without restrictions as long as the oai identifier remains attached to it.</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
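For the quick and dirty route, a response like this can be picked apart with Python’s standard library. The sketch below is only that: the function names (getrecord_url, relations, fetch_relations) are my own, the endpoint and parameters are the ones echoed in the request element of the response, and the live fetch is untested, so I can’t promise the service still answers at that address. The parsing part runs against a cut-down copy of the record.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint and identifier prefix as echoed in the <request> element of the response.
OAI_BASE = "http://citeseerx.ist.psu.edu/oai2"
ID_PREFIX = "oai:CiteSeerX.psu:"
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def getrecord_url(doi):
    """Build an OAI-PMH GetRecord URL for one CiteSeerX document id (hypothetical helper)."""
    params = {"verb": "GetRecord", "metadataPrefix": "oai_dc", "identifier": ID_PREFIX + doi}
    # safe=":" keeps the colons in the identifier readable; percent-encoding them is also valid.
    return OAI_BASE + "?" + urlencode(params, safe=":")


def relations(xml_text):
    """Return the text of every dc:relation element in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind(".//dc:relation", DC_NS) if el.text]


def fetch_relations(doi):
    """Fetch a record over HTTP and return its relations (network call, not exercised here)."""
    with urlopen(getrecord_url(doi)) as resp:
        return relations(resp.read())


# Cut-down copy of the record, for an offline check of the parsing.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
               xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/">
      <dc:title>The structure and function of complex networks</dc:title>
      <dc:relation>10.1.1.109.4049</dc:relation>
      <dc:relation>10.1.1.120.3875</dc:relation>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

print(getrecord_url("10.1.1.122.7284"))
print(relations(SAMPLE))
```

Note that the dc: prefix has to be mapped to the Dublin Core namespace explicitly; a plain search for "relation" would find nothing.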
I’m guessing the dc:relation elements refer to the papers listed on the ‘front page’ of the results for a given paper, that is, they are the most heavily cited papers that cite the target paper?
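One crude way of checking that guess by eye is to turn the first few relation ids into summary page links and compare them with what the HTML page lists. A minimal sketch, assuming the ids slot straight into the summary URL pattern shown earlier; summary_links is a made-up helper name, and the sample ids are just the first three from the record.

```python
# URL pattern taken from the summary page address; treating relation ids as valid doi values is an assumption.
SUMMARY_URL = "http://citeseer.ist.psu.edu/viewdoc/summary?doi="


def summary_links(relation_ids, limit=10):
    """Return summary page URLs for the first `limit` relation ids (hypothetical helper)."""
    return [SUMMARY_URL + rid for rid in relation_ids[:limit]]


sample_ids = ["10.1.1.109.4049", "10.1.1.120.3875", "10.1.1.31.1768"]
links = summary_links(sample_ids, limit=2)
for link in links:
    print(link)
```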
So a few questions that arise:
– what do the different results listings on the HTML pages actually refer to?
– what do the results in the OAI query above relate to?
– is it possible to get a list of all the papers cited/referenced by a target article? (Or failing that, is it possible to get hold of the Active Bibliography relations, which are presumably a subset of the complete set of bibliographic references contained within a paper?)
– is it possible to get a list of all the papers that cite/reference a particular target article?
If you can answer any or all of the above questions, please feel free to post the answer(s) in a comment below…:-)
Related – via @mhawksey – How can you parse xml in Google Refine using jython/python ElementTree [ http://stackoverflow.com/questions/8513709/how-can-you-parse-xml-in-google-refine-using-jython-python-elementtree ]