Using OAI-PMH as a Single Record Level Query Interface to Citeseer

Picking up on a query I raised in Citation Positioning, here’s a quick summary of an online discussion featuring variously @edsu, @epoz, @ostephens and myself (I’m the one who knows absolutely nothing…!)

The context is: can I use the OAI-PMH interface on Citeseer to grab record-level, machine-readable results? Note that I don’t really want to harvest all the Citeseer data, pop it into a database of my own, and then run queries on that; I just want a quick and dirty API so I can make a handful of calls for particular queries in a proof-of-concept hack ;-)

Here’s what the Citeseer HTML page looks like:

A citeseer results page

It has a URL of the form: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.7284

The tabbed results pages have their own URLs:

– Active Bibliography, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=ab
– Co-Citation, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=cc
– Clustered Documents, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=sc
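As a rough sketch of the URL patterns listed above, here is how the page URLs might be generated for a given Citeseer document id. The function names and the tab labels used as dictionary keys are my own inventions for illustration; only the base URL and the `doi` and `type` parameters come from the URLs shown above.

```python
from urllib.parse import urlencode

# Base path taken from the example URLs above.
BASE = "http://citeseer.ist.psu.edu/viewdoc"

# Hypothetical labels mapped onto the "type" codes seen in the tab URLs.
TAB_TYPES = {
    "active_bibliography": "ab",
    "co_citation": "cc",
    "clustered_documents": "sc",
}


def summary_url(doi):
    """URL of the main ('front page') results for a document id."""
    return f"{BASE}/summary?{urlencode({'doi': doi})}"


def tab_url(doi, tab):
    """URL of one of the tabbed results pages for a document id."""
    return f"{BASE}/similar?{urlencode({'doi': doi, 'type': TAB_TYPES[tab]})}"


if __name__ == "__main__":
    print(summary_url("10.1.1.122.7284"))
    for tab in TAB_TYPES:
        print(tab_url("10.1.1.122.7284", tab))
```

This only builds the HTML page URLs; getting data out of those pages would still mean scraping, which is what I'm hoping the OAI-PMH route avoids.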

Here’s what I’m guessing:
– the ‘front page’ results are links to papers that reference/cite the target article, ordered by the number of times that they themselves have been cited; this is a subset of the total set of papers that cite the target article;
– the Active Bibliography is a subset of the articles that are referenced from/cited by the target article that have themselves been recently cited elsewhere (?! I’m guessing – the Citeseer site doesn’t seem to provide an explanation anywhere?)
– the co-citations are… I have no idea? Other papers that have been cited by papers that cite the target paper?
– Clustered Documents – these seem to be other Citeseer records relating to the same paper; do they all have the same citation info? I have no idea.

As far as the OAI interface goes, it seems we can grab an individual record using a query of the form:

http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&identifier=oai:CiteSeerX.psu:10.1.1.122.7284&metadataPrefix=oai_dc

which returns a result of the form:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2011-12-08T16:24:31+00:00</responseDate>
<request identifier="oai:CiteSeerX.psu:10.1.1.122.7284" metadataPrefix="oai_dc" verb="GetRecord">http://citeseerx.ist.psu.edu/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:CiteSeerX.psu:10.1.1.122.7284</identifier>
<datestamp>2009-05-28</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>The structure and function of complex networks</dc:title>
<dc:creator>M. E. J. Newman</dc:creator>
<dc:description>
Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
</dc:description>
<dc:contributor>
The Pennsylvania State University CiteSeerX Archives
</dc:contributor>
<dc:publisher/>
<dc:date>2009-05-28</dc:date>
<dc:date>2008-12-04</dc:date>
<dc:date>2003</dc:date>
<dc:format>application/pdf</dc:format>
<dc:type>text</dc:type>
<dc:identifier>
http://citeseerx.ist.psu.edu/citeseerx/viewdoc/summary?doi=10.1.1.122.7284
</dc:identifier>
<dc:source>
http://www.cs.berkeley.edu/~christos/classics/graphsurvey.pdf
</dc:source>
<dc:language>en</dc:language>
<dc:relation>10.1.1.109.4049</dc:relation>
<dc:relation>10.1.1.120.3875</dc:relation>
<dc:relation>10.1.1.31.1768</dc:relation>
<dc:relation>10.1.1.153.5943</dc:relation>
<dc:relation>10.1.1.37.234</dc:relation>
<dc:relation>10.1.1.18.2720</dc:relation>
<dc:relation>10.1.1.30.6583</dc:relation>
<dc:relation>10.1.1.25.5619</dc:relation>
<dc:relation>10.1.1.104.3739</dc:relation>
<dc:relation>10.1.1.56.6742</dc:relation>
<dc:relation>10.1.1.117.7097</dc:relation>
<dc:relation>10.1.1.15.8793</dc:relation>
<dc:relation>10.1.1.33.1635</dc:relation>
<dc:relation>10.1.1.139.1580</dc:relation>
<dc:relation>10.1.1.30.9552</dc:relation>
<dc:relation>10.1.1.184.8874</dc:relation>
<dc:relation>10.1.1.24.6195</dc:relation>
<dc:relation>10.1.1.16.478</dc:relation>
<dc:relation>10.1.1.31.3763</dc:relation>
<dc:relation>10.1.1.25.7011</dc:relation>
<dc:relation>10.1.1.37.5917</dc:relation>
<dc:relation>10.1.1.84.9512</dc:relation>
<dc:relation>10.1.1.7.1950</dc:relation>
<dc:relation>10.1.1.129.6877</dc:relation>
<dc:relation>10.1.1.25.1360</dc:relation>
<dc:relation>10.1.1.16.1168</dc:relation>
<dc:relation>10.1.1.115.8316</dc:relation>
<dc:relation>10.1.1.143.1502</dc:relation>
<dc:relation>10.1.1.130.1956</dc:relation>
<dc:relation>10.1.1.20.814</dc:relation>
<dc:relation>10.1.1.21.838</dc:relation>
<dc:relation>10.1.1.16.2407</dc:relation>
<dc:relation>10.1.1.23.9684</dc:relation>
<dc:relation>10.1.1.62.7557</dc:relation>
<dc:relation>10.1.1.16.6906</dc:relation>
<dc:relation>10.1.1.2.4033</dc:relation>
<dc:relation>10.1.1.43.7796</dc:relation>
<dc:relation>10.1.1.25.1174</dc:relation>
<dc:relation>10.1.1.10.4509</dc:relation>
<dc:relation>10.1.1.27.3417</dc:relation>
<dc:relation>10.1.1.120.9902</dc:relation>
<dc:relation>10.1.1.20.5323</dc:relation>
<dc:relation>10.1.1.86.8584</dc:relation>
<dc:relation>10.1.1.3.3888</dc:relation>
<dc:relation>10.1.1.1.9569</dc:relation>
<dc:relation>10.1.1.78.4413</dc:relation>
<dc:relation>10.1.1.142.7059</dc:relation>
<dc:relation>10.1.1.161.114</dc:relation>
<dc:relation>10.1.1.143.1242</dc:relation>
<dc:relation>10.1.1.58.2706</dc:relation>
<dc:relation>10.1.1.35.8293</dc:relation>
<dc:relation>10.1.1.85.7061</dc:relation>
<dc:relation>10.1.1.129.709</dc:relation>
<dc:relation>10.1.1.16.5260</dc:relation>
<dc:relation>10.1.1.7.4603</dc:relation>
<dc:relation>10.1.1.37.2417</dc:relation>
<dc:relation>10.1.1.37.2641</dc:relation>
<dc:relation>10.1.1.117.3665</dc:relation>
<dc:relation>10.1.1.122.6034</dc:relation>
<dc:relation>10.1.1.11.7594</dc:relation>
<dc:relation>10.1.1.20.9298</dc:relation>
<dc:relation>10.1.1.27.4715</dc:relation>
<dc:relation>10.1.1.94.2340</dc:relation>
<dc:relation>10.1.1.196.2257</dc:relation>
<dc:relation>10.1.1.1.2728</dc:relation>
<dc:relation>10.1.1.58.3869</dc:relation>
<dc:relation>10.1.1.33.6972</dc:relation>
<dc:relation>10.1.1.35.4242</dc:relation>
<dc:relation>10.1.1.28.9399</dc:relation>
<dc:relation>10.1.1.12.2717</dc:relation>
<dc:relation>10.1.1.6.61</dc:relation>
<dc:relation>10.1.1.7.6756</dc:relation>
<dc:relation>10.1.1.15.4857</dc:relation>
<dc:relation>10.1.1.58.2087</dc:relation>
<dc:relation>10.1.1.10.352</dc:relation>
<dc:relation>10.1.1.110.6845</dc:relation>
<dc:rights>
Metadata may be used without restrictions as long as the oai identifier remains attached to it.
</dc:rights>
</oai_dc:dc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>

I’m guessing the dc:relation elements refer to the papers listed on the ‘front page’ of the results for a given paper, that is, they are the most heavily cited papers that cite the target paper?
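For the proof-of-concept use case, a minimal Python sketch of the single-record call might look like the following. It builds the GetRecord URL in the form shown above and pulls the title and the dc:relation values out of the response using only the standard library. The function names are hypothetical, the endpoint may not always be reachable, and the code makes no assumption about what the dc:relation values actually mean; it just extracts them.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint taken from the example query above.
OAI_BASE = "http://citeseerx.ist.psu.edu/oai2"

# Namespace URIs as they appear in the sample response.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def getrecord_url(doi):
    """Build a GetRecord request URL for a single Citeseer document id."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:CiteSeerX.psu:{doi}",
        "metadataPrefix": "oai_dc",
    }
    # safe=":" keeps the colons in the identifier unescaped, as in the post.
    return f"{OAI_BASE}?{urlencode(params, safe=':')}"


def parse_record(xml_text):
    """Return the title and the list of dc:relation ids from a response."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//dc:title", namespaces=NS)
    relations = [
        el.text.strip()
        for el in root.iterfind(".//dc:relation", NS)
        if el.text and el.text.strip()
    ]
    return {"title": title, "relations": relations}


def fetch_record(doi, timeout=30):
    """Fetch and parse one record (makes a network call)."""
    with urllib.request.urlopen(getrecord_url(doi), timeout=timeout) as resp:
        return parse_record(resp.read())


if __name__ == "__main__":
    record = fetch_record("10.1.1.122.7284")
    print(record["title"], len(record["relations"]))
```

Each id in the returned list could in turn be passed back to `fetch_record` to look up the related paper's own metadata, one call per record, without harvesting the whole archive.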

So a few questions that arise:

– what do the different results listings on the HTML pages actually refer to?
– what do the results in the OAI query above relate to?
– is it possible to get a list of all the papers cited/referenced by a target article? (Or failing that, is it possible to get hold of the Active Bibliography relations, which are presumably a subset of the complete set of bibliographic references contained within a paper?)
– is it possible to get a list of all the papers that cite/reference a particular target article?

If you can answer any or all of the above questions, please feel free to post the answer(s) in a comment below…:-)
