OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Library’ Category

What did you notice for the first time today?

A week late in posting this, and catching up with Brian’s notes on the ILI 2013: Future Technologies and Their Applications workshop we ran last week, along with his follow up – What Have You Noticed Recently? – inspired by not properly paying attention to what I had to say, here are a few of my own reflections on what I heard myself saying at the event, along with additional (minor) comments around the set of ‘resource’ slides I’d prepped for it, though I didn’t refer to many of them…

  • slides 2-6 – some thoughts on getting your eye into some tech trends: OU Innovating Pedagogy reports (2012, 2013), possible data-sources and reports;
  • slides 6-11 – what can we learn from Google Trends and related tools? A big thing: the importance of segmenting your stats; means are often meaningless. The Mothers’ Day example demonstrates two signal causes (in different territories – i.e. different segments) for the compound flowers trend. The Google Correlate example shows how one signal may lead – or lag – another. So the question: do you segment your library data? Do you look for leading or lagging indicators?
  • slides 12-18 – what role should/does/could the library play in developing the reputation of the organisation’s knowledge producers/knowledge outputs, not least as a way of making them more discoverable; this builds on the question of whose role it is to facilitate access to knowledge (along with the question: facilitate access for whom?) – my take is that this fits the role librarians often take of organising an institution’s knowledge.
  • slides 19-27 – what is a library for? Supporting discovery (of what, by whom)? (Helping others) organise knowledge, and gain access to information? Do research?
  • slides 28-30 – the main focus of my own presentation during the main ILI2013 conference (I’ll post the slides/brief commentary in another post): if the information we want to discover is buried in data, who’s there to help us extract or discover the information from within the data?
  • slides 31-32 – sometimes reframing your perception of an organisation’s offerings can help you rethink the proposition, and sometimes using an analogy helps you switch into that frame of mind. So if energy utilities provide “warm house” and “clean, dry clothes” services, rather than gas or electricity, what shift might libraries adopt?
  • slides 33-39 – a few idle idea prompts around the question of just what is it that libraries do, what services do they provide?
  • slide 40 – one of the items from this slide caused a nightmare tangent! The riff started with a trivial observation – a telling off I received for trying to use the camera on my phone to take a photo of a sign saying “no cameras in the library”, with a photocopier as a backdrop (original story). The purpose of this story was two-fold: 1) to get folk into the idea of spotting anachronisms, or situations where one technology is acceptable but an equivalent or alternative is not (and then wonder why, and what fun can be had around that thought;-); 2) to get folk into wondering how users might appropriate technology they have to hand to make their lives easier, even if it “goes against the rules”.
  • slide 41 – a thought experiment that I still have high hopes for in the right workshop setting…! if you overheard someone answer a question you didn’t hear with the phrase “did you try the library?”, what might the question be? You can then also pivot the question to identify possible competitors; for example, if a sensible answer to the same question is “did you try Amazon?”, Amazon might be a competitor for the delivery of that service.
  • slide 42 – this can lead on from the previous slide, either directly (replace “library” with “Amazon” or “Google”), or as a way of generating ideas about how else a service might be delivered.

Slide not there – a riff on the question of: what did you notice for the first time today? This can be important for trend spotting – it may signify that something is becoming mainstream that you hadn’t appreciated before. To illustrate, I’ve started trying to capture the first time I spot tech in the wild with a photo, such as this one of an Amazon locker in a Co-Op in Cambridge, or a noticing from the first time I saw video screens on the Underground.

As with many idea generating techniques, things can be combined. For example, having introduced the notion of Amazon lockers, we might then ask: so what use might libraries make of such a system, or thing? Or if such things become commonplace, how might this affect or influence the expectations of our users??

Written by Tony Hirst

October 22, 2013 at 5:30 pm

Posted in Library, Presentation


via OER-DISCUSS – Notes on Copyright

I thought this was handy, from the OER-DISCUSS mailing list:

Our copyright officer writes:

… US Copyright ‘Fair Use’, or S.29 copying for non-commercial research and private study, allows copying, but the key word here is ‘private’ – i.e. the proviso is that you don’t make the work or copies available to anyone else.

Although there are UK Exceptions for education, they are very limited or obsolete.
S.32 (1) and (2A) do have the proviso “is not done by reprographic process” which basically means that any copying by any mechanical means is excluded, i.e. you may only copy by hand.

S36 educational provision in law for reprographic copying is
a) only applicable to passages in published works, i.e. books, journals etc, and
b) negated because licences are now available – S.36(3)

S.32 (2) permits only students studying courses in making Films or Film soundtracks to copy Film, broadcasts or sound recordings.

The only educational exception students can rely on is s.32(3) for Examination, although this also is potentially restrictive. For the exception to apply, the work must count towards their final grade/award, and any further dealing with the work after the examination process becomes infringement.

I’m not sure how they are using Voicethread, but if the presentations are part of their assessed coursework and only available to students, staff and examiners on the course, they may use any Copyright protected content, provided it’s all removed from availability after the assessment (not sure how this works with cloud applications though)

There is also exception s.30 for Criticism or Review, which is a general exception for all, provided the copying is necessary for a genuine critique or review of the work.

If the students can’t rely on the last 3 exceptions, using Copyright free or licenced material (e.g. Creative Commons) would be highly recommended.

Kate Vasili – Copyright Officer, Middlesex University, Sheppard Library

Written by Tony Hirst

March 16, 2013 at 11:50 am

Open Research Data Processes: KMi Crunch – Hosted RStudio Analytics Environment

One of the possible barriers to widespread adoption of open notebook science is knowing where to start. Video reports of lab experiments hosted on Youtube can be easily embedded in a hosted WordPress blog; a MediaWiki wiki can be used to provide one page per experiment, with change tracking/history on each page and a shadow page for commentary and discussion; Github can be used to provide a version control environment for software code, results data, project pages and documentation. For tabulated data, Google Spreadsheets provides a hosting environment and an API that lets you treat the data as a database and also explore it dashboard style via a range of interactive visual filtering and charting components. Alternatively, a CKAN instance (such as is used to run thedatahub.org) offers data management and preview tools.

Keeping track of data analysis in an open way is also getting easier. In An R-chitecture for Reproducible Research/Reporting/Data Journalism, I briefly mentioned RPubs.com, a site that can be used to 1-click publish HTML reports of statistical analyses executed within the RStudio environment (I really need to do a proper post about this). But now there’s an example of another hosted solution from Fridolin Wild of the OU’s KMi: Crunch.

Crunch offers a hosted RStudio environment (so you can access RStudio via a browser) with public and private areas. The public areas allow you to post datasets, run scripts as a service, or publish results (Sweave generated PDFs, or knitr generated HTML reports, for example).

Crunch also incorporates a MySQL database for each user. (Scheduling and pipelining are also on the cards…)

Whilst developed as an application to support learning analytics (I think?), Crunch also provides a great demonstration of a more general open research data workbench. You can store – and publish – data sets, along with analysis scripts and reports generated by executing those scripts over your data set. Version control isn’t available at the moment (I think?) but RStudio does have git/github support, so that may be coming. The provision of a MySQL database means that data collections can be managed within a database environment. (From a data journalism, rather than an open/reproducible research, perspective, I did wonder whether it would be possible to situate something like Scraperwiki on the same platform and replace its SQLite support with MySQL support, so that a Scraperwiki scraper could be used to scrape data into a MySQL database that was then accessed from RStudio? Being able to wire MySQL read/write access into Google Refine on the same platform could also be interesting… ;-)

I’m not sure about the extent to which the OU Library is taking an interest in the development of Crunch, but providing best practice support and advice in the orchestration of information and data handling tools seems to me to be in-scope for the academic research librarian, in much the same way as advising on the use of bibliography data management tools used to be…? (For a recent take on this, see Dorothea Salo’s recent Ariadne article Retooling Libraries for the Data Challenge.)

Written by Tony Hirst

August 23, 2012 at 10:24 am

jibs.ac.uk AGM Keynote – Revisiting the 5 Laws…

I had the honour of being invited to talk at the JIBS User Group 20th Anniversary AGM yesterday, and as well as having a bit of a rant in the closing plenary about opening up and making internal reuse of data and making FOI requests about SCONUL data*, I also gave this sideways take on Ranganathan’s Five Laws of Library Science for the current age (The Frictionless Library).

Amongst other things, the presentation sketches a possible project (that I think could make for a good workshop day) revisiting each of the laws in network context using the various techniques of constitutional interpretation and (briefly) revisits at least one of the notions of the Invisible Library (see also The Invisible Library (ILI, 2009), another meaningless set of slides…;-)

* Note to self: read up about the current HESA HE Information Landscape Project (Redesigning the higher education data and information landscape). Also check out the “KB+” JISC project (programme?) that will “develo[p] a shared community service that will improve the quality, accuracy, coverage and availability of data for the management, selection, licensing, negotiation, review and access of electronic resources for UK HE” (via @benshowers) and the Talis Aspire Community Edition (aggregated reading lists across several HEIs).

PS I’m working out how to make the slides a little bit more useful as a post hoc/legacy resource by posting them with a bit of context and commentary… But it may take a bit of time…

PPS on the way home, I listened to this Long Now Foundation seminar by Brewster Kahle on Universal Access to All Knowledge, which got me wondering about the extent to which University libraries are depositing resources into the Internet Archive..? There’s a nice piece at the end that makes the point that IPR is such that in terms of the digital record, there’s likely to be a gap in the timeline of archived content right around the 20th century…

PPPS as far as library futures go, here’s a loosely related Roadmapping TEL activity on “Ideas that influence the future of technology enhanced learning” that is currently running on Ideascale.

There were also several discussions during the day relating to information skills needs for 21st century librarians. Some of the ANCIL reports from the Arcadia project on a new information literacy curriculum may be of interest to JIBS members in this regard, I think? Arcadia Project Report

I think there’s a real need for librarians to help folk make sense of the wealth of data out there, and this in part requires a good understanding of network structures and organisations, not just a concentration on hierarchical models.

Hear (sic) also, for example, OU Vice Chancellor Martin Bean on ‘sensemaking’ and the role of the library from his 2010 ALT-C Keynote:

I think it’s also time to start seeing people as information and knowledge resources, as well as just texts…

Written by Tony Hirst

February 25, 2012 at 1:58 pm

Posted in Library, Presentation

Invisible Library Support – Now You Can’t Afford Not to be Social?

If you live by pop tech feed or Twitter, you’ve probably heard that Google is rolling out a new style of socially powered search results. If not, or if you’re still not clear about what it entails, read Phil Bradley’s post on the matter: Why Google Search Plus is a disaster for search.

Done that? If not, why not? This post isn’t likely to make much sense at all if you don’t know the context. Here’s the link again: Why Google Search Plus is a disaster for search

So the starting point for this post is this: Google is in the process of rolling out a new web search service that (optionally) offers very personal search results that contain content from folk that Google thinks you’re associated with, and that Google is willing to show you based on license agreements and corporate politics.

Think about this for a minute… in the totally personalised view, folk will only see content that their friends have published or otherwise shared…

In Could Librarians Be Influential Friends?, I wondered aloud whether it made sense for librarians and other folk involved with providing support relating to resource discovery and recommendation to start a) creating social network profiles and encouraging their patrons to friend them, and b) recommending resources using those profiles in order to start influencing the ordering/ranking of results in patrons’ search results based on those personal recommendations. The idea here was that you could start to make invisible, frictionless recommendations by influencing the search engine results returned to your patrons (the results aren’t quite invisible, because your profile picture may appear by a result showing that you recommend it; they’re frictionless in the sense that, having made the original recommendation, you no longer have to do any work in trying to bring it to the attention of your patron – the search engines take care of that for you (okay, I know that’s a simplistic view;-)). [Hmm.. how about referring to it as recommendation mode support?]

(Note that there is a complementary form of support, which I’ve previously referred to as Invisible Library Tech Support (responsive mode support? – which I guess is also frictionless, at least from the perspective of the patron), in which librarians friend their patrons or monitor generic search terms/tags on Q&A sites and then proactively respond to requests that users post into their social networks more generally.)

With the aggressive stance Google now seems to be taking towards pushing social circle powered results, I think we need to face up to the fact – as Phil Bradley pointed out – that if librarians want to make sure they’re heard by their patrons, they’re going to need to start setting up social profiles, getting their patrons to friend them, and start making content and resource recommendations anyway, in order to make them available as resources that are indexed by patrons’ personal search engines. The same goes for publishers of OERs, academic teaching staff, and “courses”.

If we think of Google social search as searching over custom search engines bounded by resources created and recommended by members of a user’s social circle, then if you want to make (invisible) recommendations to a user via their (personalised) web search results, you’re going to need to make sure that the resources/content you want to recommend are indexed by their personal search engine. Which means: a) you need to friend them; and b) you need to share that content/those resources in that social context.

(Hmmm… this makes me think there may be something in the course custom search engine approach after all… Specifically, if the course has a social profile, and recommends the links contained within the course via that profile, they become part of the personalised search index of students following that course profile?)

Just by the by, as another example of Google completely messing things up at the moment, I notice that when I share links to posts on this blog via Google+, they don’t appear as trackbacks to the post in question. Which means that if someone refers to a post on this blog on Google+, I don’t know about it… whereas if they blog the link, I do…

See also my chronologically ordered posts on the eroding notion of “Google Ground Truth”.

[Invisible vs frictionless (and various notions of that word) is all getting a bit garbled; see eg @briankelly's Should Higher Education Welcome Frictionless Sharing and my comments to it for a little more on this...]

PS I’ve been getting increasingly infuriated by the clutter around, and lack of variation within, Google search results lately, so I changed my default search engine to Bing. The results are a bit all over the place compared to the Google results I tend to get, but this may be down in part to personalisation/training. I am still making occasional forays to Google, but for now, Bing is it… (because Bing is not Google…)

PPS Hah – just noticed: Google Search Plus doesn’t mean plus in the sense of search more, it means search Google+, which is less, or minus the wider world view…;-)

PPPS I keep meaning to blog this, and keep forgetting: Turn[ing] off [Google] search history personalization, in particular: “If you’ve disabled signed-out search history personalization, you’ll need to disable it again after clearing your browser cookies. Clearing your Google cookie clears your search settings, thereby turning history-based customizations back on.” Which is to say, when you disable personalisation, you don’t disable personalisation against your Google account, you disable it only insofar as it relates to your current cookie ID?

Written by Tony Hirst

January 13, 2012 at 1:35 pm

Using OAI-PMH as a Single Record Level Query Interface to Citeseer

Picking up on a query I raised in Citation Positioning, here’s a quick summary of an online discussion featuring variously @edsu, @epoz, @ostephens and myself (I’m the one who knows absolutely nothing…!)

The context is: can I use the OAI-PMH interface on Citeseer to grab record level, machine readable results from Citeseer? Note that I don’t really want to harvest all the Citeseer data, pop it into a database of my own, and then run queries on that; I just want a quick and dirty API to make a handful of calls to particular queries for a proof of concept hack;-)

Here’s what the Citeseer HTML page looks like:

A citeseer results page

It has a URL of the form: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.122.728

The tabbed results pages have their own URLs:

- Active Bibliography, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=ab
- Co-Citation, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=cc
- Clustered Documents, of the form http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.122.7284&type=sc

Here’s what I’m guessing:
- the ‘front page’ results are links to papers that reference/cite the target article, ordered by the number of times that they themselves have been cited; this is a subset of the total set of papers that cite the target article;
- the Active Bibliography is a subset of the articles that are referenced from/cited by the target article that have themselves been recently cited elsewhere (?! I’m guessing – the Citeseer site doesn’t seem to provide an explanation anywhere?)
- the co-citations are… I have no idea? Other papers that have been cited by papers that cite the target paper?
- Clustered Documents – these seem to be other Citeseer records relating to the same paper; do they all have the same citation info? I have no idea?????

As far as the OAI interface goes, it seems we can grab an individual record using a query of the form:

http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&identifier=oai:CiteSeerX.psu:10.1.1.122.7284&metadataPrefix=oai_dc

which returns a result of the form:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2011-12-08T16:24:31+00:00</responseDate>
<request identifier="oai:CiteSeerX.psu:10.1.1.122.7284" metadataPrefix="oai_dc" verb="GetRecord">http://citeseerx.ist.psu.edu/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:CiteSeerX.psu:10.1.1.122.7284</identifier>
<datestamp>2009-05-28</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>The structure and function of complex networks</dc:title>
<dc:creator>M. E. J. Newman</dc:creator>
<dc:description>
Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
</dc:description>
<dc:contributor>
The Pennsylvania State University CiteSeerX Archives
</dc:contributor>
<dc:publisher/>
<dc:date>2009-05-28</dc:date>
<dc:date>2008-12-04</dc:date>
<dc:date>2003</dc:date>
<dc:format>application/pdf</dc:format>
<dc:type>text</dc:type>
<dc:identifier>

http://citeseerx.ist.psu.edu/citeseerx/viewdoc/summary?doi=10.1.1.122.7284

</dc:identifier>
<dc:source>

http://www.cs.berkeley.edu/~christos/classics/graphsurvey.pdf

</dc:source>
<dc:language>en</dc:language>
<dc:relation>10.1.1.109.4049</dc:relation>
<dc:relation>10.1.1.120.3875</dc:relation>
<dc:relation>10.1.1.31.1768</dc:relation>
<dc:relation>10.1.1.153.5943</dc:relation>
<dc:relation>10.1.1.37.234</dc:relation>
<dc:relation>10.1.1.18.2720</dc:relation>
<dc:relation>10.1.1.30.6583</dc:relation>
<dc:relation>10.1.1.25.5619</dc:relation>
<dc:relation>10.1.1.104.3739</dc:relation>
<dc:relation>10.1.1.56.6742</dc:relation>
<dc:relation>10.1.1.117.7097</dc:relation>
<dc:relation>10.1.1.15.8793</dc:relation>
<dc:relation>10.1.1.33.1635</dc:relation>
<dc:relation>10.1.1.139.1580</dc:relation>
<dc:relation>10.1.1.30.9552</dc:relation>
<dc:relation>10.1.1.184.8874</dc:relation>
<dc:relation>10.1.1.24.6195</dc:relation>
<dc:relation>10.1.1.16.478</dc:relation>
<dc:relation>10.1.1.31.3763</dc:relation>
<dc:relation>10.1.1.25.7011</dc:relation>
<dc:relation>10.1.1.37.5917</dc:relation>
<dc:relation>10.1.1.84.9512</dc:relation>
<dc:relation>10.1.1.7.1950</dc:relation>
<dc:relation>10.1.1.129.6877</dc:relation>
<dc:relation>10.1.1.25.1360</dc:relation>
<dc:relation>10.1.1.16.1168</dc:relation>
<dc:relation>10.1.1.115.8316</dc:relation>
<dc:relation>10.1.1.143.1502</dc:relation>
<dc:relation>10.1.1.130.1956</dc:relation>
<dc:relation>10.1.1.20.814</dc:relation>
<dc:relation>10.1.1.21.838</dc:relation>
<dc:relation>10.1.1.16.2407</dc:relation>
<dc:relation>10.1.1.23.9684</dc:relation>
<dc:relation>10.1.1.62.7557</dc:relation>
<dc:relation>10.1.1.16.6906</dc:relation>
<dc:relation>10.1.1.2.4033</dc:relation>
<dc:relation>10.1.1.43.7796</dc:relation>
<dc:relation>10.1.1.25.1174</dc:relation>
<dc:relation>10.1.1.10.4509</dc:relation>
<dc:relation>10.1.1.27.3417</dc:relation>
<dc:relation>10.1.1.120.9902</dc:relation>
<dc:relation>10.1.1.20.5323</dc:relation>
<dc:relation>10.1.1.86.8584</dc:relation>
<dc:relation>10.1.1.3.3888</dc:relation>
<dc:relation>10.1.1.1.9569</dc:relation>
<dc:relation>10.1.1.78.4413</dc:relation>
<dc:relation>10.1.1.142.7059</dc:relation>
<dc:relation>10.1.1.161.114</dc:relation>
<dc:relation>10.1.1.143.1242</dc:relation>
<dc:relation>10.1.1.58.2706</dc:relation>
<dc:relation>10.1.1.35.8293</dc:relation>
<dc:relation>10.1.1.85.7061</dc:relation>
<dc:relation>10.1.1.129.709</dc:relation>
<dc:relation>10.1.1.16.5260</dc:relation>
<dc:relation>10.1.1.7.4603</dc:relation>
<dc:relation>10.1.1.37.2417</dc:relation>
<dc:relation>10.1.1.37.2641</dc:relation>
<dc:relation>10.1.1.117.3665</dc:relation>
<dc:relation>10.1.1.122.6034</dc:relation>
<dc:relation>10.1.1.11.7594</dc:relation>
<dc:relation>10.1.1.20.9298</dc:relation>
<dc:relation>10.1.1.27.4715</dc:relation>
<dc:relation>10.1.1.94.2340</dc:relation>
<dc:relation>10.1.1.196.2257</dc:relation>
<dc:relation>10.1.1.1.2728</dc:relation>
<dc:relation>10.1.1.58.3869</dc:relation>
<dc:relation>10.1.1.33.6972</dc:relation>
<dc:relation>10.1.1.35.4242</dc:relation>
<dc:relation>10.1.1.28.9399</dc:relation>
<dc:relation>10.1.1.12.2717</dc:relation>
<dc:relation>10.1.1.6.61</dc:relation>
<dc:relation>10.1.1.7.6756</dc:relation>
<dc:relation>10.1.1.15.4857</dc:relation>
<dc:relation>10.1.1.58.2087</dc:relation>
<dc:relation>10.1.1.10.352</dc:relation>
<dc:relation>10.1.1.110.6845</dc:relation>
<dc:rights>
Metadata may be used without restrictions as long as the oai identifier remains attached to it.
</dc:rights>
</oai_dc:dc>
</metadata>
</record>
</GetRecord>
</OAI-PMH>

I’m guessing the dc:relation elements refer to the papers listed on the ‘front page’ of the results for a given paper, that is, they are the most heavily cited papers that cite the target paper?
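
For what it’s worth, here’s a minimal sketch of how those dc:relation identifiers might be pulled out of a GetRecord response like the one above – using the Python requests library and ElementTree rather than a dedicated OAI-PMH client, and assuming the endpoint and identifier pattern shown in the query above:

# A minimal sketch: fetch a single CiteSeerX record over OAI-PMH and list the
# dc:relation identifiers it contains. Assumes the endpoint/identifier pattern
# shown above; no error handling beyond raise_for_status().
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://citeseerx.ist.psu.edu/oai2"
DC = "{http://purl.org/dc/elements/1.1/}"

def get_relations(doi):
    params = {
        "verb": "GetRecord",
        "identifier": "oai:CiteSeerX.psu:" + doi,
        "metadataPrefix": "oai_dc",
    }
    resp = requests.get(OAI_ENDPOINT, params=params)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    # Collect the text of every dc:relation element in the record
    return [el.text.strip() for el in root.iter(DC + "relation") if el.text]

for rel in get_relations("10.1.1.122.7284"):
    print(rel)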

So a few questions that arise:

- what do the different results listings on the HTML pages actually refer to?
- what do the results in the OAI query above relate to?
- is it possible to get a list of all the papers cited/referenced by a target article? (Or failing that, is it possible to get hold of the Active Bibliography relations, which are presumably a subset of the complete set of bibliographic references contained within a paper?)
- is it possible to get a list of all the papers that cite/reference a particular target article?

If you can answer any or all of the above questions, please feel free to post the answer(s) in a comment below…:-)

Written by Tony Hirst

December 8, 2011 at 5:20 pm

Posted in Anything you want, Library


What’s Inside a Book?

A couple of months ago, when I started looking at the idea of emergent social positioning in online social networks, I was focussing on trying to model the positioning of certain brands and companies, in part with a view to trying to identify ones that were associated with innovation, or future thinking in some way.

Based on absolutely no evidence at all, I surmised that one useful signal in this regard might be the context in which companies or brands are mentioned in popular, MBA-related business books, the sort of thing that Harvard Business Review publish, for example.

Here’s how my thinking went then:

- generate a bipartite network graph that connects the book’s index terms with page numbers of the pages they appear on based on the index entries* in a given book. A bipartite graph is one that contains two sorts or classes of node (in this case, index term nodes and book page number nodes). The index terms are likely to include companies, brands, people and ideas/concepts. Sometimes, particular index terms may be identified as companies, names, etc, through presentational mark up – a bold font, or italics, for example. These presentational conventions can often be mapped onto semantic equivalents. Terms might also be passed through something like the Reuters’ Open Calais service, or TSO’s Data Enrichment Service.

- collapse the network graph by generating links between things that are connected to the same page number and remove the page number nodes from the graph. You now have a graph that connects brands, people and other index terms with each other, where edges represent the relation “is on the same page in the same book as”. If companies and other index terms appear on several pages together, we might reflect this by increasing the weight of the edge that connects them, for example by using edge weight to represent the number of pages where the two terms co-exist.

(*This will be obvious to some, but not to others. To a certain extent, a book index provides a faceted/search term limited search engine interface to a book, that returns certain pages as results to particular queries…)
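
To make those two steps a little more concrete, here’s a quick sketch of the bipartite graph construction and collapse using the Python networkx library; the toy index dict is made up purely for illustration:

# Sketch: bipartite index-term/page-number graph, collapsed to a term-term
# graph weighted by the number of shared pages. Toy data for illustration only.
import itertools
from collections import Counter
import networkx as nx

# index term -> pages it appears on (hypothetical index entries)
index = {
    "Apple": [12, 45, 46],
    "innovation": [12, 101],
    "supply chains": [45, 78],
}

B = nx.Graph()
for term, pages in index.items():
    for page in pages:
        B.add_edge(("term", term), ("page", page))

# Collapse: connect terms that co-occur on a page, weight edges by the number
# of shared pages, and drop the page nodes.
weights = Counter()
for page in [n for n in B if n[0] == "page"]:
    for a, b in itertools.combinations(sorted(B[page]), 2):
        weights[(a, b)] += 1

G = nx.Graph()
for (a, b), w in weights.items():
    G.add_edge(a[1], b[1], weight=w)

print(G.edges(data=True))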

Note that we can generate a network for a specific book, in which case we can render a graphical summary of the content, relations within and structure of that book, or we can generate more comprehensive networks that summarise the index term relations across several books.

My thinking then was that if we can grab the indexes of a set of business books, we could map which companies and brands were being associated either with each other or with particular concepts in MBA land.

Which is where the problem lies – because I haven’t found anywhere where I can readily get hold of the indexes of business books in a sensible machine readable format. Given an electronic copy of a book, I guess I could run some text processing algorithms over it looking for word pairs in close association with each other and generating my own view over the book. But the reason for using an actual book index is at least twofold: firstly, because there has presumably been a quality process that determines what terms are entered into the index; secondly, because the index, if used by a human reader, will be influencing which parts of the book (and hence which related terms) they will be exposed to.

(It’s maybe also worth noting that books also contain a lot of other structured metadata – tables of contents, lists of figures, titles, headings, subheadings, emphasis, lists, captions, and so on, all of which provide cues as to how the book is structured and how ideas and entities contained within it relate to each other.)

As to why I’m posting this now? I first floated this idea with @edchamberlain following a JISC bibliography data event, and he reminded me of it at the Arcadia Project review a couple of days ago ;-)

Related, sort of: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags, which looked at mapping corporate mentions in the BBC/OU co-pro business programme The Bottom Line:

First attempt at tagging BBC/OU 'The Bottom Line' progs using opencalais

Also Citation Positioning.

PS this is clever – and related – via @ostephens: http://www.eatyourbooks.com/ (“‘Tell us which books you own’ We have indexed the most popular cookbooks & magazines so recipes become instantly searchable.”).

Written by Tony Hirst

December 8, 2011 at 4:18 pm

Citation Positioning

It’s been years and years since I did either a formal literature review, or used a reference manager like EndNote or RefWorks in anger, but whilst at the Arcadia Project review in Cambridge a couple of days ago, I started wondering what sorts of ‘added value’ features I’d like to see, maybe even expect, from referencing software nowadays…

One of the ideas I’ve been playing with recently is the idea of emergent social positioning (ESP;-) in online social networks, which I’m defining in terms of where an individual or an expression of a particular interest group might be positioned in terms of the socially projected interests of people following that person or interest group.

For the case of an individual, the approach I’m taking is to look at who the followers of that individual follow to any great extent; for the case of an interest group, as evidenced by users of a particular hashtag, for example, it might be to look at who the followers of the users of the hashtag also follow in significant numbers.

A slightly more constrained approach might be to look at how the followers of the individual or the hashtag users follow each other (a depth 1.5 follower network about an individual or set of individuals, in effect).

So for example, here’s a map I just grabbed of folk who are followed by 3 or more people from a sampling of the followers of recent users of the #gdslaunch (Government Digital Service launch) hashtag.

In the vicinity of #gdslaunch

So what does this have to do with reference managers? Let’s start with a single academic paper (the ‘target’ paper), that contains a list of references to other works. If we can easily grab the reference lists from all those works, we can generate a depth 1.5 reference map that shows how the works referenced in the first paper reference each other. Exploring the structural properties of this map may help us better understand the support basis for the ideas covered in our target paper.

By looking at the depth 2 reference network (that is, the network that shows references included in the target paper, and all their references), we may be able to discover additional (re)sources relevant to the target paper.
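
As a sketch of what that might look like in code terms – assuming a hypothetical get_references() helper that, given a paper identifier, returns the list of works it cites (which is the hard part, as noted below) – the depth 1.5 map might be built up along these lines:

# Sketch: depth 1.5 reference map - keep only the references of cited papers
# that point back into the target paper's own reference list. get_references()
# and the toy data are purely illustrative placeholders.
import networkx as nx

def get_references(paper_id):
    toy = {
        "target": ["a", "b", "c"],
        "a": ["b", "x"],
        "b": ["c"],
        "c": [],
    }
    return toy.get(paper_id, [])

def depth_1_5_reference_map(target):
    cited = set(get_references(target))
    G = nx.DiGraph()
    for paper in cited:
        G.add_edge(target, paper)
        for ref in get_references(paper):
            if ref in cited:
                G.add_edge(paper, ref)
    return G

print(depth_1_5_reference_map("target").edges())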

Unfortunately, getting free and easy machine readable access to the lists of references contained within journal articles, conference papers and books is not trivial. There are patchy services such as CiteSeer, Citebase or opencitations.net, but I don’t think services like Mendeley, Zotero or CiteUlike are yet expressing this sort of data? Or maybe they are, and I’m missing a trick somewhere.

(Just by the by, presumably some of the commercial citation services have APIs that support at least accessing this data? If you know of any, could you add a link in the comments please?:-)

Another hack I’d like to try is to generate what more closely corresponds to the social positioning idea, which is to grab the references from a target paper, and then the papers that cite those references and see how they all link together. This would help position the target paper in the space of other papers referencing similar works. I think CiteSeer has this sort of functionality, though not in a graphical form?

PS on my to do list is seeing whether I can get reference lists for articles out of Citeseer using the Citeseer OAI-PMH endpoint. I’ve got as far as installing the pyoai Python library, but not had time to try it out yet. If anyone knows of a guide to OAI for complete novices, ideally with pyoai examples I can crib from, please post a link (or some examples) via the comments:-)

Written by Tony Hirst

December 8, 2011 at 1:47 pm

Posted in Infoskills, Library


Notes on Custom Course Search Engines Derived from OU Structured Authoring Documents

Over the last few days, I’ve been tinkering with OU Structured Authoring documents, XML docs from which OU course materials – both print and HTML – are generated (you can get an idea about what they look like from OpenLearn: find a course page with a URL of the form http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&direct=1 and change direct to content: http://openlearn.open.ac.uk/mod/oucontent/view.php?id=397337&content=1; h/t to Colin Chambers for that one;-). I’ve been focussing in particular on the documents used to describe T151, an entry level online course I developed around all things gaming (culture, business, design and development), and the way in which we can automatically generate custom search engines based on these documents.

The course had a very particular structure – weekly topic explorations framed as a preamble, set of guiding questions, suggested resources (organised by type) and a commentary, along with a weekly practical session.

One XML doc was used per week, and was used to generate the separate HTML pages for each week’s study.

One of the experimental components of the course has been a Google Custom Search Engine, that supports searches over external resources that are linked to from the blog. The course also draws heavily on the Digital Worlds Uncourse blog, a site used to scope out the design of the course, as well as draft some of the materials used within it, and the CSE indexes both that site and the sites that are linked from it. (See eSTEeM Project: Custom Course Search Engines and Integrating Course Related Search and Bookmarking? for a little more context around this.)

Through using the course custom search engine myself, I have found several issues with it:

1) with a small index, it’s not really very satisfactory. If you only index exact pages that are linked to from the site, it can be quite hard getting any hits at all. A more relaxed approach might be to index the domains associated with resources, and also include the exact references explicitly with a boosted search rank. At the current time, I have code that scrapes external links from across the T151 course materials and dumps them into a single annotations file (the file that identifies which resources are included in the CSE) without further embellishment. I also have code that identifies the unique domains that are linked to from the course materials and which can also be added to the annotations file. On the to do list is to annotate the resources with labels that identify which topic they are associated with so we can filter results by topic.

2) the Google Custom Search Engines seem to behave very oddly indeed. In several of my experiments, one word queries often returned few results, while more specific queries building on the original search term delivered more and different results. This gives a really odd search experience, and one that I suspect would put many users off.

3) I’ve been coming round more and more to the idea that the best way of highlighting course resources in a search context is through the use of Subscribed Links, that a user can subscribe to and that then appear in their Google search results if there is an exact query match. Unfortunately, Google pulled the Subscribed Links service in September (A fall spring-clean; for an example of what’s been lost, see e.g. Stone Temple Consulting: Google Co-Op Subscribed Links).

4) The ability to feed promotions into the top of the CSE results listing is attractive (up to 3 promoted links can be displayed for any given query), but the automatic generation of query terms is problematic. Promotion definitions take the form:

<Promotion image_url="http://kmi.open.ac.uk/images/ou-logo.gif"
  	title="Week 4"
  	id="T151_Week4"
  	queries="week 4,T151 week 4,t151 week 4"
  	url="http://www.open.ac.uk"
  	description="Topic Exploration 4A - An animated experience Topic exploration 4B - Flow in games "/>

Course CSE - week promotion

There are several components we need to consider here:

  1. queries: these are the phrases that are used to trigger the display of particular promotion links. Informal testing suggests that where multiple promotions are triggered by the same query, the order in which they are defined in the Promotions file determines the order in which they appear in the results. Note that at most three (3) promotions can be displayed for any query. Queries may be based around structural components (such as study week or topic number), subject matter terms (e.g. tags, keywords, or headings) and resource type (e.g. audio/video material, academic readings etc), although we might argue that resource type is not such a meaningful distinction (just because we can make it doesn’t mean we should!). In the T151 materials, presentation conventions provide us with a way of extracting structural components and using these to seed the promotions file. Identifying keywords or phrases is more challenging: students are unlikely to enter search phrases that exactly match section or subsection headings, so some element of term extraction would be required in order to generate query terms that are likely to be used.
  2. title: this is what catches the attention, so we need to put something sensible in here. There is a limit of 160 characters on the length of the title.
  3. description: the description allows us to elaborate on the title. There is a limit of 200 characters on the length of the description.
  4. url: the URL is required but not necessarily ‘used’ by our promotion. That is, if we are using the promotion for informational reasons, and not necessarily wanting to offer a click through, the link may be redundant. (However, the CSE still requires it to be defined.) Alternatively, we might argue that a click through action should always be generated, in which case it might be worth considering whether we can generate a click through to a more refined query on the CSE itself?

Where multiple promotions are provided, we need to think about:
a) how they are ordered;
b) what other queries they are associated with (i.e. their specificity);
c) where they link to.

In picking apart the T151 structured authoring documents, I have started by focussing on the low hanging fruit when it comes to generating promotion links. Looking through the document format, it is possible to identify topics associated with separate weeks and questions associated with particular topics. This allows us to build up a set of informational promotions that allow the search engine to respond to queries of what we might term a navigational flavour. So for example, we can ask what topics are covered in a particular week (I also added the topic query as a query for questions related to a particular topic):

Course CSE - multiple promotions

Or what a particular question is within a particular topic:

Course CSE - what’s the question?

The promotion definitions are generated automatically and are all very procedural. For example, here’s a fragment from the definition of the promotion from question 4 in topic 4A:

<Promotion 
  	title="Topic Exploration 4A Question 4"
  	queries="topic 4a q4,T151 topic 4a q4,t151 topic 4a q4,topic 4a,T151 topic 4a,t151 topic 4a"
  	... />

The queries this promotion will be displayed for are all based around navigational structural elements. This requires some knowledge of the navigational query syntax, and also provides an odd user experience, because the promotions only display on the main CSE tab, and the organic results from indexed sites turn up all manner of odd results for queries like “week 3” and “topic 1a q4”… (You can try out the CSE here.)
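
For what it’s worth, the procedural generation of these navigational promotions might look something like the following sketch. The element structure and queries mirror the Promotion fragments shown above; the function name, the toy data and the Promotions root element name are assumptions for illustration rather than the actual generator code:

# Sketch: generate a navigational Promotion element of the kind shown above
# from (topic, question) structural data. Toy values, helper name and the
# Promotions root element name are assumptions.
from xml.etree.ElementTree import Element, tostring

def promotion_for_question(topic, question_number, description, url):
    # Seed the trigger queries from navigational/structural elements, as above
    base = "topic {t} q{q}".format(t=topic.lower(), q=question_number)
    topic_only = "topic " + topic.lower()
    queries = [base, "T151 " + base, "t151 " + base,
               topic_only, "T151 " + topic_only, "t151 " + topic_only]
    return Element("Promotion", {
        "id": "T151_Topic{t}_Q{q}".format(t=topic, q=question_number),
        "title": "Topic Exploration {t} Question {q}".format(t=topic, q=question_number),
        "queries": ",".join(queries),
        "url": url,
        "description": description[:200],  # CSE limits descriptions to 200 chars
    })

root = Element("Promotions")
root.append(promotion_for_question("4A", 4, "Flow in games", "http://www.open.ac.uk"))
print(tostring(root, encoding="unicode"))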

The promotions I have specified so far also lack several things:

1) queries based on the actual question description, so that a query related to the question might pull the corresponding promotion into the search results (would that be useful?)

2) a sensible link. At the moment, there is no obvious way in the SA document of identifying one or more resources that particularly relate to a specific question. If there was such a link, then we could use that information to automatically associate a link with a question in the corresponding promotions element. (The original design of the course imagined the Structured Authoring document itself being constructed automatically from component parts. In particular, it was envisioned that suggested links would be tagged on a social bookmarking service and then automatically pulled into the appropriate area of the Structured Authoring document. Resources could then be tagged in a way that associates them with one or more questions (or topics), either directly though a question ID, or indirectly through matching subject tags on a question and on a resource. The original model also considered the use of “suggested search queries” that would be used to populate suggested resources lists with results pulled in live from a (custom) search engine…)

At the moment, it is possible to exploit the T151 document structure to generate these canned navigational queries. The question now is: are promotional links a useful feature, and how might we go about automatically identifying subject meaningful queries?

At the limit, we might imagine the course custom search engine interface being akin to the command line in a text based adventure game, with the adventure itself being the learning journey, and the possible next step a combination of Promotions based guidance and actual search results…

[Code for the link scraping/CSE file generation and mindmap generator built around the T151 SA docs can be found at Github: Course Custom Search Engines]
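
The actual code is in the Github repo linked above, but purely as a sketch of the general idea, generating a Google CSE annotations file from a list of scraped links might look something like this (the CSE label name is a placeholder – each custom search engine has its own auto-generated label):

# Sketch: turn a list of scraped external links into a CSE annotations file,
# optionally relaxing each link to its whole domain. The label name is a
# placeholder and the example links are made up.
from urllib.parse import urlparse
from xml.etree.ElementTree import Element, SubElement, tostring

CSE_LABEL = "_cse_exampleLabel"  # placeholder - copy the real one from the CSE control panel

def annotations_xml(links, by_domain=False, score="1"):
    root = Element("Annotations")
    seen = set()
    for link in links:
        # Either index just the linked-to page, or relax to the whole domain
        about = (urlparse(link).netloc + "/*") if by_domain else link
        if about in seen:
            continue
        seen.add(about)
        ann = SubElement(root, "Annotation", about=about, score=score)
        SubElement(ann, "Label", name=CSE_LABEL)
    return tostring(root, encoding="unicode")

links = ["http://example.com/some/article", "http://example.org/another/page"]
print(annotations_xml(links, by_domain=True))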

PS as ever, I tend to focus on tinkering around a rapid prototype/demonstration at the technical systems overview level, with a passing nod to the usefulness of the service (which, as noted above, is a bit patchy where the search engine index is sparse). What I haven’t done is spend much time thinking about the pedagogical aspects relating to how we might make most effective use of custom search engines in the course context. From a scoping point of view, I think there are several things we need to unpick that relate to this: what is it that students are searching for, what context are they searching in, and what are they expecting to discover?

My original thinking around custom course search engines was that they would augment a search across course materials by providing a way of searching across the full text of resources* linked to from the course materials, and maybe also provide a context for searching over student suggested resources.

Search hierarchy

It strikes me that the course search engine is most likely to be relevant where there is active curation of the search engine that provides a search view over a reasonably sized set of resources discovered by folk taking the course and sharing resources related to it. “MOOCs” might be interesting in this respect, particularly where: 1) it is possible to include MOOC blog tag feeds in the CSE as a source of relevant content (both the course blog content and resources linked to from that content – the CSE can be configured to include resources that are linked to from a specified resource); 2) we can grab links that are tagged and shared with the MOOC code on social media and add those to the CSE annotations file. (Note that in this case, it would make sense to resolve shortened links to their ultimate destination URL before adding them to the CSE.) I’m not sure what role promotions might play in a MOOC though, or the extent to which they could be automatically generated?
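
By way of illustration, the link resolution step mentioned above might be as simple as the following sketch (the short link is a made-up placeholder; some shorteners dislike HEAD requests, in which case a GET would be needed instead):

# Sketch: resolve a shortened/redirected link to its ultimate destination URL
# before adding it to the CSE annotations file.
import requests

def resolve_link(url, timeout=10):
    # Follow the redirect chain (bit.ly, t.co, etc.) and return the final URL
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    return resp.url

# Example with a hypothetical short link:
# print(resolve_link("http://bit.ly/someshortcode"))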

*Full text search across linked to resources is actually quite problematic. Consider the following classes of online resources that we might expect to be linked to from course materials:

  • academic papers, often behind a paywall: links are likely to be redirected through a library proxy service allowing for direct click-thru to the resource using institutional credentials (I assume the user is logged in to the VLE to see the link, and single sign on support allows direct access to any subscribed to resources via appropriate proxies. That is, the link to the resource leads directly to the full text, subscribed to version of the resource if the user is signed on to the institutional system and has appropriate credentials). There are several issues here: the link that is supplied to the CSE should be the public link to the article homepage; the article homepage is likely to reveal little more than the paper abstract to the search engine. I’m not sure if Google Scholar does full-text indexing of articles, but even if it does, Scholar results are not available to the CSE. (There is also the issue of how we would limit the Scholar search to the articles we are linking to from the course materials.)
  • news and magazine articles: again, these may be behind a paywall, but even if they are, they may have been indexed by Google. So they may be full text discoverable via a CSE, even if they aren’t accessible once you click through…
  • video and audio resources: discovery in a large part will depend on the text on the web page the resources are hosted on. If the link is directly to an audio or video file, discoverability via the CSE may well be very limited!
  • books: Google book search provides full text search, but this is not available via a CSE. Full text searchable collections of books are achievable using Google Books Library Shelves; there’s also an API available.

I guess the workaround to all this is not to use a Google Custom Search Engine as the basis for a course search engine. Instead, construct a repository that contains full text copies of all resources linked to from the course, and index that using a local search engine, providing aliased links to the original sources if required?

Fudging the CSE with a local searchengine

However, that wasn’t what this experiment was about!;-)

Course Resources as part of a larger connected graph

Another way of thinking about linked-to course resources is that they are a gateway into a set of connected resources. Most obviously, an academic paper is part of a graph structure that includes:
- links to papers referenced in the article;
- links to papers that cite the article;
- links to other papers written by the same author;
- links to other papers in collections containing the article on services such as Mendeley;
- links into the social graph, such as the social connections of the author, or the discovery of people who have shared a link to the resource on a public social network.
For an informal resource such as a blog post, it might be other posts linked to from the post of interest, or other posts that link to it.

Thinking about resources as being part of one or more connected graphs may influence our thinking about the pedagogy. If the intention is that a learner is directed to a resource as a terminal, atomic resource, from which they are expected to satisfy a particular learning requirement, then we aren’t necessarily interested in the context surrounding the resource. If the intention is that the resource is a gateway to a networked context around one or more ideas or concepts, then we need to select our resources so that they provide a springboard to other resources. This can be done directly (e.g. through following references contained within the work, or tracking down resources that cite it), or indirectly, for example by suggesting keywords or search phrases that might be used to discover related resources by independent means. Alternatively, we might link to a resource as an exemplar of the sort of resource students are expected to work with on a given activity, and then expect them to find similar sorts of, but alternative, resources for themselves.

Written by Tony Hirst

November 8, 2011 at 1:54 pm

Digital Scholarship and Academically Discoverable Blogs

What does it take for a digital scholar’s blog to become academically credible?

At a time when we know that folk go to Google for a lot of their search needs, the academic library argues its case, in part, as a place where you can go to get access to “good quality” (academically credible) and comprehensive information through what we might term academic search engines.

The library’s search offerings are presumably subscription based (?) and their results often link through to subscription content; but the academic life is a privileged one, and our institutions cover the access costs on our behalf. (I guess this could almost be considered one of the “+ benefits” you might imagine an enthusiastic copywriter assuming for an academic job ad!)

The library and information access privilege extends to students too, so we might imagine a well-intentioned, but perhaps naive, student thinking that if they run a search using the Library’s “academically certified” search engine, they will get the sort of result they can happily cite in an essay, without fear of criticism about the academic credibility of the source publication.

We might imagine, too, that academics and researchers also place an element of trust in the credibility of sources returned as results to search queries raised using library discovery services.

So here’s a claim (which is untested and may or may not be true): if you want your work to stand a chance of being referenced in a piece of scholarly work, you need it to be discoverable in the places that the scholar goes to discover supporting claims or related material for the work they’re doing. The assumption is that the scholar will use a library provided discovery service because it is less noisy than a general web search engine and is likely to return resources from credible sources. The curation of sources – and what is not included in the index – is in part what the subscription discovery service offers.

What this means is that if digital scholars want their blogging activity to be discoverable in the academic context, they need to find some way of getting some of their blogposts at least into academic discovery service indices.

But this is not likely to happen, right? Wrong… Here’s what I noticed when I ran a search using the OU Library’s “one-stop” search earlier today:

Academically verified???

A top two reference to a Mashable article (albeit identified as a news item) via the Newsbank database and a top ranked periodical article from Fast Company (via the UK/EIRE Reference Centre database). (Hmmm, I wonder how quickly this content is indexed? That is, how soon after posting on Mashable does an article become discoverable here?)

So maybe I need to start writing for Mashable?!

Or maybe not…?

One of the attractive features of WordPress as a publishing platform is that it provides feeds for everything, including category and tag level feeds. A handful of my category feeds are syndicated, for example to R-Bloggers, the Guardian Datablog blogroll and (I’m not sure if this still works?) the Online Journalism blog. Only posts tagged in a particular way are sent to the syndicated feeds.

So I’m wondering this: how much mileage would there be in setting up aggregation blogs around particular academic areas that not only syndicate content from publisher members, but also act as a focus for indexing by a service such as Newsbank? The content would be publisher-moderated (I don’t post content on non-R related matters to my R-bloggers syndication feed) and hopefully responsive to the norms of the aggregation community itself.

Precedents already exist of course; for example, Nature.com blogs aggregates blogs from a variety of working scientists. Is this content discoverable via the OU Library’s one stop/Ebsco search?

For an academic’s work to count in RAE terms, it needs to be cited. In order to be cited, it needs to be discoverable. Even if it isn’t citeable as a formal article, it can still make a contribution if it’s discoverable. To be academically discoverable, content needs to be discoverable via academic search engines. So why should Mashable count, but not personal academic blogs that are respected within their own communities?

PS I’m a bit out of touch with referencing conventions; I remember that pers. comm. used to be an acceptable way of crediting someone’s ideas they had personally communicated to you; is there a pub. comm. (that’s pub. comm. not just pub comm. ;-) equivalent that might be used to refer to online or offline public communications that might not otherwise be citeable?

Written by Tony Hirst

November 2, 2011 at 11:41 am

Posted in Infoskills, Library, OU2.0
