What’s Inside a Book?

A couple of months ago, when I started looking at the idea of emergent social positioning in online social networks, I was focussing on trying to model the positioning of certain brands and companies, in part with a view to trying to identify ones that were associated with innovation, or future thinking in some way.

Based on absolutely no evidence at all, I surmised that one useful signal in this regard might be the context in which companies or brands are mentioned in popular, MBA-related business books, the sort of thing that Harvard Business Review publish, for example.

Here’s how my thinking went then:

– generate a bipartite network graph that connects a book’s index terms with the numbers of the pages they appear on, based on the index entries* in that book. A bipartite graph is one that contains two sorts or classes of node (in this case, index term nodes and book page number nodes). The index terms are likely to include companies, brands, people and ideas/concepts. Sometimes, particular index terms may be identified as companies, names, etc, through presentational markup – a bold font, or italics, for example. These presentational conventions can often be mapped onto semantic equivalents. Terms might also be passed through something like the Reuters’ Open Calais service, or TSO’s Data Enrichment Service.

– collapse the network graph by generating links between things that are connected to the same page number and remove the page number nodes from the graph. You now have a graph that connects brands, people and other index terms with each other, where edges represent the relation “is on the same page in the same book as”. If companies and other index terms appear on several pages together, we might reflect this by increasing the weight of the edge that connects them, for example by using edge weight to represent the number of pages where the two terms co-exist.

(*This will be obvious to some, but not to others. To a certain extent, a book index provides a faceted/search term limited search engine interface to a book, that returns certain pages as results to particular queries…)
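
A minimal sketch of the two steps above, assuming the index has already been captured as a mapping from each term to the pages it is listed under. The sample entries and the `project_index` name are made up for illustration; no real book index is used.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical index entries: term -> page numbers (invented sample data).
index = {
    "Apple": [12, 45, 46],
    "innovation": [12, 46, 90],
    "Steve Jobs": [45, 46],
}

def project_index(index):
    """Collapse the term/page bipartite graph into a weighted term/term graph."""
    # Step 1: the bipartite graph, stored as page -> set of terms on that page.
    terms_on_page = defaultdict(set)
    for term, pages in index.items():
        for page in pages:
            terms_on_page[page].add(term)

    # Step 2: link every pair of terms that share a page and drop the pages.
    # Edge weight = number of pages on which both terms appear.
    weights = defaultdict(int)
    for terms in terms_on_page.values():
        for a, b in combinations(sorted(terms), 2):
            weights[(a, b)] += 1
    return dict(weights)

edges = project_index(index)
for (a, b), weight in sorted(edges.items()):
    print(a, "--", b, "weight", weight)
```

Each pair is stored in sorted order so that an edge is only ever recorded once, whichever way round the two terms are met.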

Note that we can generate a network for a specific book, in which case we can render a graphical summary of the content, relations within and structure of that book, or we can generate more comprehensive networks that summarise the index term relations across several books.
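
One way the multi-book version might work, sketched under the assumption that each book has already been reduced to a dictionary of weighted term pairs: simply add the edge weights together. The two sample books and the `merge_networks` name are hypothetical.

```python
from collections import Counter

# Hypothetical per-book networks: (term, term) -> number of shared pages.
book_a = {("Apple", "innovation"): 2, ("Apple", "Nokia"): 1}
book_b = {("Apple", "innovation"): 1, ("Google", "innovation"): 3}

def merge_networks(*books):
    """Sum edge weights across several per-book networks."""
    total = Counter()
    for edges in books:
        total.update(edges)  # Counter.update adds counts rather than replacing them
    return dict(total)

combined = merge_networks(book_a, book_b)
```

Summing page counts lets a single long book dominate; counting the number of books in which a pair co-occurs would be an equally reasonable choice of weight.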

My thinking then was that if we can grab the indexes of a set of business books, we could map which companies and brands were being associated either with each other or with particular concepts in MBA land.

Which is where the problem lies – because I haven’t found anywhere where I can readily get hold of the indexes of business books in a sensible machine readable format. Given an electronic copy of a book, I guess I could run some text processing algorithms over it looking for word pairs in close association with each other and generating my own view over the book. But the reason for using an actual book index is at least twofold: firstly, because there has presumably been a quality process that determines what terms are entered into the index; secondly, because the index, if used by a human reader, will be influencing which parts of the book (and hence which related terms) they will be exposed to.
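
For comparison, the text processing fallback could be as crude as counting word pairs that fall within a few words of each other. This is only a sketch: the `cooccurring_pairs` name, the window size and the sample sentence are all assumptions, and a real attempt would need stop word removal and entity recognition at the very least.

```python
import re
from collections import Counter

def cooccurring_pairs(text, window=5):
    """Count pairs of distinct words that occur within `window` words of each other."""
    words = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i, word in enumerate(words):
        # Look ahead at the next (window - 1) words only, so each pairing
        # of two positions in the text is counted once.
        for other in words[i + 1:i + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

sample = "Apple drives innovation and Apple wins"
pairs = cooccurring_pairs(sample, window=3)
```

Unlike an index, nothing here reflects an editorial judgement about which terms matter, which is the quality signal the index-based approach is trying to borrow.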

(It’s maybe also worth noting that books also contain a lot of other structured metadata – tables of contents, lists of figures, titles, headings, subheadings, emphasis, lists, captions, and so on, all of which provide cues as to how the book is structured and how ideas and entities contained within it relate to each other.)

As to why I’m posting this now? I first floated this idea with @edchamberlain following a JISC bibliography data event, and he reminded me of it at the Arcadia Project review a couple of days ago ;-)

Related, sort of: Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags, which looked at mapping corporate mentions in the BBC/OU co-pro business programme The Bottom Line:

First attempt at tagging BBC/OU 'The Bottom Line' progs using opencalais

Also Citation Positioning.

PS this is clever – and related – via @ostephens: http://www.eatyourbooks.com/ (“‘Tell us which books you own’ We have indexed the most popular cookbooks & magazines so recipes become instantly searchable.”).


  1. Ed Chamberlain

    Few thoughts:

    1) IPR around index as a separate intellectual unit – issue or is this ‘private research?’
    2) Formatting text from scanned image into said graph
    3) Ways to get hold of other useful structured elements in a text

    Without knowing anything about them, would eBooks in ePub or mobi format support this at a similar level of semantic granularity?

    We next need a forward thinking publisher ready to ‘donate’ a load of texts for the purposes of an experiment. Getting O’Reilly texts (always well indexed) and examining their indexes for mentions of startups / popular code libraries (how many now mention jQuery?) may be an alternative way to go.

    Otherwise, we could scan a few and try …

    • Tony Hirst

      @ed So: do you know anyone at Cambridge University Press who might be interested eg in context of University Publishing Online [http://www.cambridge.org/gb/knowledge/news/newsitem/item6617575/?site_locale=en_GB ]? Or maybe someone at Harvard Business Review? (Is there anyone on the UL visiting committee from there?;-)

  2. Tony Hirst

    Related, ish: a comment I posted on http://blogs.ch.cam.ac.uk/pmr/2012/03/02/our-protocol-for-text-mining-preamble-and-%E2%80%9Cinstitutionalism%E2%80%9D-elsevier-and-other-publishers-should-take-note/#comment-103422

    I’m not sure how broad the remit of your response is expected to be, but I often wonder about how publishers can do more to help unlock the structural value contained within papers they publish, some of which is a direct result of their efforts. One of the things I’ve started pondering is the filter value associated with indices. I’m not sure if these are manually compiled, fully automated, bootstrapped by automation/textmining then passed over to human editorial control (maybe with additional text mining to complete the indexing) but there is value in an index, I think, that can be used to help map knowledge structures across works and also support discovery tools that search within texts.

    To ground this, take the subject of business books. Indices mention companies (often with stylistic conventions such as bold font to make a semantic distinction that this is a company, for example), executives’ names (maybe italicised), and key terms (“innovation”, “data driven”, whatever). And page numbers of course (sometimes with some pages emphasised). Getting access to indices as data allows the construction of graphs within and across a work that provide new ways of working with the knowledge contained in those works. But can we get hold of indices? Not that I know of. (Note the separate issue about whether indices should be freely available, or available as paid-for works in their own right.) What I think I’m arguing for here is that the publishers do not seem to be exploring ways of extracting more value from the works that may benefit readers/cultural development and may usefully feed back into funded research that pays for the work that gets written up in the books the publishers sell. By making works available “as (structured) data”, at least they wouldn’t prevent others from trying to exploit the structural value of their publications. (Hmmm…. I’m reminded here about researchers who won’t release data sets because they don’t want others to be able to make discoveries from the data before them…..)