Via @simonw’s rebooted blog, I spotted this – Landsat on AWS: “Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes are available from the start of imagery capture. All new Landsat 8 scenes are made available each day, often within hours of production.”
What do things like this mean for research, and teaching?
For research, I’m guessing we’ve gone from a state 20 years ago – no data [widely] available – to 10 years ago – available under license, with a delay and perhaps as periodic snapshots – to now – daily availability. How does this impact research, and what sorts of research are possible? And how well suited are legacy workflows and tools to supporting work that can make use of daily updated datasets?
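By way of a sketch of just how low the barrier to entry is, here’s how you might construct the public URL for a single band of a Landsat 8 scene from the AWS open dataset. The bucket name (landsat-pds) and key layout are assumptions based on how the public dataset was documented at the time, so check the current docs before relying on them:

```python
# Sketch: building a public URL for a Landsat 8 band image on AWS S3.
# ASSUMPTIONS: the "landsat-pds" bucket name and the L8/{path}/{row}/{scene}/
# key layout reflect the dataset's documented structure, not a verified API.

def landsat_scene_url(scene_id, band):
    """Construct the HTTPS URL for one band of a Landsat 8 scene.

    The WRS-2 path and row are encoded in characters 3-8 of the scene id
    (e.g. LC81390452014295LGN00 -> path 139, row 045).
    """
    path, row = scene_id[3:6], scene_id[6:9]
    return ("https://landsat-pds.s3.amazonaws.com/"
            f"L8/{path}/{row}/{scene_id}/{scene_id}_B{band}.TIF")

print(landsat_scene_url("LC81390452014295LGN00", 4))
```

No credentials, no license negotiation – just a URL you can fetch, which is rather the point.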
For teaching, the potential is there to do activities around a particular dataset that is current, but this introduces all sorts of issues when trying to write and support the activity (eg we don’t know what specific features the data will turn up in the future). We struggle with this anyway trying to write activities that give students an element of free choice or open-ended exploration where we don’t specifically constrain what they do. Which is perhaps why we tend to be so controlling – there is little opportunity for us to respond to something a student discovers for themselves.
The realtime-ish-ness of the data means we could engage students with contemporary issues, and perhaps enthuse them about the potential of working with datasets that we can only hint at, or provide a grounding for, in the course materials. There are also opportunities for introducing students to datasets and workflows that they might be able to use in their workplace, and as such act as a vector for getting new ways of working out of the Academy – and out of the tech hinterland that the Academy may be aware of – and into more SMEs (helping SMEs avail themselves of emerging capabilities via OUr students).
At a more practical level, I wonder: if OU academics (research or teaching related) wanted to explore the Landsat 8 data on AWS, would they know how to get started?
What sort of infrastructure, training or support do we need to make this sort of stuff accessible to folk who are interested in exploring it for the first time (other than Jupyter notebooks, RStudio, and Docker of course!;-) ?
PS Alan Levine /@cogdog picks up on the question of what’s possible now vs. then: http://cogdogblog.com/2017/11/landsat-imagery-30-years-later/. I might also note: this is how the blogosphere used to work on a daily basis 10-15 years ago…
I didn’t make it to ILI this year – for the first time they rejected all my submissions, so I guess I’m not even a proper shambrarian now :-( – but I was reminded of it, and the many interesting conversations I’ve had there in previous years, during a demo of the LEAN Library browser extension by Johan Tilstra (we’d first talked about a related application a couple of years ago at ILI).
The LEAN Library app is a browser extension (available for Chrome, Firefox and Safari) that offers three services:
- Library Access: seamless access to subscription content using a Library’s ezproxy service;
- Library Assist: user support for particular subscription site;
- Library Alternatives: provide alternative sources for content – if a user doesn’t have access to an item through one subscription service, point them at another service through which they do.
Johan also stressed several points in the demo:
- once installed, and the user authenticated, the extension automatically rewrites URLs on domains the library subscribes to; as soon as you land on a subscribed-to site, you are redirected to a proxied version of the site that lets you download subscription content directly;
- the extension seamlessly pops up a branded library panel on subscribed-to sites, unlike a bookmarklet, which requires user action to trigger the proxy behaviour; because the pop-up appears without any other user interaction, a user visiting a site that they didn’t know was subscribed to is informed of the fact. This is really useful for raising awareness amongst library patrons of the service the library is providing.
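The rewriting step itself is conceptually simple. Here’s a minimal sketch of the kind of thing such an extension might do – note that the proxy hostname, the whitelist entries and the `src` marker parameter are all hypothetical, not how LEAN Library actually implements it:

```python
# Sketch of whitelist-driven ezproxy URL rewriting, as an extension might do it.
# ASSUMPTIONS: the proxy prefix, whitelisted domains and the "src" marker
# parameter are invented for illustration.

from urllib.parse import urlsplit

PROXY_PREFIX = "https://libezproxy.example.ac.uk/login?url="  # hypothetical
WHITELIST = {"www.sciencedirect.com", "link.springer.com"}    # hypothetical

def proxify(url, marker=None):
    """Return an ezproxy-wrapped URL for whitelisted domains, else the
    original URL unchanged. An optional marker parameter could make
    extension-driven visits identifiable in the ezproxy logs."""
    host = urlsplit(url).netloc
    if host not in WHITELIST:
        return url
    if marker:
        sep = "&" if "?" in url else "?"
        url = f"{url}{sep}src={marker}"
    return PROXY_PREFIX + url

print(proxify("https://link.springer.com/article/10.1007/xyz", marker="ext"))
```

The `marker` idea also hints at one answer to the tracking question raised below: a distinctive parameter added at rewrite time should show up in the proxy logs.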
I used to try to make a similar sort of point back when I used to bait libraries regularly, under the mantle of trying to get folk to think about “invisible library” services (as well as how to make folk realise they were using such services):
The LEAN Library extension is sensitised to sites the library subscribes to via a whitelist downloaded from a remote configuration site. In fact, LEAN Library host the complete configuration UI, which allows library managers to define and style the content pop-up and define the publisher domains for which a subscription exists. (FWIW, I’m not clear what happens when the journal the user wants to access sits on a publisher site but isn’t part of a subscription package?)
This approach has a couple of advantages:
- the extension doesn’t try every domain the user visits to see if it’s ezproxy-able; it has a local list of relevant domains;
- if the Library updates its subscription list, the extension is updated too.
That said, if a user does try to download content, it’s not necessarily obvious how the library knows that the proxy page was “enabled” by the extension. (If the URL rewriter added a code parameter to the URL, would that be trackable in the ezproxy logs?)
It’s interesting to see how this idea has evolved over the years. The LEAN Library approach certainly steps up the ease of use from the administrator side, and arguably for users too. For example, the configuration site makes it easy for admins to customise the properties of the extension, which used to require handcrafting in the original versions of this sort of application.
As to the past – I wonder if it’s worth digging through my old posts on a related idea, the OU Library Traveller, to see whether there are any UX nuggets, or possible features, that might be worth exploring again? That started out as a bookmarklet, building on several ideas by Jon Udell, before moving to a Greasemonkey script (Greasemonkey being an extension that let you run your own custom scripts in the browser):
- OU Traveller (Library Remote) Script
- Generic Library Lookup Greasemonkey Script
- OU Traveller GM Script – Multiple Lookups
- Generic Library Traveller
- OU Library Traveller – Title Lookup
- OU Library Traveller – eBook Greedy
PS In passing, I note that the OU libezproxy bookmarklet is still available. I still use my own bookmarklet several times a week. I also used to have a DOI version that let you highlight a DOI and resolve it through the proxy (DOI and OpenURL Resolution). There was a DOI-linkifier too, I think? Here’s a (legacy) related OU Library webservice: EZproxy DOI Resolver
Mentioning to a colleague yesterday that the UK Parliamentary Library publishes research briefings and reports – on topics of emerging interest, as well as to support legislation – that often provide a handy, informed and politically neutral overview of a subject area, and could make for a useful learning resource, I was asked whether they might have anything on the “internet of things”. The answer is not much, but it got me thinking a bit more about the range of documents and document types produced across Parliament and Government that can be used to educate and inform, as well as contribute to debate.
In other words, to what extent might such documents be used in an educational sense, whether by providing knowledge and information about a topic, providing a structured review of a topic area and the issues associated with it, raising questions about an issue, or reporting on an analysis of it? (There are also opportunities for learning from some of the better Parliamentary debates, for example in terms of how to structure an argument or explore the questions associated with an issue, but Hansard is out of scope of this post!)
(Also note that I’m coming at this as a technologist, interested as much in the social processes, concerns and consequences associated with science and technology as in the deep equations and principles that tend to be taught as the core of the subject, at least in HE. And I’m interested not just in how we can support the teaching and learning of current undergrads, but also in how we can enculturate them into the availability and use of certain types of resource that are likely to continue being produced into the future, and as such provide a class of resources that will continue to support their learning and education once they leave formal education.)
So, using IoT as a hook to provide examples, here’s the range of documents I came up with. (At some point it may be worth tabulating this to properly summarise the sorts of information these reports might contain, the communicative purpose of each document (to inform, persuade, provide evidence for or against something, etc.), and any political bias that may be likely (in policy docs, for example).)
Parliamentary Library Research Briefings
The Parliamentary Library produces a range of research briefings covering matters of general interest (Commons Briefing papers, Lords Library notes) – perhaps identified through repeated questions asked of the Library by members? – as well as background to legislation (Commons Debate Packs, Lords in Focus), through the Commons and Lords Libraries respectively.
Some of the research briefings include data sets (try a web search for site:http://researchbriefings.files.parliament.uk/ filetype:xlsx), which can also be quite handy.
There are also POSTnotes from the Parliamentary Office of Science and Technology, aka POST.
For access to briefings on matters currently in the House, the Parliament website provides timely/handy pages that list briefing documents for matters in the House today/this week. In addition, there are feeds available for recent briefings from all three sources: Commons Briefing Papers feed, Lords Library Notes feed, POSTnotes feed. If you’re looking for long reads and still use a feed reader, get subscribing;-)
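And if you don’t use a feed reader, pulling the latest items out of one of those feeds takes only a few lines of standard-library Python. (The sample XML below is made up for illustration – it is not a real extract from the Commons Briefing Papers feed, whose exact element layout may differ.)

```python
# Sketch: extracting (title, link) pairs from an RSS 2.0 briefings feed.
# ASSUMPTION: SAMPLE_RSS is invented example data, not the real feed.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Commons Briefing Papers</title>
  <item><title>The Internet of Things</title>
        <link>http://researchbriefings.parliament.uk/example1</link></item>
  <item><title>Smart meters</title>
        <link>http://researchbriefings.parliament.uk/example2</link></item>
</channel></rss>"""

def feed_items(rss_text):
    """Return (title, link) pairs for each item in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in feed_items(SAMPLE_RSS):
    print(title, "->", link)
```

Swap the sample string for the response from a real feed URL and the same function should do the job.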
Wider Parliamentary Documents
The Parliament website also supports navigation of topical issues such as Science and Technology, as well as sub-topics, such as Internet and Cybercrime. (I’m not sure how the topics/sub-topics are identified or how the graph is structured… That may be one to ask about when I chat to Parliamentary Library folk next week?:-)
Within the topic areas, relevant Commons and Lords Library research briefings are listed, as well as POSTnotes, Select Committee Reports and Early Day Motions.
(By the by, it’s also worth noting that chunks of the Parliament website are currently in scope of a website redesign.)
Along with legislation currently going through Parliament, which is published on the Parliament website (together with Hansard reports that record, verbatim(-ish!), the proceedings of debates in either House), explanatory notes provided by the Government department bringing a bill offer additional, supposedly more accessible/readable, information around it.
Reports are also published by government offices. For example, the Blackett review (2014) on the Internet of things was a Government Office for Science report from the UK Government Chief Scientific Adviser at the time (The Internet of Things: making the most of the Second Digital Revolution). Or how about a report from the Intellectual Property Office on Eight great technologies: The internet of things.
Briefing documents also appear in a variety of other guises. For example, competitions (such as the Centre for Defence Enterprise (CDE) competition on security for the internet of things, or Ofcom’s consultation on More radio spectrum for the Internet of Things) and consultations may both provide examples of how to start asking questions about a particular topic area (questions that may help to develop critical thinking, prompt critical reflection, or even provide ideas for assessment!).
Occasionally, you can also turn up a risk assessment or cost-benefit analysis, such as this Department for Business, Energy &amp; Industrial Strategy Smart meter roll-out (GB): cost-benefit analysis.
EC Parliamentary Research Service
In passing, it’s also worth noting that the EC Parliamentary Research Service also produces briefings, such as this report on The Internet Of Things: Opportunities And Challenges, as well as publishing resources linked from topic based pages, such as the Digital Single Market topic page on The Internet of Things.
In providing support for all members of the House, the Parliamentary research services must produce research briefings that can be used by both sides of the House. This may stand in contrast to documents produced by Government that may be influenced by particular policy (and political) objectives, or formal reports published by industry bodies and the big consultancies (the latter often producing reports that are either commissioned on behalf of government or published to try to promote some sort of commercial interest that can be sold to government) that may have a lobbying aim.
As I’ve suggested previously (News, Courses and Scrutiny and Learning Problems and Consultation Based Curricula), maybe we could/should be making more use of them as part of higher education course readings – not just as a way of getting a quick, NPOV view over a topic area, but also as a way of introducing students to a form of free and informed content, produced in a timely way in response to issues of the day. In short, a source that will continue to remain relevant and current over the coming years, as students (hopefully) become lifelong, continuing learners.
OU Library guidance to students on citations for journal articles reads as follows:
Using this reference I should be able to run a pretty good known item search – or not, as the case may be?
So where does the full reference – the journal title, for example – help exactly? On Google, maybe… (Actually, the single search box may match against all the fields – the article title as well as the journal title – and generate retrieval/ranking factors based on that?)
References and search contexts are complementary – for a reference to be effective, it needs to work with your search context, which typically means the user interface of your search system; for a specific known item reference this typically means the (hidden away) advanced search interface.
So I wonder: whilst we penalise students for not using full, formal references (even though they often provide enough of a reference to find the item on Google), the officially provided search tools don’t let you use the information in the formal reference in a structured way to retrieve – and hopefully access (rather than discover; the reference is the discovery component) – the desired item?
Or am I reading the above search UI incorrectly…?
PS in terms of teaching material design, and referencing the above citation example, erm….?
Because of course I’m not searching for a Journal that has something to do with Frodo Baggins – I’m searching for an article…
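To make the point concrete, here’s a sketch of how the fields of a structured reference could map onto the fielded query an advanced search interface might run for a known item. The field codes (`ti`, `so`, `au`, `yr`) are hypothetical, not those of any particular library discovery product:

```python
# Sketch: mapping structured citation fields onto fielded search clauses.
# ASSUMPTION: the field codes below are invented, not from a real system.

def known_item_query(reference):
    """Turn a structured citation dict into AND-ed fielded query clauses."""
    field_map = {"title": "ti", "journal": "so", "author": "au",
                 "year": "yr"}  # hypothetical advanced-search field codes
    return " AND ".join(f'{field_map[k]}:"{v}"'
                        for k, v in reference.items() if k in field_map)

ref = {"author": "Baggins, F.", "year": "2016",
       "title": "There and back again", "journal": "Journal of Travel"}
print(known_item_query(ref))
```

The point being: the reference already carries this structure, so in principle a search tool could consume it directly rather than making the user retype fragments into a onebox.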
PPS I’m also finding more and more that the subscription journal content I want to access is from journals that the OU Library doesn’t subscribe to. I’m not sure how many of the bundled journals it does subscribe to are never accessed (the data should reveal that)? So I wonder – as academics (and maybe students), should we instead be given a budget code we could use to buy the articles we want? And for articles used by students in courses, get a “site license” for some articles?
Now what was the URL of that pirated academic content site someone from the library told me about again…?
PS from the Library – don’t use the reference – just bung the title in the search onebox like you would on a web search engine…
Hmm… but if I have a full reference, I should be able to run a search that returns just a single result, for exactly the item I want? Or maybe returns links to a few different instances (from different suppliers) of just that resource? But then – which is the preferred one (the Library search ranks different suppliers of the same work according to what algorithm?)
Or perhaps the library isn’t really about supporting known item retrieval – it’s about supporting serendipity and the serendipitous discovery of related items? (Though that begs the question of how the related item list is algorithmically generated?)
Or maybe ease of use has won out – and running a scruffy search then filtering down by facet gives a good chance of effective retrieval with an element of serendipity around similar resources?
(Desperately tries to remember all the arguments libraries used to make against one box searching…)
After a break of a couple of years, I’ll be doing a couple of sessions at ILI 2016 next week; for readers with a long memory, the Internet Librarian International conference is where I used to go to berate academic librarians every year about how they weren’t keeping up with the internet, and this year will perhaps be a return to those stomping grounds for me ;-)
One of the questions I used to ask – and probably still could – was where in the university I should go to get help with visualising a data set, either to help me make sense of it, or as part of a communications exercise. Back when IT was new, libraries used to be a place you could go to get help with certain sorts of information and study skills (such as essay writing), which included bits of training and advice on how to use appropriate software applications. As the base level of digital information skills increases, though, many people are able to figure out how to open a spreadsheet on their own.
But has the Overton window for librarians offering IT support developed with the times? Where should students wanting to develop more refined skills – how to start cleaning a dataset, for example, or visualising one sensibly, or even just learning how to read a chart properly on the one hand, or tell a story with data on the other – actually go? And what about patrons who want to be able to make better use of automation to help them in their information related tasks (screenscraping, for example, or extracting text, images or tables from hundreds of pages of PDFs); or who want help accessing commodity “AI” services accessed via APIs? Or need support in writing scientific communications that sensibly embed code and its outputs, or mathematical or musical notation, particularly for a web based (i.e. potentially interactive) journal or publication? Or who just need to style a chart in a particular way?
Now it’s quite likely that, having been playing with tech for more years than I care to remember, I’m afflicted by “the curse of knowledge”, recently contextualised for libraries by Lorcan Dempsey, quoting Steven Pinker. In the above paragraph, I half-assume readers know what screenscraping is, for example, and take it as blindingly obvious (?!) why you might want to be able to do it, even if you don’t know how. (For librarians, there’s a couple of things to note there: firstly, knowing what it is and why you might want to do it; secondly – which might be a referral to tools, if not training – knowing what sorts of tool might be able to help you with it.)
But the question remains – there are a lot of tech power tools out there that can help you retrieve, sort, search, analyse, organise and present information, but where do I go for help?
If not the library, where?
If not the library, why not?
The end result is often: the internet. For which, for many in the UK, read: Google.
Anyway, via the twitterz (I think…) a couple of weeks ago, I spotted this interesting looking job ad from Harvard:
School/Unit Harvard College Library
Location USA – MA – Cambridge
Job Function Library
Time Status Full-time
Department Harvard College Library – Services for Maps, Media, Data, and Government Information
Duties & Responsibilities – Summary
Reporting to the Head, Social Sciences and Visualization in the unit for Maps, Media, Data and Government Information, the Visualization Specialist works with staff and faculty to identify hardware and software needs, and to develop scalable, sustainable practices related to data visualization services. This position designs and delivers workshops and training sessions on data visualization tools and methods, and develops a range of instructional materials to support library users with data visualization needs in the Social Sciences and Humanities.
The Visualization Specialist will coordinate responsibilities with other unit staff and may supervise student employees.
Duties and Responsibilities
– Advises, consults, instructs, and serves as technical lead with data visualization projects with library, faculty teaching, and courses where students are using data.
– Identifies, evaluates and recommends new and emerging digital research tools for the Libraries and Harvard research community.
– Develops and supports visualization services in response to current trends, teaching and learning–especially as it intersects with Library collections and programs.
– Collaborates in developing ideas and concepts effectively across diverse interdisciplinary audiences and serves as a point person for data visualization and analysis efforts in the Libraries and is attuned to both the quantitative and qualitative uses with datasets. Understands user needs for disseminating their visualizations as either static objects for print publications or interactive online objects to engage with.
– Develops relationships with campus units supporting digital research, including the Center for Government and International Studies, Institute for Quantitative Social Sciences, and Harvard Library Central Services, and academic departments engaged in data analysis and visualization.
– Develops, collects, and curates exemplar data sets from multiple fields to be used in visualization workshops and training materials
– ALA-accredited master’s degree in Library or Information Science OR advanced degree in Social Sciences, Psychology, Design, Informatics, Statistics, or Humanities.
– Minimum of 3 years experience in working with data analysis and visualization in an academic setting.
– Demonstrated experience with data visualization tools and programming libraries.
– Proficiency with at least one programming language (such as Python and R).
– Ability to use a variety of tools to extract and manipulate data from various sources (such as relational databases, XML, web services and APIs).
Experience supporting data analysis and visualization in a research setting.
Proficiency using tools and programming libraries to support text analysis.
Familiarity with geospatial technology.
Experience identifying and recommending new tools, technologies, and online delivery of visualizations.
Graphic design skills and proficiency using relevant software.
Many of the requisite skills resonate with the calls (taunts?) I used to make asking library folk where I should go for support with data related questions. At which point you may be thinking – “okay, techie geeky stuff… scary… not our gig…”.
But if you focus on data visualisation, many of the questions that arise actually relate to communication issues – representation and presentation – rather than technical calls for help. For example, what sort of chart should I use to communicate this sort of thing? How can I change the look of a chart? How can I redesign a chart to help me communicate with it better?
And it’s not just the presentation of graphical information. Part of the reason I put together the F1 Data Junkie book was that I wanted to explore the RStudio/RMarkdown (Rmd) workflow for creating (stylish) technical documents. Just the other day I noticed that in the same way charts can be themed, themes for Rmd documents are now starting to appear – such as tint (Tint Is Not Tufte); in fact, it seems there’s a whole range of output themes already defined (see also several other HTML themes for Rmd output).
What’s nice about these templates is that they are defined separately from the actual source document. If you want to change from one format to another, things like the R rticles package make it easy. But how many librarians even know such workflows exist? How many have even heard of markdown?
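To give a feel for how lightweight this separation is: switching an Rmd document to one of these themes is typically just a matter of changing a line or two of YAML front matter – something like the following, assuming the tint package is installed (the exact output format name may vary between package versions):

```yaml
---
title: "An example report"
output: tint::tintHtml
---
```

The body of the document stays untouched; only the declared output format changes.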
It seems to me that tools around document creation are in a really exciting place at the moment, made more exciting once you start to think about how they fit into wider workflows (which actually makes them harder to promote, because folk are wedded to their current crappy workflows).
So are the librarians on board with that, at least, given their earlier history as word-processor evangelists?
See also: A New Role for the Library – Gonzo Librarian Informationista, including the comment Notes on: Exploring New Roles for Librarians. This also touches on the notion of an embedded librarian.
And this, from Martin Bean, previously VC of the OU, several years ago…
As well as offering digital application shelves, should libraries offer, or act as institutional sponsors of, digital workbenches?
I’ve previously blogged about things like SageMathCloud, an application based learning environment, and the IBM Data Scientist Workbench, and today came across another example: DHBox, CUNY’s digital humanities lab in the cloud (wiki), which looks like it may have been part of a Masters project?
If you select the demo option, a lab context is spawned for you, and provides access to a range of tools: staples, such as RStudio and Jupyter notebooks, a Linux terminal, and several website creation tools: Brackets, Omeka and WordPress (though the latter two didn’t work for me).
(The toolbar menu reminded me of Stringle / DockLE ;-)
There’s also a file browser, which provides a common space for organising – and uploading – your own files. Files created in one application are saved to the shared file area and available for use on other applications.
The applications sit behind a (demo) password authentication scheme, which makes me wonder whether persistent accounts are in the project timeline?
Once inside the application, you have full control over it. If you need additional packages in RStudio, for example, then just install them:
They work, too!
On the Jupyter notebook front, you get access to Python3 and R kernels:
In passing, I notice that RStudio’s RMarkdown now demonstrates some notebook-like activity, showing the convergence between document formats such as Rmd (and ipymd) and notebook style UIs [video].
Code for running your own DHBox installation is available on Github (DH-Box/dhbox), though I haven’t had a chance to give it a try yet. One thing it’d be nice to see is a simple tutorial showing how to add in another tool of your own (OpenRefine, for example?) If I get a chance to play with this – and can get it running – I’ll try to see if I can figure out such an example.
It also reminded me that I need to play with my own install of tmpnb, not least because of the claim that “tmpnb can run any Docker container”. Which means I should be able to set up my own tmpRStudio, or tmpOpenRefine environment?
If visionary C. Titus Brown gets his way with a pitched-for MyBinder hackathon, that might extend that project’s support for additional data science applications such as RStudio, as well as generalising the infrastructure on which MyBinder can run. Such as Reclaimed personal hosting environments, perhaps?!;-)
That such combinations are now popping up all over the web makes me think that they’ll be a commodity service before long. I’d be happy to argue this sort of thing could be used to support a “technology enhanced learning environment”, as well as extending naturally into “technology enhanced research environments”, but from what I can tell, TEL means learning analytics rather than practical digital tools used to develop digital skills? (You could probably track the hell out of people using such environments if you wanted to, though I still don’t see what benefits are supposed to accrue from such activity?)
It also means I need to start looking out for a new emerging trend to follow, not least because data2text is already being commoditised at the toy/play entry level. And it won’t be VR. (Pound to a penny the Second Life hipster, hypster, shysters will be chasing that. Any VR campuses out there yet?!) I’d like to think we might see inroads being made into AR, but suspect that too will always be niche, outside certain industry and marketing applications. So… hmmm… Allotments… that’s where the action’ll be… and not in a tech sense…
Earlier today, I came across BioShaDock: a community driven bioinformatics shared Docker-based tools registry (BioShadock registry). This collects together a wide range of containerised applications and tools relevant to the bioinformatics community. Users can take one or more applications “off-the-shelf” and run them, without having to go through any complicated software installation process themselves, even if the installation process is a nightmare confusion of dependency hell: the tools are preinstalled and ready to run…
The container images essentially represent reference images that can be freely used by the community. The application containers come pre-installed and ready to run, exact replicas of their parent reference image. The images can be versioned with different versions or builds of the application, so you can reference the use of a particular version of an application and provide a way of sharing exactly that version with other people.
So could we imagine this as a specialist reference shelf in a Digital Library? A specialist reference application shelf, with off-the-shelf, ready-to-run tools, anywhere, any time?
Another of the nice things about containers is that you can wire them together using things like Docker Compose or Panamax templates to provide a suite of integrated applications that can work with each other. Linked containers can pass information between each other in isolation from the rest of the world. One click can provision and launch all the applications, wired together. And everything can be versioned and archived. Containerised operations can also be sequenced too (eg using DRAY docker pipelines or OpenAPI).
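A minimal Docker Compose file makes the "wiring together" point concrete – note this is illustrative only; the image names and ports here are just plausible examples, not a tested stack:

```yaml
# Illustrative only: two linked analysis environments, one compose file.
version: "2"
services:
  rstudio:
    image: rocker/rstudio
    ports:
      - "8787:8787"
  notebook:
    image: jupyter/minimal-notebook
    ports:
      - "8888:8888"
    links:
      - rstudio
```

A single `docker-compose up` would then provision and launch both applications, wired together, which is exactly the one-click provisioning story described above.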
Sometimes, you might want to bundle a set of applications together in a single, shareable package as a virtual machine. These can be versioned, and shared, so everyone has access to the same tools installed in the same way within a single virtual machine. Things like the DH Box, “a push-button Digital Humanities laboratory” (DHBox on github); or the Data Science Toolbox. These could go on to another part of the digital library applications shelf – a more “general purpose toolbox” area, perhaps?
As a complement to the “computer area” in the physical library that provides access to software on particular desktops, the digital library could have “execution rooms” that will actually let you run the applications taken off the shelf, and access them through a web browser UI, for example – runtime environments like mybinder or tmpnb. Go to the digital library execution room (which is just a web page, though you may have to authenticate to gain access to the server that will actually run the code for you…), say which container, container collection, or reference VM you want to run, and click “start”. Or take the images home with you (that is, run them on your own computer, or on a third party host).
Some fragments relating to the above to try to help me situate this idea of runnable, packaged application shelves with the context of the library in general…
- libraries have been, and still are, one of the places you go access IT equipment and learn IT skills;
- libraries used to be, and still are, a place you could go to get advice on, and training in, advanced information skills, particularly discovery, critical reading and presentation;
- libraries used to be, and still are, a locus for collections of things that are often valuable to the community or communities associated with a particular library;
- libraries provide access to reference works or reference materials that provide a common “axiomatic” basis for particular activities;
- libraries are places that provide access to commercial databases;
- libraries provide archival and preservation services;
- libraries may be organisational units that support data and information management needs of their host organisation.
Some more fragments:
- the creation of a particular piece of work may involve many different steps;
- one or more specific tools may be involved in the execution of each step;
- general purpose tools may support the actions required to perform a range of tasks to a “good enough” level of quality;
- specialist tools may provide a more powerful environment for performing a particular task to a higher level of quality;
- what tools are available for performing a particular information related task or set of tasks?
- what are the best tools for performing a particular information related task or set of tasks?
- where can I get access to the tools required for a particular task without having to install them myself?
- how can I effectively organise a workflow that requires the use of several different tools?
- how can I preserve, document or reference the workflow so that I can reuse it or share it with others?
- Docker containers provide a way of packaging an application or tool so that it can be “run anywhere”;
- Docker containers may be linked together in particular compositions so that they can interoperate with each other;
- Docker container images may be grouped together in collections within a subject specific registry: for example, BioShaDock.
Via @Megan_Lucero, I notice that the Sunday Times data journalism team have published “a list of companies in Panama set up by Mossack Fonseca and its associates, as well as the directors, shareholders and legal agents of those companies, as reported to the Panama companies registry”: Sunday Times Panama Papers Database.
Here’s a quick start to getting the data (which is available for download) into a form you can start to play with using SQLite3.
- Download and install SQLite3
- download the data from the Sunday Times and unzip it
- on the command line/terminal, cd into the unzipped directory
- create a new SQLite3 database: sqlite3 sundayTimesPanamaPapers.sqlite
- you should now be presented with a SQLite console command line. Run the command: .mode csv
- we’ll now create a table to put the data into: CREATE TABLE panama(company_url TEXT,company_name TEXT,officer_position_es TEXT,officer_position_en TEXT,officer_name TEXT,inc_date TEXT,dissolved_date TEXT,updated_date TEXT,company_type TEXT,mf_link TEXT);
- We can now import the data – the header row will be included but this is quick’n’dirty, right? .import sunday_times_panama_data.csv panama
- so let’s poke the data – preview the first few lines: SELECT * FROM panama LIMIT 5;
- let’s see which officers are named the most: SELECT officer_name,COUNT(*) as c FROM panama GROUP BY officer_name ORDER BY c DESC LIMIT 10;
- see what officer roles there are: SELECT DISTINCT officer_position_en FROM panama;
- see which people hold the most “President” or “Director/President” positions: SELECT officer_name,officer_position_en, COUNT(*) as c FROM panama WHERE officer_position_en='Director/President' OR officer_position_en='President' GROUP BY officer_name,officer_position_en ORDER BY c DESC LIMIT 10;
- exit the SQLite console by running: .quit
- to start a new session from the command line: sqlite3 sundayTimesPanamaPapers.sqlite (you won’t need to load the data in again, you can get started with a SELECT straightaway).
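The same quick’n’dirty workflow can also be scripted with Python’s built-in sqlite3 module, which is handy if you want something repeatable. The table and column names below follow the steps above; the sample rows are invented purely for illustration (swap in the commented-out csv loader to use the real download):

```python
# Script the SQLite walkthrough above using Python's stdlib sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the db
cur = conn.cursor()
cur.execute("""CREATE TABLE panama(
    company_url TEXT, company_name TEXT,
    officer_position_es TEXT, officer_position_en TEXT,
    officer_name TEXT, inc_date TEXT, dissolved_date TEXT,
    updated_date TEXT, company_type TEXT, mf_link TEXT)""")

# In place of .mode csv / .import, load the real data with csv.reader:
#   import csv
#   with open("sunday_times_panama_data.csv") as f:
#       cur.executemany("INSERT INTO panama VALUES (?,?,?,?,?,?,?,?,?,?)",
#                       csv.reader(f))
# Made-up rows so the query below has something to chew on:
rows = [
    ("u1", "Acme Ltd", "", "Director", "A N Other", "", "", "", "", ""),
    ("u2", "Bloggs Inc", "", "Director", "A N Other", "", "", "", "", ""),
    ("u3", "Cogs SA", "", "President", "J Doe", "", "", "", "", ""),
]
cur.executemany("INSERT INTO panama VALUES (?,?,?,?,?,?,?,?,?,?)", rows)

# Which officers are named the most?
top = cur.execute("""SELECT officer_name, COUNT(*) AS c FROM panama
                     GROUP BY officer_name
                     ORDER BY c DESC LIMIT 10""").fetchall()
print(top)
```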
PS FWIW, I’d consider the above to be a basic skill for anyone who calls themselves an information professional… Which includes the Library…;-) [To qualify that, here’s an example question: “I just found this data on the Panama Papers and want to see which people seemed to be directors of a lot of companies; can I do that?”]
Some reflections on reading a subscription based, “scholarly” ebook just now…
First, I can read the book online, or download it.
If I download it I need to get myself a special pair of spectacles to read the magic ink it’s written in.
I also need to say how long I want the “loan period” to be. (I don’t know if this is metered in multiples of 24 hours, or runs to the end of a calendar based return day.) At the end of the loan period, I think I can keep the book, but suspect that a library rep will come round to my house and either lock the book in a box to which only they have the key, or run the pages through a shredder (I’m not sure which).
Looking at the online copy of the book, there are various quotas associated with how I can use it.
A tool bar provides me with various options: the Image view is a crappy resolution view, albeit one that provides an infinite scroll through the book.
The PDF view lets me view a PDF version of the current page, though I can’t copy from it. (I do seem to be able to download it though, using the embedded PDF reader, without affecting any quotas?)
If I select the Copy option, it takes me into the PDF view and does let me highlight and copy text from that page. However, if I try to copy from too many pages, that “privilege” is removed…
As far as user experience goes, it’s pretty rubbish on first use, and many of the benefits of having the electronic version, as compared to a print version, have been defensively (aggressively?!) coded against. This doesn’t achieve anything other than introduce inconvenience. So for example, having run out of my copy quota, I manually typed out the sentence I wasn’t allowed to highlight and copy with cmd/ctrl-C.
A couple of months ago, I posted A Quick Round-Up of Some *-Recognition Service APIs that described several off-the-shelf cloud hosted services from Google and IBM for processing text, audio and images.
Microsoft’s Cognitive Services suite covers similar ground – so what’s on offer?
- Computer Vision API: extract semantic features from an image, identify famous people (for some definition of “famous” that I can’t fathom), and extract text from images; 5,000 free transactions per month;
- Emotion API: extract emotion features from a photo of a person; 30,000 free transactions per month;
- Face API: extract face specific information from an image (location of facial features in an image); 30,000 free transactions per month;
- Video API: 300 free transactions per month per feature.
- Custom Recognition Intelligent Service (CRIS): customise the acoustic environment for speaker/speech recognition services; free private preview by invitation;
- Speaker Recognition API: identify the person speaking in an audio file; 10,000 free transactions per month;
- Speech API: speech to text and text to speech services; 5,000 free transactions per month
- Bing Spell Check API: 5,000 free transactions per month
- Language Understanding Intelligent Service (LUIS): language models for parsing texts; 100,000 free transactions per month;
- Linguistic Analysis API: NLP sentence parser, I think… (tokenisation, parts of speech tagging, etc.). It’s dog slow and, from the times I got it to sort of work, simple sentences seem to be about the limit of what it can cope with (and even then it takes forever); 5,000 free transactions per month, 120 per minute (but you’d be lucky to get anything done in a minute…);
- Text Analytics API: sentiment analysis, topic detection and key phrase detection, language detection; 5,000 free transactions per month;
- Web Language Model API: “wordsplitter” – put in a string of words as a single string with space characters removed, and it’ll try to split the words out; 100,000 free transactions per month.
- Academic Knowledge API: search Microsoft Academic? 10,000 transactions per month;
- Entity Linking Intelligence Service: 1000 free transactions per day;
- Knowledge Exploration Service: not sure….?! Up to 10,000 objects, 1000 transactions free;
- Recommendations API: recommender based on your own data? 10,000 free transactions per month.
- Bing Autosuggest API: Up to 10,000 transactions per month;
- Bing Image Search API: Up to 1,000 transactions per month across all Bing Search APIs;
- Bing News Search API: Up to 1,000 free transactions per month across all Bing Search APIs;
- Bing Video Search API: Up to 1,000 free transactions per month across all Bing Search APIs;
- Bing Web Search API: Up to 1,000 transactions per month across all Bing Search APIs.
There’s also a gallery of demo apps built around the APIs.
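To give a flavour of what calling one of these APIs involves, here’s a hedged sketch, using only the Python standard library, of packaging up a request to the Text Analytics sentiment endpoint. The URL and JSON shape follow the Cognitive Services docs at the time of writing – check the current docs before relying on them – and the subscription key is a placeholder you’d get from the Azure portal:

```python
# Build (but don't send) a request to the Text Analytics sentiment API.
import json
import urllib.request

API_URL = ("https://westus.api.cognitive.microsoft.com"
           "/text/analytics/v2.0/sentiment")
SUBSCRIPTION_KEY = "YOUR_KEY_HERE"  # placeholder - from the Azure portal

def build_sentiment_request(texts):
    """Package a list of strings as a Text Analytics 'documents' payload."""
    payload = {"documents": [
        {"id": str(i), "language": "en", "text": t}
        for i, t in enumerate(texts, start=1)
    ]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    )

req = build_sentiment_request(["I love this API", "This is dreadful"])
# To actually send it (which uses up your free transaction quota):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
print(json.loads(req.data)["documents"][0]["text"])
```

The response, per the docs, is a JSON list of per-document sentiment scores between 0 and 1.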
It seems then that we’ve moved into an era of commodity computing at the level of automated identification and metadata services, though many of them are still pretty ropey… The extent to which they are developed and continue to improve will be the proof of just how useful they will be as utility services.
As far as the free usage caps on the Microsoft services go, there seems to be a reasonable amount of freedom built in for folk who might want to try out some of these services in a teaching or research context. (I’m not sure if there are blocks for these services that can be wired into the experiment flows in Azure Machine Learning Studio?)
I also wonder whether these are just the sorts of service that libraries should be aware of, and perhaps even work with in an informationista context…?!;-)
PS from the face, emotion and vision APIs, and perhaps entity extraction and sentiment analysis applied to any text extracted from images, I wonder if you could generate a range of stories automagically from a set of images. Would that be “art”? Or just #ds106 style playfulness?!
PPS Nov 2016 for photo-tagging, see also Amazon Rekognition.