Archive for the ‘Anything you want’ Category
I came across Apache Tika a few weeks ago, a service that will identify pretty much any document type based on its metadata, and will have a good go at extracting text from it.
With a prompt and a 101 from @IgorBrigadir, it was pretty easy getting started with it – sort of…
First up, I needed to get the Apache Tika server running. As there’s a containerised version available on dockerhub (logicalspark/docker-tikaserver), it was simple enough for me to fire up a server in a click using tutum (as described in this post on how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour; pretty much all you need to do is fire up a server, start a container based on logicalspark/docker-tikaserver, and tick to make the port public…)
His suggested recipe for using the Python requests library borked for me – I couldn’t get Python to open the file to get the data bits to send to the server (file encoding issues; one reason for using Tika is that it’ll try to accept pretty much anything you throw at it…)
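On reflection, the file-open failures probably came from opening the file in text mode; opening it in binary mode (“rb”) should sidestep the encoding issue altogether, since Tika just wants the raw bytes. A minimal sketch (untested against a live server – the server URL is a placeholder for wherever your Tika container is running):

```python
import requests

def tika_rmeta(path, server="http://example.com:9998"):
    """PUT a document to a Tika server's /rmeta endpoint and return the parsed JSON."""
    # Open in binary mode - Tika accepts raw bytes, so no text decoding is needed
    with open(path, "rb") as f:
        r = requests.put(server + "/rmeta", data=f)
    r.raise_for_status()
    return r.json()
```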
I had a look at pycurl:
!apt-get install -y libcurl4-openssl-dev
!pip3 install pycurl
but couldn’t make head or tail of how to use it: the pycurl equivalent of curl -T foo.doc http://example.com:9998/rmeta can’t be that hard to write, can it? (Translations appreciated via the comments…;-)
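For what it’s worth, here’s my best guess at a pycurl translation of that curl -T call (which is just an HTTP PUT with the file as the request body) – untested against a live server, so treat it as a sketch rather than the definitive answer:

```python
import json
import os
from io import BytesIO

import pycurl

def tika_rmeta_pycurl(path, url="http://example.com:9998/rmeta"):
    """Upload a file to a Tika server with PUT (the pycurl equivalent of curl -T)."""
    buf = BytesIO()
    c = pycurl.Curl()
    with open(path, "rb") as f:
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.UPLOAD, 1)                     # curl -T sends the file as a PUT body
        c.setopt(pycurl.READFUNCTION, f.read)          # feed the request body from the file
        c.setopt(pycurl.INFILESIZE, os.path.getsize(path))
        c.setopt(pycurl.WRITEFUNCTION, buf.write)      # collect the response body
        c.perform()
        c.close()
    return json.loads(buf.getvalue().decode("utf-8"))
```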
Instead I took the approach of dumping the result of a curl request on the command line into a file:
!curl -T Text/foo.doc http://example.com:9998/rmeta > tikatest.json
and then grabbing the response out of that:
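Pulling the extracted text back out of the dumped file is then just a case of loading the JSON. If I’m reading the response format right, /rmeta returns a list of metadata records (one per document, including any embedded documents), with the extracted text in an “X-TIKA:content” field:

```python
import json

def tika_text(jsonfile):
    """Pull the extracted text out of a saved Tika /rmeta response."""
    with open(jsonfile) as f:
        records = json.load(f)
    # /rmeta returns a list of records, one per (embedded) document
    return [r.get("X-TIKA:content", "") for r in records]
```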
Not elegant, and far from ideal, but a stopgap for now.
Part of the response from the Tika server is the text extracted from the document, which can then provide the basis for some free text analysis…
I haven’t tried with any document types other than crappy old MS Word .doc formats, but this looks like it could be a really easy tool to use.
And with the containerised version available, and tutum and Digital Ocean to hand, it’s easy enough to fire up a version in the cloud, let alone on my desktop, whenever I need it :-)
I’m not sure how many Chrome users follow any of the Google blogs that occasionally describe forthcoming updates to Google warez, but if you don’t you perhaps don’t realise quite how frequently things change. My browser, for example, is at something like version 40, even though I never consciously update it.
One thing I only noticed recently is that a tab has appeared in the top right hand corner of the browser showing that I’m logged in (to the browser) with a particular Google account. There doesn’t actually appear to be an option to log out – I can switch user or go incognito – and I’m not sure I remember ever consciously logging in to it (actually, maybe a hazy memory, from when I wanted to install a particular extension), and I have no idea what it actually means for me to be logged in.
Via the Google Apps Update blog, I learned today that being logged in to the browser will soon support seamless synching of my Google docs into my Chrome browser environment (Offline access to Google Docs editors auto-enabled when signing into Chrome browser on the web). Following a pattern popularised by Apple, Google are innovating on our behalf and automatically opting us in to behaviours they think make sense for us. So just bear that in mind when you write a ranty resignation letter in Google docs and wonder why it’s synched to your work computer on your office desk:
Note that Google Apps users should not sign into a Chrome browser on public/non-work computers with their Google Apps accounts to avoid unintended file syncing.
If you actually have several Google apps accounts (for example, I have a personal one, and a couple of organisational ones: an OU one, an OKF one), I assume that the only docs that are synched are the ones on an account that matches the account I have signed in to in the browser. That said, synch permissions may be managed centrally for organisational apps accounts:
Google Apps admins can still centrally enable or disable offline access for their domain in the Admin console… Existing settings for domain-level offline access will not be altered by this launch.
I can’t help but admit that even though I won’t have consciously opted in to this feature, just like I don’t really remember logging in to Chrome on my desktop (how do I log out???), and I presumably agreed to something when I installed Chrome to let it keep updating itself without prompting me, I will undoubtedly find it useful one day: on a train, perhaps, when trying to update a document I’d forgotten to synch. It will be so convenient I will find it unremarkable, not noticing I can now do something I couldn’t do as easily before. Or I might notice, with a “darn, I wish I’d…” followed by an “oh, cool [kewel…], I can…”.
“‘Oceania has always been at war with Eastasia.'” [George Orwell, 1984]
Just like when – after being sure I’d disabled, or explicitly not opted in to, any sort of geo-locating or geo-tracking behaviour on my Android phone – I found I must have left a door open somewhere (or been automatically opted in to something I hadn’t appreciated when agreeing to a particular update; or, by proxy, agreeing to allow something to update itself automatically and without prompting, with implied or explicit permission to automatically opt me in to new features…), and discovered I could locate my misplaced phone using the Android Device Manager (Where’s My Phone?).
This idea of allowing applications to update themselves in the background and without prompting is something we have become familiar with in many web apps, and in desktop apps such as Google Chrome, though many apps do still require the user to either accept the update or take an even more positive action to install an update when notified that one is available. (It seems that ever fewer apps require you to specifically search for updates…)
In the software world, we have gone from a world where the things we buy were immutable, to one where we could search for and install updates (eg to operating systems or software applications), then accept updates when alerted to the fact that one was available, to automatically (and invisibly) accepting updates.
In turn, many physical devices have gone from being purely mechanical affairs, to electro-mechanical ones, to logical-electro-mechanical devices (for example, that include logic elements hardwired into silicon), to ones containing factory programmable hardware devices (PROMs, programmable Read Only Memories), to devices that run programmable and then reprogrammable firmware (that is to say, software).
If you have a games console, a Roku or MyTV box, or Smart TV, you’ve probably already been prompted to get a (free) online update. I don’t know, but could imagine, new top end cars having engine management system updates at regular service events.
However, one thing we perhaps don’t fully appreciate is that these updates can also be used to limit functionality that our devices previously had. If the updates are applied seamlessly (without permission, in the background) this may come as something of a surprise. [Cf. the complementary issue of vendors having access to “their” content on “your” machine, as described here by the Guardian: Amazon wipes customer’s Kindle and deletes account with no explanation]
A good example of loss of functionality arising from an (enforced, though self-applied) firmware update was reported recently in the context of hobbyist drones:
On Wednesday, SZ DJI Technology, the Chinese company responsible for the popular DJI Phantom drones that online retailers sell for less than $500, announced that it had prepared a downloadable firmware update for next week that will prevent drones from taking off in restricted zones and prevent flight into those zones.
Michael Perry, a spokesman for DJI, told the Guardian that GPS locating made such an update possible: “We have been restricting flight near airports for almost a year.”
“The compass can tell when it is near a no-fly zone,” Perry said. “If, for some reason, a pilot is able to fly into a restricted zone and then the GPS senses it’s in a no-fly zone, the system will automatically land itself.”
DJI’s new Phantom drones will ship with the update installed, and owners of older devices will have to download it in order to receive future updates.
What correlates might be applied to increasingly intelligent cars, I wonder?! Or at the other extreme, phones..?
PS How to log out of Chrome You need to administer yourself… From the Chrome Preferences Settings (sic), Disconnect your Google account.
Note that you have to take additional action to make sure that you remove all those synched presentations you’d prepared for job interviews at other companies from the actual computer…
Take care out there…!;-)
A couple of years ago or so, Dropbox ran a promotion for academic users granting 15GB of space. Yesterday, I got an email:
As part of your school’s participation in Space Race, you received 15 GB of additional Dropbox space. The Space Race promotional period expires on March 4, 2015, at which point your Dropbox limit will automatically return to 5 GB.
As a friendly reminder, you’re currently using 14.6 GB of Dropbox space. If you’re over your 5 GB limit after March 4, you’ll no longer be able to save new photos, videos, and docs to Dropbox.
Need more space? Dropbox Pro gives you 1 TB of space to keep everything safe, plus advanced sharing controls, remote wipe for lost devices, and priority support. Upgrade before March 4 and we’ll give you 30% off your first year.
My initial thought was to tweet:
but then I thought again… The discounted price on a monthly payment plan is £5.59/month, which on PayPal converted this month to $8.71. I use Dropbox all the time, and it forms part of my workflow for using Leanpub. As it’s the start of the month, I received a small royalty payment for the Wrangling F1 Data With R book. The Dropbox fee is about the amount I’m getting per book sold, so it seems churlish not to subscribe to Dropbox – it is part of the cost of doing business, as it were.
The Dropbox subscription gets me 1TB, so this also got me thinking:
- space is not now an issue, so I can move the majority of my files to Dropbox, not just a selection of folders;
- space is not now an issue, so I can put all my github clones into Dropbox;
- space is not now an issue, so though it probably goes against the terms of service, I guess I could set up toplevel “family member” folders and we could all share the one subscription account, just selectively synching our own folders?
In essence, I can pretty much move to Dropbox (save for those files I don’t want to share/expose to US servers etc etc; just in passing, one thing Dropbox doesn’t seem to want to let me do is change the account email to another email address that I have another Dropbox account associated with. So I have a bit of an issue with juggling accounts…)
When I started my Wrangling F1 Data With R experiment, the intention was always to make use of any royalties to cover the costs associated with that activity. Leanpub pays out if you are owed more than $40 collected in the run up to 45 days ahead of a payment date (so the Feb 1st payout was any monies collected up to mid-December and not refunded since). If I reckon on selling 10 books a month, that gives me about $75 at current running. Selling 5 a month (so one a week) means it could be hit or miss whether I make the minimum amount to receive a payment for that month. (I could of course put the price up. Leanpub lets you set a minimum price but allows purchasers to pay what they want. I think $20 is the highest amount paid for a copy I’ve had to date, which generated a royalty of $17.50 (whoever that was – thank you :-)) You can also give free or discounted promo coupons away.) As part of the project is to explore ways of identifying and communicating motorsport stories, I’ve spent royalties so far on:
- a subscription to GP+ (not least because I aspire to getting a chart in there!;-);
- a subscription to the Autosport online content, in part to gain access to forix, which I’d forgotten is rubbish;
- a small donation to sidepodcast, because it’s been my favourite F1 podcast for a long time.
Any books I buy in future relating to sports stats or motorsport will be covered henceforth from this pot. Any tickets I buy for motorsport events, and programmes at such events, will also be covered from this pot. Unfortunately, the price of an F1 ticket/weekend is just too much. A Sky F1 Channel subscription or day passes is also ruled out because I can’t for the life of me work out how much it’ll cost or how to subscribe; but I suspect it’ll be more than the £10 or so I’d be willing to pay per race (where race means all sessions in a race weekend). If my F1 iOS app subscription needs updating that’ll also count. Domain name registration (for example, I recently bought f1datajunkie.com) is about £15/$25 a year from my current provider. (Hmm, that seems a bit steep?) I subscribe to Racecar Engineering (£45/$70 or so per year), the cost of which will get added to the mix. A “big ticket” item I’m saving for (my royalties aren’t that much) on the wants list is a radio scanner to listen in to driver comms at race events (I assume it’d work?). I’d like to be able to make a small regular donation to help keep the ergast site on, but can’t see how to… I need to bear in mind tax payments, but also consider the above as legitimate costs of a self-employed business experiment.
I also figure that as an online publishing venture, any royalties should also go to supporting other digital tools I make use of as part of it. Some time ago, I bought in to the pinboard.in social bookmarking service, I used to have a flickr pro subscription (hmm, I possibly still do? Is there any point…?!) and I spend $13 a year with WordPress.com on domain mapping. In the past I have also gone ad-free ($30 per year). I am considering moving to another host such as Squarespace ($8 per month), because WordPress is too constraining, but am wary of what the migration will involve and how much will break. Whilst self-hosting appeals, I don’t want the grief of doing my own admin if things go pear shaped.
I’m a heavy user of RStudio, and have posted a couple of Shiny apps. I can probably get by on the shinyapps.io free plan for a bit (10 apps) – just – but the step up to the basic plan at $39 a month is too steep.
I used to use Scraperwiki a lot, but have moved away from running any persistent scrapers for some time now. morph.io (which is essentially Scraperwiki classic) is currently free – though looks like a subscription will appear at some point – so I may try to get back into scraping in the background using that service. The Scraperwiki commercial plan is $9/month for 10 scrapers, $29 per month for 100. I have tended in the past to run very small scrapers, which means the number of scrapers can explode quickly, but $29/month is too much.
I also make use of github on a free/open plan, and while I don’t currently have any need for private repos, the entry level micro-plan ($7/month) offers 5. I guess I could use a (private?) github rather than Dropbox for feeding Leanpub, so this might make sense. Of course, I could just treat such a subscription as a regular donation.
It would be quite nice to have access to IPython notebooks online. The easiest solution to this is probably something like wakari.io, which comes in at $25/month, which again is a little bit steep for me at the moment.
In my head, I figure £5/$8/month is about one book per month, £10/$15 is two, £15/$20 is three, £25/$40 is 5. I figure I use these services and I’m making a small amount of pin money from things associated with that use. To help guarantee continuity in provision and maintenance of these services, I can use the first step of a bucket brigade style credit apportionment mechanism to redistribute some of the financial benefits these services have helped me realise.
Ideally, what I’d like to do is spend royalties from 1 book per service per month, perhaps even via sponsored links… (Hmm, there’s a thought – “support coupons” with minimum prices set at the level to cover the costs of running a particular service for one month, with batches of 12 coupons published per service per year… Transparent pricing, hypothecated to specific costs!)
Of course, I could also start looking at running my own services in the cloud, but the additional time cost of getting up and running, as well as hassle of administration, and the stress related to the fear of coping in the face of attack or things properly breaking, means I prefer managed online services where I use them.
A couple of days ago, I came across a dataset on figshare (a data sharing site) detailing the article processing charges (APCs) paid by the University of Portsmouth to publishers in 2014. After I casually (lazily…;-) remarked on the existence of this dataset via Twitter, Owen Stephens/@ostephens referred me to a JISC project that is looking at APCs in more detail, with prototype data explorer here: All APC demonstrator [Github repository].
The project looks as if it is part of Jisc Collections’ look at the Total Cost of Ownership in the context of academic publishing, summing things like journal subscription fees along side “article processing charges” (which I’d hope include page charges?).
If you aren’t in academia, you may not realise that what used to be referred to as ‘vanity publishing’ (paying to get your first novel or poetry collection published) is part of the everyday practice of academic publishing. But it isn’t called that, obviously, because your work also has to be peer reviewed by other academics… So it’s different. It’s “quality publishing”.
Peer review is, in part, where academics take on the ownership of the quality aspects of academic publishing, so if the Total Cost of Ownership project is trying to be relevant to institutions and not just to JISC, I wonder if there should also be columns in the costing spreadsheet relating to the work time academics spend reviewing other people’s articles, editing journals, and so on. This is separate from the presentational costs, obviously, because you can’t just write a paper and submit it: you have to submit it in an appropriately formatted document and “camera ready” layout, which can add a significant amount of time to preparing a paper for publication. So you do the copy editing and layout too. And so any total costing to an academic institution of the research publishing racket should probably include this time as well. But that’s by the by.
The data that underpins the demonstrator application was sourced from a variety of universities and submitted in spreadsheet form. A useful description (again via @ostephens) of the data model can be found here: APC Aggregation: Data Model and Analytical Usage. Looking at it, it just seems to cover APCs.
APC data relating to the project can be found on figshare. I haven’t poked around in the demonstrator code or watched its http traffic to see if there are API calls onto the aggregated data that provide another way in to it.
As well as page charges, there are charges associated with subscription fees to publishers. Publishers don’t like this information getting out on grounds of commercial sensitivity, and universities don’t like publishing it presumably on grounds of bringing themselves into disrepute (you spend how much?!), but there is some information out there. Data from a set of FOI requests about journal subscriptions (summarised here), for example. If you want to wade through some of the raw FOI responses yourself, have a look on WhatDoTheyKnow: FOI requests: “journal costs”.
Tim Gowers also wrote compellingly about his FOI escapades trying to track down journal subscription costs data: Elsevier journals – some facts.
This is all very well, but is it in anyway useful? I have no idea. One thing I imagined that might be quite amusing to explore was the extent to which journal subscriptions paid their way (or were “cost effective”). For example, looking at institutional logs, how often are (articles from) particular journals being accessed or downloaded either for teaching or research purposes? (Crudely: teaching – access comes from a student account; research – access from a research account.) On the other hand, for the research outputs of the institution, how many things are being published into a particular journal, and how many citations appear in those outputs to other publications.
We might take the line that use demonstrates value, with use captured as downloads from, publications into, or references to, a particular journal. (That’s very crude, but then I’m approaching this as a possible recreational data exercise, not a piece of formal research. And yes – I know, journals are often bundled up in subscription packages together, and just like Sky blends dross with desirable channels in its subscription deals, I suspect academic publishers do too… But then, we could start to check these based on whether particular journals in a bundle are ever accessed, ever referenced, ever published into within a particular organisation, etc. Citation analysis can also help here – for example, if 5 journals all heavily cite each other, and one publisher publishes 3 of those, it could make sense for them to bundle two of the journals into one package and the third into another, so that if you’re researching topics that are reported by heavily linked articles across those journals, you are essentially forced into subscribing to both packages. Without having a look at citation network analyses and subscription bundles, I can’t check that outlandish claim of course;-)
Erm… that’s it…
PS see also Evaluating big deal journal bundles (via @kpfssport)
PPS for a view from the publishers’ side on the very real costs associated with publishing, as well as a view on how academia and business treat employment costs and “real” costs in rather contrasting ways, see Time is Money: Why Scholarly Communication Can’t Be Free.
Whilst preparing for what turned out to be a very enjoyable day at the BBC Data Day in Birmingham on Tuesday, where I ran a session on Open Refine [slides], I’d noticed that one of the transformations Open Refine supports is hashing using either MD5 or SHA-1 algorithms. What these functions essentially do is map a value, such as a name or personal identifier, on to what looks like a random number. The mapping is one way, so given the hash value of a name or personal identifier, you can’t go back to the original. (The way the algorithms work means that there is also a very slight possibility that two different original values will map on to the same hashed value, which may in turn cause errors when analysing the data.)
We can generate the hash of values in a column by transforming the column using the formula md5(value) or sha1(value).
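The same transformation is easy to reproduce outside of OpenRefine – for example, Python’s hashlib library produces the same hex digests as GREL’s md5() and sha1() functions (assuming UTF-8 encoding of the string value):

```python
import hashlib

def md5_value(value):
    """Equivalent of GREL md5(value): hex digest of the UTF-8 encoded string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def sha1_value(value):
    """Equivalent of GREL sha1(value)."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()
```

Because the same input always produces the same digest, rows sharing a vendor name remain linkable after hashing.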
If I now save the data using the transformed (hashed) vendor name (either the SHA-1 hash or the MD5 hash), I can release the data without giving away the original vendor name, but whilst retaining the ability to identify all the rows associated with a particular vendor name.
One of the problems with MD5 and SHA-1 algorithms from a security point of view is that they run quickly. This means that a brute force attack can take a list of identifiers (or generate a list of all possible identifiers), run them through the hashing algorithm to get a set of hashed values, and then look up a hashed value to see what original identifier generated it. If the identifier is a fixed length and made from a fixed alphabet, the attacker can easily generate each possible identifier.
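The attack is easy enough to sketch: enumerate the identifier space, precompute a hash-to-identifier lookup table, then reverse any leaked hash with a dictionary lookup. (The identifier format here is made up for illustration – two uppercase letters, so only 676 candidates.)

```python
import hashlib
from itertools import product
from string import ascii_uppercase

def reverse_table(alphabet, length):
    """Precompute hash -> identifier for every possible fixed-length identifier."""
    table = {}
    for chars in product(alphabet, repeat=length):
        ident = "".join(chars)
        table[hashlib.md5(ident.encode()).hexdigest()] = ident
    return table

# A supposedly anonymised hash is trivially reversed by lookup
table = reverse_table(ascii_uppercase, 2)
leaked_hash = hashlib.md5(b"XY").hexdigest()
recovered = table[leaked_hash]  # recovers "XY"
```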
One way of addressing this problem is to just add salt… In cryptography, a salt (sic) is a random term that you add to a value before generating the hash value. This has the advantage that it makes life harder for an attacker trying a brute force search but is easy to implement. If we are anonymising a dataset, there are a couple of ways we can quickly generate a salt term. The strongest way to do this is to generate a column containing unique random numbers or terms as the salt column, and then hash on the original value plus the salt. A weaker way would be to use the values of one of the other columns in the dataset to generate the hash (ideally this should be a column that doesn’t get shared). Even weaker would be to use the same salt value for each hash; this is more akin to adding a password term to the original value before hashing it.
Unfortunately, in the first two approaches, if we create a unique salt for each row, this will break any requirement that a particular identifier, for example, is always encoded as the same hashed value (we need to guarantee this if we want to do analysis on all the rows associated with it, albeit with those rows identified using the hashed identifier). So when we generate the salt, we ideally want a unique random salt for each identifier, and that salt to remain consistent for any given identifier.
If you look at the list of available GREL string functions you will see a variety of methods for processing string values that we might be able to combine to generate some unique salt values, trusting that an attacker is unlikely to guess the particular combination we have used to create the salt values. In the following example, I generate a salt that is a combination of a “fingerprint” of the vendor name (which will probably, though not necessarily, be different for each vendor name, and add to it a secret fixed “password” term). This generates a consistent salt for each vendor name that is (probably) different from the salt of every other vendor name. We could add further complexity by adding a number to the salt, such as the length of the vendor name (value.length()) or the remainder of the length of the vendor name divided by some number (value.length()%7, for example, in this case using modulo 7).
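In Python terms, the “consistent per-identifier salt plus a secret password term” recipe amounts to a keyed hash, and the standard construction for that is HMAC – a sketch, with the secret obviously to be kept out of the published dataset (note that HMAC-SHA256 still runs quickly, so for serious anonymisation work a deliberately slow scheme would be preferred; this just shows the shape of the idea):

```python
import hashlib
import hmac

SECRET = b"my-secret-password-term"  # illustrative only - never publish this

def anonymise(value):
    """Keyed hash of an identifier: consistent for a given value, but hard to
    brute force without knowing the secret (the keyed equivalent of value + salt)."""
    return hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same identifier always maps to the same digest, so the rows for a given vendor stay linkable, while an attacker without the secret can no longer just hash candidate names and look them up.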
Having generated a salt column (“Salt”), we can then create hash values of the original identifier and the salt value. The following shows both the value to be hashed (as a combination of the original value and the salt) and the resulting hash.
As well as masking identifiers, anonymisation strategies also typically require attention to items that can be uniquely identified because of their low occurrence in a dataset. For example, in an educational dataset, a particular combination of subjects or subject results might uniquely identify an individual. Imagine a case in which each student is given a unique ID, the IDs are hashed, and a set of assessment results is published containing (hashed_ID, subject, grade) data. Now suppose that only one person is taking a particular combination of subjects; that fact might then be used to identify their hashed ID from the supposedly anonymised data and associate it with that particular student.
OpenRefine may be able to help us identify possible problems in this respect by means of its faceted search tools. Whilst not a very rigorous approach, you could, for example, try querying the dataset with particular combinations of facet values to see how easily you might be able to identify unique individuals. In the above example of (hashed_ID, subject, grade) data, suppose I know there is only one person taking the combination of Further Maths and Ancient Greek, perhaps because there was an article somewhere about them, although I don’t know what other subjects they are taking. If I do a text facet on the subject column and select the Further Maths and Ancient Greek values, filtering results to students taking either of those subjects, and I then create a facet on the hashed ID column, showing results by count, there would only be one hashed ID value with a count of 2 rows (one row corresponding to their Further Maths participation, the other to their participation in Ancient Greek). I can then invade that person’s privacy by searching on this hashed ID value to find out what other subjects they are taking.
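The same sort of uniqueness check can be scripted rather than clicked through: count how many distinct individuals share each combination of quasi-identifying values and flag the combinations held by just one person (a crude, hand-rolled version of a k-anonymity check – the row data here is made up for illustration):

```python
from collections import defaultdict

def risky_combinations(rows, k=2):
    """Given (person_id, attribute) rows, find attribute-sets shared by fewer than k people."""
    # Collect each person's full set of attributes (e.g. subjects taken)
    attrs_by_person = defaultdict(set)
    for person, attr in rows:
        attrs_by_person[person].add(attr)
    # Group people by their attribute-set and flag the small groups
    groups = defaultdict(set)
    for person, attrs in attrs_by_person.items():
        groups[frozenset(attrs)].add(person)
    return {attrs: people for attrs, people in groups.items() if len(people) < k}

rows = [("h1", "Further Maths"), ("h1", "Ancient Greek"), ("h1", "Physics"),
        ("h2", "Physics"), ("h3", "Physics"), ("h2", "Maths"), ("h3", "Maths")]
risky = risky_combinations(rows)
# "h1" is the only person with that subject combination, so they are re-identifiable
```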
Note that I am not a cryptographer or a researcher into data anonymisation techniques. To do this stuff properly, you need to talk to someone who knows how to do it properly. The technique described here may be okay if you just want to obscure names/identifiers in a dataset you’re sharing with work colleagues without passing on personal information, but it really doesn’t do much more than that.
PS A few starting points for further reading: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, Privacy, Anonymity, and Big Data in the Social Sciences and The Algorithmic Foundations of Differential Privacy.
Here are some quotes (taken from the book) that made sense to me in the context of process…
To begin with:
[Art’s] most important role is to make meaning. [p111]
So is what you’re doing meaningful to you?
As an artist, the ability to resist peer pressure, to trust one’s own judgement, is vital but it can be a lonely and anxiety-inducing procedure. [p18]
You can work ideas…
It’s … a great joy to learn a technique, because as soon as you learn it, you start thinking in it. When I learn a new technique my imaginative possibilities have expanded. Skills are really important to learn; the better you get at a skill, the more you have confidence and fluency. [p122]
Artists have historically worked technology…
In the past artists were the real innovators of technology. [p100]
…but can we work technology, as art?
The metaphor that best describes what it’s like for my practice as an artist is that of a refuge, a place inside my head where I can go on my own and process the world and its complexities. It’s an inner shed in which I can lose myself. [p131]
I said the final farewell to a good friend and great social technology innovator, Ches Lincoln, who died on Christmas Eve, last week. In one project we worked together on, an online learning space, we created a forum called “The Shed” for those learners who wanted to get really geeky and techie in their questions and discussions, and whose conversations risked scaring off the majority.
The sound a box of Lego makes is the noise of a child’s mind working, looking for the right piece. [p116]
For some reason, when I first saw this:
it reminded me of Hockney. The most recent thing I’ve read, or heard, about Hockney was a recent radio interview (BBC: At Home With David Hockney), in the opening few minutes of which we hear Hockney talking about photographs, how the chemical period of photography, and its portrayal of perspective, is over, and how a new age of digital photography allows us to reimagine this sort of image making.
Hockney also famously claimed that lenses have played an important part in the craft of image making (Observer: What the eye didn’t see…). Whether or not people have used lenses to create particular images, the things seen through lenses, and the idea of how lenses may help us see the world differently, may in turn influence the way we see other things, either in the mind’s eye, in where we focus attention, or in how we construct or compose an image of our own making.
As the above animated image shows, when a lens is combined with a logical transformation (that may be influenced by a socio-cultural and/or language context acting as a filter) of the machine interpretation of a visual scene represented as bits, it’s not exactly clear what we do see…!
As Hockney supposedly once said, “[t]he photograph isn’t good enough. It’s not real enough”.