OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Paying for Dropbox and Other Useful Bits… (The Cost of Doing Business…)

with one comment

A couple of years ago or so, Dropbox ran a promotion for academic users granting 15GB of space. Yesterday, I got an email:

As part of your school’s participation in Space Race, you received 15 GB of additional Dropbox space. The Space Race promotional period expires on March 4, 2015, at which point your Dropbox limit will automatically return to 5 GB.

As a friendly reminder, you’re currently using 14.6 GB of Dropbox space. If you’re over your 5 GB limit after March 4, you’ll no longer be able to save new photos, videos, and docs to Dropbox.

Need more space? Dropbox Pro gives you 1 TB of space to keep everything safe, plus advanced sharing controls, remote wipe for lost devices, and priority support. Upgrade before March 4 and we’ll give you 30% off your first year.

My initial thought was to tweet:

but then I thought again… The discounted price on a monthly payment plan is £5.59/month, which PayPal converted this month to $8.71. I use Dropbox all the time, and it forms part of my workflow for using Leanpub. As it’s the start of the month, I received a small royalty payment for the Wrangling F1 Data With R book. The Dropbox fee is about the amount I’m getting per book sold, so it seems churlish not to subscribe to Dropbox – it is part of the cost of doing business, as it were.

The Dropbox subscription gets me 1TB, so this also got me thinking:

  • space is not now an issue, so I can move the majority of my files to Dropbox, not just a selection of folders;
  • space is not now an issue, so I can put all my github clones into Dropbox;
  • space is not now an issue, so though it probably goes against terms of service, I guess I could set up toplevel “family member” folders and we could all share the one subscription account, just selectively synching our own folders?

In essence, I can pretty much move to Dropbox (save for those files I don’t want to share/expose to US servers etc etc; just in passing, one thing Dropbox doesn’t seem to want to let me do is change the account email to another email address that I have another Dropbox account associated with. So I have a bit of an issue with juggling accounts…)

When I started my Wrangling F1 Data With R experiment, the intention was always to make use of any royalties to cover the costs associated with that activity. Leanpub pays out if you are owed more than $40 collected in the run up to 45 days ahead of a payment date (so the Feb 1st payout was any monies collected up to mid-December and not refunded since). If I reckon on selling 10 books a month, that gives me about $75 at current running. Selling 5 a month (so one a week) means it could be hit or miss whether I make the minimum amount to receive a payment for that month. (I could of course put the price up. Leanpub lets you set a minimum price but allows purchasers to pay what they want. I think $20 is the highest amount paid for a copy I’ve had to date, which generated a royalty of $17.50 (whoever that was – thank you :-)) You can also give free or discounted promo coupons away.) As part of the project is to explore ways of identifying and communicating motorsport stories, I’ve spent royalties so far on:

  • a subscription to GP+ (not least because I aspire to getting a chart in there!;-);
  • a subscription to the Autosport online content, in part to gain access to forix, which I’d forgotten is rubbish;
  • a small donation to sidepodcast, because it’s been my favourite F1 podcast for a long time.

Any books I buy in future relating to sports stats or motorsport will be covered henceforth from this pot. Any tickets I buy for motorsport events, and programmes at such events, will also be covered from this pot. Unfortunately, the price of an F1 ticket/weekend is just too much. A Sky F1 Channel subscription or day passes is also ruled out because I can’t for the life of me work out how much it’ll cost or how to subscribe; but I suspect it’ll be more than the £10 or so I’d be willing to pay per race (where race means all sessions in a race weekend). If my F1 iOS app subscription needs updating that’ll also count. Domain name registration (for example, I recently bought f1datajunkie.com) is about £15/$25 a year from my current provider. (Hmm, that seems a bit steep?) I subscribe to Racecar Engineering (£45/$70 or so per year), the cost of which will get added to the mix. A “big ticket” item I’m saving for (my royalties aren’t that much) on the wants list is a radio scanner to listen in to driver comms at race events (I assume it’d work?). I’d like to be able to make a small regular donation to help keep the ergast site on, but can’t see how to… I need to bear in mind tax payments, but also consider the above as legitimate costs of a self-employed business experiment.

I also figure that as an online publishing venture, any royalties should also go to supporting other digital tools I make use of as part of it. Some time ago, I bought in to the pinboard.in social bookmarking service, I used to have a flickr pro subscription (hmm, I possibly still do? Is there any point…?!) and I spend $13 a year with WordPress.com on domain mapping. In the past I have also gone ad-free ($30 per year). I am considering moving to another host such as Squarespace ($8 per month), because WordPress is too constraining, but am wary of what the migration will involve and how much will break. Whilst self-hosting appeals, I don’t want the grief of doing my own admin if things go pear shaped.

I’m a heavy user of RStudio, and have posted a couple of Shiny apps. I can probably get by on the shinyapps.io free plan for a bit (10 apps) – just – but the step up to the basic plan at $39 a month is too steep.

I used to use Scraperwiki a lot, but have moved away from running any persistent scrapers for some time now. morph.io (which is essentially Scraperwiki classic) is currently free – though looks like a subscription will appear at some point – so I may try to get back into scraping in the background using that service. The Scraperwiki commercial plan is $9/month for 10 scrapers, $29 per month for 100. I have tended in the past to run very small scrapers, which means the number of scrapers can explode quickly, but $29/month is too much.

I also make use of github on a free/open plan, and while I don’t currently have any need for private repos, the entry level micro-plan ($7/month) offers 5. I guess I could use a (private?) github rather than Dropbox for feeding Leanpub, so this might make sense. Of course, I could just treat such a subscription as a regular donation.

It would be quite nice to have access to IPython notebooks online. The easiest solution to this is probably something like wakari.io, which comes in at $25/month, which again is a little bit steep for me at the moment.

In my head, I figure £5/$8/month is about one book per month, £10/$15 is two, £15/$20 is three, £25/$40 is 5. I figure I use these services and I’m making a small amount of pin money from things associated with that use. To help guarantee continuity in provision and maintenance of these services, I can use the first step of a bucket brigade style credit apportionment mechanism to redistribute some of the financial benefits these services have helped me realise.
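As a back-of-envelope check on those figures, the sketch below works out roughly how many book sales a month each service would absorb, and whether a month's sales would clear the Leanpub payout threshold. It assumes a royalty of about $7.50 per book (inferred from the "10 books a month gives about $75" estimate above), treats the payout window as a single month for simplicity, and uses the dollar prices quoted in this post; the function and variable names are mine.

```javascript
// Assumed royalty per book in dollars: 10 books a month gives about $75
var ROYALTY_PER_BOOK = 7.5;
// Leanpub pays out only when more than $40 is owed
var PAYOUT_THRESHOLD = 40;

// Monthly prices in dollars, as quoted in this post
var monthlyCosts = {
  dropbox: 8.71,
  githubMicro: 7,
  scraperwiki10: 9,
  wakari: 25,
  scraperwiki100: 29,
  shinyappsBasic: 39
};

// How many book sales a month does a given monthly cost represent?
function booksPerMonth(cost) {
  return cost / ROYALTY_PER_BOOK;
}

// Would a month in which booksSold copies sell clear the payout threshold?
function clearsPayout(booksSold) {
  return booksSold * ROYALTY_PER_BOOK > PAYOUT_THRESHOLD;
}

// Print the rough "books per month" equivalent of each service
Object.keys(monthlyCosts).forEach(function (service) {
  var books = booksPerMonth(monthlyCosts[service]).toFixed(1);
  console.log(service + ': ' + books + ' books/month');
});

console.log('5 books clears payout? ' + clearsPayout(5));   // 5 x 7.5 = 37.5
console.log('10 books clears payout? ' + clearsPayout(10)); // 10 x 7.5 = 75
```

On these assumptions, five sales a month falls just short of the $40 threshold, which is why that level of sales is hit or miss for a payout.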

Ideally, what I’d like to do is spend royalties from 1 book per service per month, perhaps even via sponsored links… (Hmm, there’s a thought – “support coupons” with minimum prices set at the level to cover the costs of running a particular service for one month, with batches of 12 coupons published per service per year… Transparent pricing, hypothecated to specific costs!)

Of course, I could also start looking at running my own services in the cloud, but the additional time cost of getting up and running, as well as hassle of administration, and the stress related to the fear of coping in the face of attack or things properly breaking, means I prefer managed online services where I use them.

Written by Tony Hirst

February 3, 2015 at 10:58 am

Posted in Anything you want

OpenRefine Style Reconciliation Containers

Over the weekend, I rediscovered Michael Bauer/@mihi_tr’s Reconcile CSV [code] service that builds an OpenRefine reconciliation service on top of a CSV file. One column in the CSV file contains a list of values that you want to reconcile (that is, fuzzy match) against, the other is a set of key identifier values associated with the matched against value.

Having already popped OpenRefine into a docker container, I thought I’d also explore dockerising Michael’s service: docker-reconciliation.

The default container uses a CSV file of UK MP names (current and previous) and returns their full title and an identifier used in the UK Parliament Members’ names data platform.

To run the service in boot2docker:

  • docker run -p 8002:8000 --name mprecon -d psychemedia/docker-reconciliation
  • boot2docker ip to get the IP address the service is running on, eg 192.168.59.103
  • Test the service in your browser: http://192.168.59.103:8002/reconcile?query=David Cameroon

In OpenRefine, set the reconciliation service URL to http://192.168.59.103:8002/reconcile.

NOTE: I had thought I should be able to fire up linked OpenRefine and ReconcileCSV containers and address more conveniently, for example:

docker run --name openrefiner -p 3335:3333 --link mprecon:mprecon -d psychemedia/openrefine

and then setting something like http://mprecon:8000/reconcile as the reconciliation service endpoint, but that didn’t seem to work? Instead I had to use the endpoint routed to host (http://192.168.59.103:8002/reconcile).

I also added some command line parameters to the script so that you can fire up the container and reconcile against your own CSV file:

docker run -p 8003:8000 -v /path/to/:/tmp/import -e RECONFILE=myfile.csv -e SEARCHCOL=mysearchcol -e IDCOL=myidcol --name recon_mycsv -d psychemedia/docker-reconciliation

This loads in the file on your host computer at /path/to/myfile.csv using the column named mysearchcol for the search/fuzzy match values and the column named myidcol for the identifiers.

It struck me that I could then commit this customised container as a docker image, and push it to dockerhub as a tagged image. Permissions mean I can’t push to the original trusted/managed repository that builds containers from my github repo, but I can create a new dockerhub repository containing tagged images. For example:

docker commit recon_mycsv psychemedia/docker-reconciler:recon_mycsv
docker push psychemedia/docker-reconciler:recon_mycsv

This means I can collect a whole range of reconciliation services, each independently tagged, at psychemedia/docker-reconciler – tags.

So for example:

  • docker run --name reconcile_ukmps -p 8008:8000 -d psychemedia/docker-reconciler:ukmps_pastpresent runs a reconciliation service against UK past and present MPs on port 8008;
  • docker run --name reconcile_westminster -p 8009:8000 -d psychemedia/docker-reconciler:westminster_constituency runs a reconciliation service against Westminster constituencies on port 8009.

In practice the current reconciliation service only seems to work well on small datasets, up to a few thousand lines, but nonetheless it can still be useful to be able to reconcile against such datasets. For larger files – such as the UK Companies House register, where we might use registered name for the search column and company number for the ID – it seems to take a while…! (For this latter example, a reconciliation service already exists at OpenCorporates.)

One problem with the approach I have taken is that the data file is mounted within the reconciliation server container. It would probably make more sense to have the Reconcile CSV container mount a data volume containing the CSV file, so that we can then upgrade the reconciliation server container once and then just link it to data containers. As it is, with the current model, we would have to rebuild each tagged image separately to update the reconciliation server they use.

Unfortunately, I don’t know of an easy way to package up data volume containers (an issue I’ve also come up against with database data stores). What I’d like to be able to do is have a simple “docker datahub” that I could push data volumes to, and then be able to say something like docker run -d --volumes-from psychemedia/reconciliation-data:westminster_constituency --name recon_constituencies psychemedia/reconciliation. Here, --volumes-from would look up data volume containers on something like registry.datahub.docker.com and psychemedia/reconciliation from registry.hub.docker.com.

So where’s all this going, and what next? First up, it would be useful to have an official Dockerfile that builds Michael’s Reconcile CSV server. (It would also be nice to see an example of a Python based reconciliation server – because I think I might be able to hack around with that! [UPDATE – there is one here that I forked here and dockerised here]) Secondly, I really need to find a way through the portable data volume container mess. Am I missing something obvious? Thirdly, the reconciliation server needs a bit of optimisation so it can work with larger files, a fast fuzzy match of some sort. (I also wonder whether a lite reconciliation wrapper for PostgreSQL would be useful that can leverage the PostgreSQL backend and fuzzy search plugin to publish a reconciliation service?)

And what’s the payoff? The ability to quickly fire up multiple reconciliation services against reference CSV documents.

Written by Tony Hirst

February 2, 2015 at 1:49 pm

Posted in OpenRefine, Tinkering

Defining Environment Variables Indirectly in bash

I spent a chunk of time this morning engaged in what ended up being something of a red herring, but learning was involved along the way, so here it is… how to set an environment variable indirectly in a bash shell.

Suppose I have a variable TAG=key and a variable VARVAL=thisval.

#Set key_val=$VARVAL
eval ${TAG}_val=\$VARVAL

#Export key_val=$VARVAL
#(eval has to come first so that the variable name is built before export runs)
eval export ${TAG}_val=\$VARVAL

Now suppose I want to test if $TAG exists, and further whether it is set to the same value as $CURRTAG. The ${TAG:+1} tests whether that TAG variable exists and that it is not empty. The -a is a logical AND.

if [ -n "${TAG:+1}" -a "$TAG" != "$CURRTAG" ]; then
    tmpf=${TAG}_val
    export VARVAL=${!tmpf}
    export CURRTAG=$TAG
fi

Erm, I think… I realised this wouldn’t actually be appropriate for the context I had in mind so never fully tested it…

Written by Tony Hirst

February 2, 2015 at 12:02 pm

Posted in Tinkering

Rediscovering Formula One Race Battlemaps

A couple of days ago, I posted a recipe on the F1DataJunkie blog that described how to calculate track position from laptime data.

Using that information, as well as additional derived columns such as the identity of, and time to, the cars immediately ahead of and behind a particular selected driver, both in terms of track position and race position, I revisited a chart type I first started exploring several years ago – race battle charts.

The main idea behind the battlemaps is that they can help us search for stories amidst the runners.

library(ggplot2)

#Build column names such as 'car_ahead' or 'code_behind'
dirattr=function(attr,dir='ahead') paste(attr,dir,sep='')

#We shall find it convenient later on to split out the initial data selection
battlemap_df_driverCode=function(driverCode){
  lapTimes[lapTimes['code']==driverCode,]
}

battlemap_core_chart=function(df,g,dir='ahead'){
  car_X=dirattr('car_',dir)
  code_X=dirattr('code_',dir)
  factor_X=paste('factor(position_',dir,'<position)',sep='')
  code_race_X=dirattr('code_race',dir)
  if (dir=='ahead') diff_X='diff' else diff_X='chasediff'
  
  if (dir=='ahead') drs=1000 else drs=-1000
  g=g+geom_hline(aes_string(yintercept=drs),linetype=5,col='grey')
  
  #Plot the offlap cars that aren't directly being raced
  g=g+geom_text(data=df[df[dirattr('code_',dir)]!=df[dirattr('code_race',dir)],],
                aes_string(x='lap',
                  y=car_X,
                  label=code_X,
                  col=factor_X),
              angle=45,size=2)
  #Plot the cars being raced directly
  g=g+geom_text(data=df,
                aes_string(x='lap',
                  y=diff_X,
                  label=code_race_X),
              angle=45,size=2)
  g=g+scale_color_discrete(labels=c('Behind','Ahead'))
  g+guides(col=guide_legend(title='Intervening car'))
  
}

battle_WEB=battlemap_df_driverCode('WEB')
g=battlemap_core_chart(battle_WEB,ggplot(),'ahead')
battlemap_core_chart(battle_WEB,g,dir='behind')

In this first sketch, from the 2012 Australian Grand Prix, I show the battlemap for Mark Webber:

battlemaps-unnamed-chunk-12-1

We see how at the start of the race Webber kept pace with Alonso, albeit around about a second behind, at the same time as he drew away from Massa. In the last third of the race, he was closely battling with Hamilton whilst drawing away from Alonso. Coloured labels are used to highlight cars on a different lap (either ahead (aqua) or behind (orange)) that are in a track position between the selected driver and the car one place ahead or behind in terms of race position (the black labels). The y-axis is the time delta in milliseconds between the selected car and cars ahead (y > 0) or behind (y < 0). A dashed line at the +/- one second mark identifies cars within DRS range.

As well as charting the battles in the vicinity of a particular driver, we can also chart the battle in the context of a particular race position. We can reuse the chart elements and simply need to redefine the filtered dataset we are charting.

For example, if we filter the dataset to just get the data for the car in third position at the end of each lap, we can then generate a battle map of this data.

battlemap_df_position=function(position){
  lapTimes[lapTimes['position']==position,]
}

battleForThird=battlemap_df_position(3)

g=battlemap_core_chart(battleForThird,ggplot(),dir='behind')+xlab(NULL)+theme_bw()
g=battlemap_core_chart(battleForThird,g,'ahead')+guides(col=FALSE)
g

battlemaps-postionbattles-1

For more details, see the original version of the battlemap chapter. For updates to the chapter, I recommend that you invest in a copy of the Wrangling F1 Data With R book if you haven’t already done so:-)

Written by Tony Hirst

January 31, 2015 at 8:52 pm

Posted in Rstats

Who Pays for Academic Publishing? Some Data Trails…

with 3 comments

A couple of days ago, I came across a dataset on figshare (a data sharing site) detailing the article processing charges (APCs) paid by the University of Portsmouth to publishers in 2014. After I casually (lazily…;-) remarked on the existence of this dataset via Twitter, Owen Stephens/@ostephens referred me to a JISC project that is looking at APCs in more detail, with prototype data explorer here: All APC demonstrator [Github repository].

The project looks as if it is part of Jisc Collections’ look at the Total Cost of Ownership in the context of academic publishing, summing things like journal subscription fees alongside “article processing charges” (which I’d hope include page charges?).

If you aren’t in academia, you may not realise that what used to be referred to as ‘vanity publishing’ (paying to get your first novel or poetry collection published) is part of the everyday practice of academic publishing. But it isn’t called that, obviously, because your work also has to be peer reviewed by other academics… So it’s different. It’s “quality publishing”.

Peer review is, in part, where academics take on the ownership of the quality aspects of academic publishing, so if the Total Cost of Ownership project is trying to be relevant to institutions and not just to JISC, I wonder if there should also be columns in the costing spreadsheet relating to the work time academics spend reviewing other people’s articles, editing journals, and so on. This is different to the presentational costs, obviously, because you can’t just write a paper and submit it, you have to submit it in an appropriately formatted document and “camera ready” layout, which can also add a significant amount of time to preparing a paper for publication. So you do the copy editing and layout too. And so any total costing to an academic institution of the research publishing racket should probably include this time too. But that’s by the by.

The data that underpins the demonstrator application was sourced from a variety of universities and submitted in spreadsheet form. A useful description (again via @ostephens) of the data model can be found here: APC Aggregation: Data Model and Analytical Usage. Looking at it, it just seems to cover APCs.

APC data relating to the project can be found on figshare. I haven’t poked around in the demonstrator code or watched its http traffic to see if there are API calls on to the aggregated data that provide another way in to it.

As well as page charges, there are charges associated with subscription fees to publishers. Publishers don’t like this information getting out on grounds of commercial sensitivity, and universities don’t like publishing it presumably on grounds of bringing themselves into disrepute (you spend how much?!), but there is some information out there. Data from a set of FOI requests about journal subscriptions (summarised here), for example. If you want to wade through some of the raw FOI responses yourself, have a look on WhatDoTheyKnow: FOI requests: “journal costs”.

Tim Gowers also wrote compellingly about his FOI escapades trying to track down journal subscription costs data: Elsevier journals – some facts.

Other possible sources include a search engine that allows you to rank journals by price per article or citation (data and information sources).

This is all very well, but is it in any way useful? I have no idea. One thing I imagined that might be quite amusing to explore was the extent to which journal subscriptions paid their way (or were “cost effective”). For example, looking at institutional logs, how often are (articles from) particular journals being accessed or downloaded either for teaching or research purposes? (Crudely: teaching – access comes from a student account; research – access from a research account.) On the other hand, for the research outputs of the institution, how many things are being published into a particular journal, and how many citations appear in those outputs to other publications?

Here, we take the line that use demonstrates value, and that use is captured as downloads, publications into, or references into. (That’s very crude, but then I’m approaching this as a possible recreational data exercise, not a piece of formal research. And yes – I know, journals are often bundled up in subscription packages together, and just like Sky blends dross with desirable channels in its subscription deals, I suspect academic publishers do too… But then, we could start to check these based on whether particular journals in a bundle are ever accessed, ever referenced, ever published into within a particular organisation, etc. Citation analysis can also help here – for example, if 5 journals all heavily cite each other, and one publisher publishes 3 of those, it could make sense for them to bundle two of the journals into one package and the third into another, so that if you’re researching topics that are reported by heavily linked articles across those journals, they can essentially force people researching that topic into subscribing to both packages. Without having a look at citation network analyses and subscription bundles, I can’t check that outlandish claim of course;-)

Erm… that’s it…

PS see also Evaluating big deal journal bundles (via @kpfssport)

PPS for a view from the publishers’ side on the very real costs associated with publishing, as well as a view on how academia and business treat employment costs and “real” costs in rather contrasting ways, see Time is Money: Why Scholarly Communication Can’t Be Free.

Written by Tony Hirst

January 30, 2015 at 2:50 pm

Adding Metadata to Google Docs

A couple of months ago I had started working on an export tool that would export a Google doc in the OU-XML format. The rationale? The first couple of drafts of the teaching material that will be delivered through the VLE in the forthcoming (October, 2015) OU Data management and analysis course (TM351) have been prepared in Google docs, and the production process will soon have to move to the Open University’s XML workflow. This workflow is built around an OU defined schema, often referred to as OU-XML (or OUXML), and is supported by a couple of oXygen XML editor extensions that make it easy to preview rendered versions of the documents in a test VLE site.

The schema itself includes several elements that are more akin to metadata elements than actual content – things like the course code, course title, for example, or the byline (or lead author) of a particular unit.

Support for a small amount of metadata is provided by Google Drive, but the only easily customisable element is a free text description element.

gdocsMetadata

So whilst patching a couple of “issues” today with the Google Docs to OU-XML generator, and adding a menu option that allows users to create a zip file in Google Drive that contains the OU-XML and any associated image files for a particular Google doc, I thought it might also be handy to add some support for additional metadata elements. Google Drive apps support a Properties class that allows metadata properties represented as key-value pairs to be associated with a particular document, user or script. Google Apps Script can be used to set and retrieve these properties. In addition, Google Apps Script can be used to generate templated HTML user interface forms that can be used to extend Google docs or spreadsheets functionality.

In particular, I created a handful of Google Apps Script functions to pop up a templated panel, save metadata descriptions entered into the metadata form as document properties, and retrieve the value of a particular metadata element.

//Pop up the metadata edit/display panel
//The document is created as a templated HTML document
function metadataView() {
  // Generate the HTML
  var html = HtmlService
      .createTemplateFromFile('metadata')
      .evaluate()
      .setSandboxMode(HtmlService.SandboxMode.IFRAME);
  //Pop up a panel and render the HTML describing the metadata form inside it
  DocumentApp.getUi().showModalDialog(html, 'Metadata');
}

//This function sets the document properties from the metadata form elements
function processMetadataForm(theForm) {
  var props=PropertiesService.getDocumentProperties()
  //Process each form element (atm, they are just input text elements)
  for (var item in theForm) {
    props.setProperty(item,theForm[item])
    Logger.log(item+':::'+theForm[item]);
  }
}

The templated HTML form is configured using a set of desired metadata elements. Each element is described using a label that is displayed in the form, an attribute name (which should be a single word) and an optional default value. The template also demonstrates how we can call a server side Apps Script function from the dialogue using the google.script.run.FUNCTION_NAME construction.

<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
  
<? 
//Add metadata fields here in the following format:
//[Label, a unique identifier (unique word, no spaces or punctuation), an optional default value]
var metadataItems =[
    ["Lead Author","leadAuthor"],
    ["Course Code","courseCode"],
    ["Course Title","courseTitle"],
    ["Unit Title","unitTitle"],
    ["Rendering","rendering","VLE2 staff (learn3)"]
]
?>
  
<? var metadata = PropertiesService.getDocumentProperties() ?>
<script>
//When the metadata has been successfully saved as document properties
//  close the metadata form panel
function onSave() {google.script.host.close()}
</script>
  
<form id='metadataForm'>
<!-- Construct a set of form elements, one for each metadata item -->
<? for (var i = 0; i < metadataItems.length; i++) { ?>
  <div><?= metadataItems[i][0] ?>: 
    <input type="text"
      name = "<?= metadataItems[i][1] ?>"
      <? val=''
        if (metadataItems[i].length>2) val= metadataItems[i][2]  ?>
      value= "<?= metadata.getProperty(metadataItems[i][1]) ? metadata.getProperty(metadataItems[i][1])  : val  ?>"
    /> 
  </div>
<? } ?>
    
</form>
  
<div>
  <input
    type="button"
    value="Save & Close"
    onclick="google.script.run.withSuccessHandler(onSave).processMetadataForm(document.getElementById('metadataForm'))"
  />
  
  <input
    type="button"
    value="Cancel"
    onclick="google.script.host.close()"
  />
</div>

When the metadataView() function is called from the Add-Ons menu, it pops a dialogue that looks (in unstyled form) something like this:

googleDocMetadata

Metadata elements are loaded in to the form if they exist or a default value is specified.

When generating the export OU-XML, a helper function grabs the value of the relevant metadata element from the document properties. This value can then be inserted into the OU-XML at the appropriate point.

//A helper function to display a particular metadata element
//This function is called from the metadata form
function getProp(key) {
  var props= PropertiesService.getDocumentProperties()
  return props.getProperty(key) ? props.getProperty(key) : '';
}

var COURSECODE= getProp('courseCode');

One issue with this approach is that if we have lots of documents relating to different units for the same course, we may need to enter the same values for several metadata elements across each document (for example, the course code and course title). Unfortunately, Google Drive does not support arbitrary properties for folders. One solution, suggested by Tom Smith/@everythingabili was to use the description element for a folder to store JSON represented metadata. I think we could actually simplify that, using a line based representation or a simple delimited representation that we can easily split on, something like:

courseCode :: TM351;;
courseTitle:: Data Management and Analysis

for example. We could then split on ;; to get each pair, strip whitespace, split on :: and strip whitespace again to get the key:value elements for each metadata item.
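The splitting recipe just described can be sketched as a small function. This is a minimal sketch in plain JavaScript, which should also run as Google Apps Script; the name parseFolderMetadata is my own invention, and it assumes that keys and values never themselves contain ;; and that keys never contain ::.

```javascript
// Parse a folder description of the form
//   key :: value;; key :: value
// into a plain object. parseFolderMetadata is a hypothetical helper name.
function parseFolderMetadata(description) {
  var metadata = {};
  // Split on ;; to get each key/value pair
  var pairs = String(description || '').split(';;');
  for (var i = 0; i < pairs.length; i++) {
    var pair = pairs[i].trim();
    if (!pair) continue; // skip empty chunks, e.g. after a trailing ;;
    // Split on the first :: only, so a value may itself contain ::
    var idx = pair.indexOf('::');
    if (idx < 0) continue; // not a key :: value pair, so ignore it
    var key = pair.slice(0, idx).trim();
    var value = pair.slice(idx + 2).trim();
    if (key) metadata[key] = value;
  }
  return metadata;
}

// The example description from the text, split across two lines
var parsedExample = parseFolderMetadata(
  'courseCode :: TM351;;\ncourseTitle:: Data Management and Analysis'
);
// parsedExample.courseCode is 'TM351'
// parsedExample.courseTitle is 'Data Management and Analysis'
```

In an Apps Script setting the description string would presumably come from the folder's description field, with the parsed values then used wherever the document properties are currently read.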

gdocsfoldermetadata

I guess one way of getting the folder description given a particular document as a starting point is to find the parent folder (using file#getParents() perhaps?) and then call folder#getDescription()?

Another approach might be to have a dummy, canonically named file in each folder (metadata for example), that we add metadata to, and then whenever we open a new file in the folder we look for the metadata file, get its metadata property values, and use those to seed the metadata values for our content document.
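The seeding step in that approach might look something like the sketch below: start from the folder-level values and let anything already set on the document win. It works on plain key/value objects, so it makes no assumptions about how the metadata file is found or read; seedMetadata is a hypothetical helper name and 'Unit 1' is an invented example value.

```javascript
// Fill in any metadata values missing from a document using folder-level
// defaults. Both arguments are plain key/value objects.
// seedMetadata is a hypothetical helper name.
function seedMetadata(docMetadata, folderMetadata) {
  var seeded = {};
  var key;
  // Start from the folder-level values...
  for (key in folderMetadata) {
    if (Object.prototype.hasOwnProperty.call(folderMetadata, key)) {
      seeded[key] = folderMetadata[key];
    }
  }
  // ...then let any non-empty document-level values override them
  for (key in docMetadata) {
    if (Object.prototype.hasOwnProperty.call(docMetadata, key) &&
        docMetadata[key] !== '' && docMetadata[key] != null) {
      seeded[key] = docMetadata[key];
    }
  }
  return seeded;
}

// 'Unit 1' is an invented value for illustration
var seededExample = seedMetadata(
  { unitTitle: 'Unit 1', courseCode: '' },
  { courseCode: 'TM351', courseTitle: 'Data Management and Analysis' }
);
// seededExample.courseCode is 'TM351' (taken from the folder-level values)
// seededExample.unitTitle is 'Unit 1' (kept from the document)
```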

Finally, it’s maybe worth also pondering the issue of generating the OU-XML export for all the documents within a given folder? One way to do this might be to create a function off each document that will find the parent folder, find all the files (except, perhaps, a metadata file?!) in that folder, and then run the OU-XML generator over all of them, bundling them up into a single zip file, perhaps with a directory structure that puts the OU-XML for each document, along with any image files associated with it, into separate folders?

Only it probably isn’t… I suspect that the migration to the OU-XML format, if it hasn’t already happened, will involve copying and pasting…

PS for completeness, the menu option can be installed as follows:

function onOpen(e) {
  DocumentApp.getUi().createAddonMenu()
      .addItem('Metadata','metadataView')
      .addToUi();
}

Written by Tony Hirst

January 28, 2015 at 10:23 pm

Posted in Tinkering

Data Analysis Packages…?

with one comment

Chasing the thought of Frictionless Data Analysis – Trying to Clarify My Thoughts, I wonder: how about if, in addition to the datapackage.json specification, there was a data analysis package or data analysis toolkit package specification? Perhaps the latter might be something that unpacks rather like the fig.yml file described in Using Docker to Build Linked Container Course VMs, and the former a combination of a datapackage and a data analysis toolkit package, that downloads a datapackage and opens it into a toolkit configuration specified by data analysis toolkit package. We’d perhaps also want to be able to define a set of data analysis scripts (data analysis script package???) relevant to working with a particular datapackage in the specified tools (for example, some baseline IPython notebooks or R/Rmd scripts?)

Written by Tony Hirst

January 28, 2015 at 12:21 am

Posted in Thinkses
