Converting between VM Formats

Trying to get our current TM351 VirtualBox virtual machine into a raw format that we can run on OpenStack… how many times did I go round the houses failing to discover that VBoxManage has a conversion tool (VBoxManage clonemedium) taking something like the form:

VBoxManage clonemedium ~/VirtualBox\ VMs/tm351_18J-student/box-disk001.vmdk tm351_18J-student.raw --format RAW

Export / conversion / clone formats include: VDI, VMDK, VHD, RAW. There’s also “other”, but I’m not sure what that entails.
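If the conversion needs scripting, a minimal Python sketch along the following lines might do, assuming VirtualBox is installed and VBoxManage is on the PATH. The helper names (build_clonemedium_command, convert) are made up for illustration, and only the four formats listed above are accepted because I don't know what "other" covers.

```python
import os
import subprocess

# Formats named in the post; "other" is left out because it is unclear what it covers.
KNOWN_FORMATS = {"VDI", "VMDK", "VHD", "RAW"}


def build_clonemedium_command(source, target, fmt="RAW"):
    """Return the argument list for a VBoxManage clonemedium call (hypothetical helper)."""
    fmt = fmt.upper()
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unexpected format: {fmt}")
    # expanduser() resolves a leading ~; passing a list means spaces need no escaping.
    return [
        "VBoxManage", "clonemedium",
        os.path.expanduser(source),
        os.path.expanduser(target),
        "--format", fmt,
    ]


def convert(source, target, fmt="RAW"):
    """Run the conversion; needs VirtualBox installed, so it is not called here."""
    subprocess.run(build_clonemedium_command(source, target, fmt), check=True)


cmd = build_clonemedium_command(
    "~/VirtualBox VMs/tm351_18J-student/box-disk001.vmdk",
    "tm351_18J-student.raw",
)
print(cmd)
```

Building the argument list separately from running it means the command can be inspected (or logged) before anything touches a disk image.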

This might also be handy: Convert VDI (VirtualBox) to raw, qcow2, qed, vmdk, vhd in Windows.

The Library as the Natural Home for Emerging Technology…

I was back in the Library today after waaaay too long away to give a staff development session on things related to virtual machines, docker, “digital application shelves”, Jupyter notebooks and reproducible educational materials. (We also tried a bit of consensus humming… :-)

In conversation afterwards, we briefly chatted about the Library as being a possible home for providing such services, then over coffee with Richard Nurse riffed fleetingly on the idea of a Digital Skills Lab, which is a phrase that has been sitting with me all day since…

In my second public outing of the day, a conversation with Stephen Downes for his e-learning 3.0 MOOC, I riffed on Docker and notebooks again, and whilst chatting after that event riffed casually on the notion of using Docker as a means of delivering personal productivity apps / information tools, and why the Library, rather than IT, might be better suited to supporting such an offering…

Between the two, I tried to hijack a server we’ve acquired to explore some infrastructure experiments to support delivery of Institute of Coding activities that the OU has a work package to deliver and give it to the Library… Here are some quick thoughts relating to a possible case for the defence I may have to make tomorrow…

  • to explore useful infrastructure offerings for supporting coding related education, we need to consider: 1) the environments that are user (learner and/or researcher) facing; 2) the architecture that lets us scale that offering;
  • the original server was supposed to satisfy both needs, but the lack of resource to develop the scaling infrastructure part was blocking the end-user development work;
  • grabbing a server, situating it in the Library, and calling it a Digital Skills Lab development server makes a statement about the sorts of things we might want to use it for. Specifically:
      • utility of running experimental Jupyter notebook servers so people can start to explore their own use of notebooks, notebook environment customisations using extensions, etc;
      • utility of running a local lab docker hub “digital application shelf” and docker machine to let folk check out and run pre-built “digital applications” (i.e. prebuilt Docker containers) taken off the shelf;
      • utility of running a local Binderhub to let folk explore building their own pre-configured computational environment and distributing it as a live environment with notebook content that exploits that environment;
  • developing a lab mentality as a space / server where folk can try stuff out, and bring queries and requests as well as volunteering their own ideas;
  • situating it in the Library means it’s not a STEM computing thing: it’s accessible to all faculties;
  • more specifically, taking it out of the Computing Department and STEM Faculty makes a statement that we’re trying to offer computation stuff to people in general, not provide a computing environment for computing people per se; that is, we can explore, and maybe even help develop, a different set of expectations and use models for “code” – not necessarily writing big programs, but perhaps just finding the single-line-of-code-at-a-time that helps you complete a particular task.

Anyway – that was today… we’ll have to see what tomorrow’s email returns bring to see how much trouble I’m in!

PS By the by, waiting for a boat home, a most enjoyable piece by Tim Harford appeared in my streams: Why big companies squander brilliant ideas. Heh, heh… ;-)

“Tracking Jupyter” Newsletter

The pace of change associated with the Jupyter ecosystem, the variety of notebook examples published daily across a wide range of disciplines and domains, the increasing use of notebooks in industry, the creativity of extension writers, the range of hosted solutions and hosting providers, let alone the technical and engineering issues associated with designing and deploying Jupyter environments, mean it can be hard to keep up…

…so with the Tracking Jupyter newsletter, I’ll try to produce an ongoing round up of Jupyter related news and announcements that I’ve managed to spot over the previous week or two…

…ish.

Topics are likely to include:

    • official Jupyter announcements and releases;
    • Jupyter in education;
    • Jupyter in research;
    • Jupyter in industry;
    • new kernels and widget walkthroughs;
    • interesting notebooks and use cases (for example, notebooks behind news stories, notebooks demonstrating work in a particular topic area from computational sciences to digital humanities);
    • hosted solutions;
    • hosting and infrastructure (technical / engineering) solutions;
    • jobs.

Contributions / suggestions for news items are welcome (email: tony.hirst@open.ac.uk).

To get a feeling for what the newsletter might include, the first issue is available here: Tracking Jupyter: Newsletter, the First…

Subscribe here: Tracking Jupyter signup.

(In)Distinguishable from Magic…

A classic physics experiment showing a magical physical world effect – the inverted water cup…

With a little bit of science/physics knowledge, nothing is hidden and the effect is explainable (how it works). No tricks, in other words. The trick is not only self-working, it’s also transparent. Scientific knowledge is the key to the secret.

But are the safety glasses really necessary? Really?

Here’s the same trick, as magic:

Gimmicks…

The same physics are at work but there’s a hidden element.

There’s also a risk here that people think there is a physics explanation for the trick (surface tension of water, for example) and the magic leaves them with a misplaced confidence or understanding of the physics…

(Penn and Teller riff on this by showing how a trick is done, breaking the secret, then rerunning the trick – with the same overall effect – but in a way that doesn’t use the secret, thus reinstilling the magic for people who think they know the secret.)

When Arthur C. Clarke wrote “Any sufficiently advanced technology is indistinguishable from magic”, which sort of magic was he referring to? The application of gimmicks, the application of trickery? Or the application of mechanisms that are transparent?

You’ve Been Shared… And Your DNA Is Likely Out There…

Via Bruce Schneier (How DNA Databases Violate Everyone’s Privacy), a paper in Science by Ehrlich et al. (Identity inference of genomic data using long-range familial searches) and a related news article (Genome hackers show no-one’s DNA is anonymous anymore) showing how your DNA is likely out there thanks to others sharing related DNA on… From the paper abstract:

Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement authorities have exploited some of these databases to identify suspects via distant familial relatives. Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power of this technique. We project that about 60% of the searches for individuals of European-descent will result in a third cousin or closer match, which can allow their identification using demographic identifiers.

Reminds me of a BBC Radio 4 play I caught a fragment of a week or so ago: a character was identified through his DNA by police, not because his DNA was on record, but because that of his son was. DNA + the laws of genetics means that relationships can also be inferred.

From the news article, another paper, this time by Kim et al. (Statistical Detection of Relatives Typed with Disjoint Forensic and Biomedical Loci).

But first, to set the scene, an earlier paper referenced from that one by Edge et al [Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets]:

With the increasing abundance of genetic data, the usefulness of a genetic dataset now depends in part on the possibility of productively linking it with other datasets. … Such efforts magnify the value of genetic datasets without requiring coordinated genotyping.

One issue that arises in combining multiple datasets is the record-matching problem: the identification of dataset entries that, although labeled differently in different datasets, represent the same underlying entity (67). In a genetic context, record matching involves the identification of the same individual genome across multiple datasets when unique identifiers, such as participant names, are unavailable. This task is relatively simple when large numbers of SNPs are shared between marker sets: if records from different datasets match at enough of the shared SNPs, then they can be taken to represent the same individual.

What if no markers are shared between two genetic datasets? Can genotype records that rely on disjoint sets of markers be linked? Genetic record matching with no overlapping markers has many potential uses. Datasets could become cross-searchable even if no effort has been made to include shared markers in different marker sets. Record matching between new and old marker sets could determine whether an individual typed with a new set has appeared in earlier data, thereby facilitating deployment of new marker sets that are backward-compatible with past sets.

The presence of linkage disequilibrium (LD)—nonindependence of genotypes at distinct markers, primarily those that are proximate on the genome—can enable record matching without shared markers. As a result of LD between markers in different datasets, certain genotype pairs are more likely to co-occur, so that some potential record pairings are more likely than others.
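To make the "relatively simple" case from the quote concrete (records from two datasets that do share markers), here's a toy Python sketch. Everything in it is made up for illustration: the marker labels, the genotypes, the record names and the 0.9 threshold. Real record matching has to cope with genotyping error and allele frequencies, and the disjoint-marker, LD-based approach the paper is actually about is not attempted here.

```python
def match_fraction(record_a, record_b):
    """Fraction of shared markers at which two genotype records agree."""
    shared = record_a.keys() & record_b.keys()
    if not shared:
        return None  # disjoint marker sets: this naive approach has nothing to compare
    agree = sum(record_a[marker] == record_b[marker] for marker in shared)
    return agree / len(shared)


def best_match(record, candidates, threshold=0.9):
    """Name of the candidate that agrees best, if it clears the (arbitrary) threshold."""
    scores = {name: match_fraction(record, other) for name, other in candidates.items()}
    scores = {name: s for name, s in scores.items() if s is not None and s >= threshold}
    return max(scores, key=scores.get) if scores else None


# Hypothetical records: the same person labelled differently in two datasets.
query = {"m1": "AA", "m2": "AG", "m3": "GG", "m4": "CT"}
other_dataset = {
    "sample_x": {"m2": "AG", "m3": "GG", "m4": "CT", "m5": "TT"},
    "sample_y": {"m2": "GG", "m3": "AG", "m4": "CT", "m5": "CC"},
    "sample_z": {"m8": "AA", "m9": "CC"},
}

print(best_match(query, other_dataset))
```

Note that sample_z shares no markers with the query, so this sort of comparison simply can't say anything about it; that's the gap the LD-based method is meant to fill.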

Now back to the Kim et al paper:

Forensic DNA testing sometimes seeks to identify unknown individuals through familial searching, or relatedness profiling. When no exact match of a query DNA profile to a database of profiles is found, investigators can potentially test for a partial match to determine whether the query profile might instead represent a close relative of a person whose profile appears in the database (Bieber et al., 2006; Gershaw et al., 2011; Butler, 2012). A positive test leads investigators to consider relatives of the person with the partial match as possible contributors of the query profile.

Familial searching expands the potential to identify unknown contributors beyond the level achieved when searching exclusively for exact database matches. The larger set of people accessible to investigators—database entrants, plus their relatives—can increase the probability that the true contributor of a query profile is identified (Bieber et al., 2006; Curran and Buckleton, 2008). However, the accessibility of relatives to investigators in database queries raises privacy and legal policy concerns, as considerations guiding appropriate inclusion of DNA profiles in databases and subsequent use of those profiles generally focus on the contributors of the profiles rather than on close relatives who are rendered accessible to investigators (Greely et al., 2006; Murphy, 2010). Concerns about privacy vary in magnitude across populations, as false-positive identifications of relatives might be substantially more likely to affect members of populations with lower genetic diversity, and hence a greater likelihood of chance partial matches (Rohlfs et al., 2012, 2013), or members of populations overrepresented in DNA databases (Greely et al., 2006; Chow-White and Duster, 2011).

…[Previously (see above…), w]e showed that records could be matched between databases with no overlapping genetic markers, provided that sufficiently strong linkage disequilibrium (LD) exists between markers appearing in the two databases (Edge et al., 2017). … [The approach] also uncovers privacy concerns, as an individual present in a SNP [single-nucleotide polymorphism] database —collected in a biomedical, genealogical, or personal genomics setting, for example — might be possible to link to a CODIS [Combined DNA Index System] profile, and vice versa, in a manner not intended in the context of either database examined in isolation. First, a SNP database entrant could become accessible to forensic investigation. Second, although in the United States, the use of forensic genetic markers given protections against unreasonable searches is based partly on a premise that these markers provide only the capacity for identification and do not expose phenotypic information (Greely and Kaye, 2013; Katsanis and Wagner, 2013; United States Supreme Court, 2013), phenotypes that are possible to predict from a SNP profile could potentially be predicted from a CODIS profile by connecting the CODIS profile to a SNP profile and then predicting phenotypes from the SNPs. Does cross-database record matching extend to relatives? In other words, is it possible to identify a genotype record with one set of genetic markers as originating from a relative of the contributor of a genotype record obtained with a distinct, nonoverlapping set of markers? If so, then new marker systems in the forensic context could permit relatedness profiling in a manner that is compatible with existing marker systems, as a profile from a new SNP or DNA sequence system could be tested for relationship matches to existing microsatellite profiles. 
However, a substantial privacy concern would also be raised, as inclusion in a biomedical, genealogical, or personal genomics dataset could expose relatives of the participant to forensic investigation; moreover, phenotypes of a relative could potentially be identifiable from a forensic profile.

[The result?]

We have found that not only can STR and SNP records be identified as belonging to the same individual, in many cases, STR and SNP profiles can be identified as belonging to close relatives—even though the profiles have no markers shared in common.

The possibility of performing familial searching of forensic profiles in SNP databases, while raising new concerns, also alters an existing concern, namely the unequal representation of populations in forensic databases. In profile queries to search for a relative already in a forensic database, populations overrepresented in databases owing to overrepresentation in criminal justice systems are likely to produce more identifications, potentially contributing to further overrepresentation (Greely et al., 2006; Chow-White and Duster, 2011; Rohlfs et al., 2013). Record-matching queries to biomedical, genealogical, or personal-genomic databases, however, will instead produce more identifications in different populations emphasized in genome-wide association and personal genomics (Chow-White and Duster, 2011; Popejoy and Fullerton, 2016; Landry et al., 2017).

Have You Been Shared?

And that’s part of the problem with relationships in an information society: networks can be described as mathematical objects known as graphs, where things (nodes) are connected by edges. So even if you don’t share information about your own edges, as soon as anyone else shares edges of theirs that include a link to you, you’ve been shared.
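A toy sketch of the idea in Python, using made-up names and a hypothetical build_graph helper: two people publish their own contact lists, and the person labelled "you" shares nothing at all.

```python
from collections import defaultdict

# Hypothetical data: only alice and carol have shared their edges.
shared_edges = {
    "alice": ["bob", "you"],
    "carol": ["you", "dave"],
}


def build_graph(edge_lists):
    """Build an undirected graph from the edge lists people chose to share."""
    graph = defaultdict(set)
    for sharer, contacts in edge_lists.items():
        for contact in contacts:
            # An edge is visible from both ends, whoever published it.
            graph[sharer].add(contact)
            graph[contact].add(sharer)
    return graph


graph = build_graph(shared_edges)

# "you" never shared anything, yet appears as a node with known neighbours.
print(sorted(graph["you"]))  # ['alice', 'carol']
```

Because each edge has two ends, one person's disclosure is enough to put both people in the graph.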

Related: Sharing Goes Both Ways – No Secrets Social and Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There etc.

PS on making connections: two people (two nodes) in the same photograph (a shared location) defines a connection / edge of “in the same place at the same time” between the people / nodes. Graph feedstocks are everywhere…
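As a rough illustration of how a pile of photos turns into a graph, here's a Python sketch that assumes we already have, for each photo, the set of people identified in it (the names and the co_occurrence_edges helper are hypothetical). Every pair of people in the same photo gets an edge, weighted by how often they appear together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of people identified in each photo.
photos = [
    {"ann", "ben"},
    {"ann", "ben", "cat"},
    {"cat", "dan"},
]


def co_occurrence_edges(photos):
    """Count how many photos each pair of people appears in together."""
    edges = Counter()
    for people in photos:
        # Sorting gives each pair a single canonical (a, b) ordering.
        for pair in combinations(sorted(people), 2):
            edges[pair] += 1
    return edges


edges = co_occurrence_edges(photos)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```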