(In)Distinguishable from Magic…

A classic physics experiment showing a magical physical world effect – the inverted water cup…

With a little bit of science/physics knowledge, nothing is hidden and the effect is explainable (how it works). No tricks, in other words. The trick is not only self-working, it’s also transparent. Scientific knowledge is the key to the secret.

But are the safety glasses really necessary? Really?

Here’s the same trick, as magic:

Gimmicks…

The same physics are at work but there’s a hidden element.

There’s also a risk here that people think there is a physics explanation for the trick (surface tension of water, for example) and the magic leaves them with a misplaced confidence in, or understanding of, the physics…

(Penn and Teller riff on this by showing how a trick is done, breaking the secret, then rerunning the trick – with the same overall effect – but in a way that doesn’t use the secret, thus reinstilling the magic for people who think they know the secret.)

When Arthur C. Clarke wrote “Any sufficiently advanced technology is indistinguishable from magic”, which sort of magic was he referring to? The application of gimmicks, the application of trickery? Or the application of mechanisms that are transparent?

You’ve Been Shared… And Your DNA Is Likely Out There…

Via Bruce Schneier (How DNA Databases Violate Everyone’s Privacy), a paper in Science by Ehrlich et al. (Identity inference of genomic data using long-range familial searches) and a related news article (Genome hackers show no-one’s DNA is anonymous anymore) showing how your DNA is likely out there thanks to others sharing related DNA on… From the paper abstract:

Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement authorities have exploited some of these databases to identify suspects via distant familial relatives. Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power of this technique. We project that about 60% of the searches for individuals of European-descent will result in a third cousin or closer match, which can allow their identification using demographic identifiers.

Reminds me of a BBC Radio 4 play I caught a fragment of a week or so ago: a character was identified through his DNA by police, not because his DNA was on record, but because that of his son was. DNA + the laws of genetics means that relationships can also be inferred.

From the news article, another paper, this time by Kim et al. (Statistical Detection of Relatives Typed with Disjoint Forensic and Biomedical Loci).

But first, to set the scene, an earlier paper referenced from that one by Edge et al [Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets]:

With the increasing abundance of genetic data, the usefulness of a genetic dataset now depends in part on the possibility of productively linking it with other datasets. … Such efforts magnify the value of genetic datasets without requiring coordinated genotyping.

One issue that arises in combining multiple datasets is the record-matching problem: the identification of dataset entries that, although labeled differently in different datasets, represent the same underlying entity (67). In a genetic context, record matching involves the identification of the same individual genome across multiple datasets when unique identifiers, such as participant names, are unavailable. This task is relatively simple when large numbers of SNPs are shared between marker sets: if records from different datasets match at enough of the shared SNPs, then they can be taken to represent the same individual.

What if no markers are shared between two genetic datasets? Can genotype records that rely on disjoint sets of markers be linked? Genetic record matching with no overlapping markers has many potential uses. Datasets could become cross-searchable even if no effort has been made to include shared markers in different marker sets. Record matching between new and old marker sets could determine whether an individual typed with a new set has appeared in earlier data, thereby facilitating deployment of new marker sets that are backward-compatible with past sets.

The presence of linkage disequilibrium (LD)—nonindependence of genotypes at distinct markers, primarily those that are proximate on the genome—can enable record matching without shared markers. As a result of LD between markers in different datasets, certain genotype pairs are more likely to co-occur, so that some potential record pairings are more likely than others.
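
As a rough illustration of what LD measures (a toy sketch, not the record-matching method used by Edge et al.): for two biallelic markers, the textbook coefficient is D = p_AB - p_A * p_B, and r² rescales it to the range 0 to 1. The haplotype counts below are invented purely for illustration, and the function name is my own.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic markers.

    `haplotypes` is a list of (allele_at_marker_1, allele_at_marker_2)
    pairs, with alleles coded as 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at marker 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at marker 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # zero when the markers are independent
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Invented counts: in `linked`, the two alleles mostly travel together.
linked = [(1, 1)] * 45 + [(0, 0)] * 45 + [(1, 0)] * 5 + [(0, 1)] * 5
independent = [(1, 1)] * 25 + [(0, 0)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25

d_linked, r2_linked = linkage_disequilibrium(linked)
d_indep, r2_indep = linkage_disequilibrium(independent)
print(d_linked, r2_linked)
print(d_indep, r2_indep)
```

When r² is high, knowing the genotype at one marker tells you a lot about the other, which is the property that makes some record pairings more likely than others.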

Now back to the Kim et al paper:

Forensic DNA testing sometimes seeks to identify unknown individuals through familial searching, or relatedness profiling. When no exact match of a query DNA profile to a database of profiles is found, investigators can potentially test for a partial match to determine whether the query profile might instead represent a close relative of a person whose profile appears in the database (Bieber et al., 2006; Gershaw et al., 2011; Butler, 2012). A positive test leads investigators to consider relatives of the person with the partial match as possible contributors of the query profile.

Familial searching expands the potential to identify unknown contributors beyond the level achieved when searching exclusively for exact database matches. The larger set of people accessible to investigators—database entrants, plus their relatives—can increase the probability that the true contributor of a query profile is identified (Bieber et al., 2006; Curran and Buckleton, 2008). However, the accessibility of relatives to investigators in database queries raises privacy and legal policy concerns, as considerations guiding appropriate inclusion of DNA profiles in databases and subsequent use of those profiles generally focus on the contributors of the profiles rather than on close relatives who are rendered accessible to investigators (Greely et al., 2006; Murphy, 2010). Concerns about privacy vary in magnitude across populations, as false-positive identifications of relatives might be substantially more likely to affect members of populations with lower genetic diversity, and hence a greater likelihood of chance partial matches (Rohlfs et al., 2012, 2013), or members of populations overrepresented in DNA databases (Greely et al., 2006; Chow-White and Duster, 2011).

…[Previously (see above…), w]e showed that records could be matched between databases with no overlapping genetic markers, provided that sufficiently strong linkage disequilibrium (LD) exists between markers appearing in the two databases (Edge et al., 2017). … [The approach] also uncovers privacy concerns, as an individual present in a SNP [single-nucleotide polymorphism] database —collected in a biomedical, genealogical, or personal genomics setting, for example — might be possible to link to a CODIS [Combined DNA Index System] profile, and vice versa, in a manner not intended in the context of either database examined in isolation. First, a SNP database entrant could become accessible to forensic investigation. Second, although in the United States, the use of forensic genetic markers given protections against unreasonable searches is based partly on a premise that these markers provide only the capacity for identification and do not expose phenotypic information (Greely and Kaye, 2013; Katsanis and Wagner, 2013; United States Supreme Court, 2013), phenotypes that are possible to predict from a SNP profile could potentially be predicted from a CODIS profile by connecting the CODIS profile to a SNP profile and then predicting phenotypes from the SNPs. Does cross-database record matching extend to relatives? In other words, is it possible to identify a genotype record with one set of genetic markers as originating from a relative of the contributor of a genotype record obtained with a distinct, nonoverlapping set of markers? If so, then new marker systems in the forensic context could permit relatedness profiling in a manner that is compatible with existing marker systems, as a profile from a new SNP or DNA sequence system could be tested for relationship matches to existing microsatellite profiles. 
However, a substantial privacy concern would also be raised, as inclusion in a biomedical, genealogical, or personal genomics dataset could expose relatives of the participant to forensic investigation; moreover, phenotypes of a relative could potentially be identifiable from a forensic profile.

[The result?]

We have found that not only can STR and SNP records be identified as belonging to the same individual, in many cases, STR and SNP profiles can be identified as belonging to close relatives—even though the profiles have no markers shared in common.

The possibility of performing familial searching of forensic profiles in SNP databases, while raising new concerns, also alters an existing concern, namely the unequal representation of populations in forensic databases. In profile queries to search for a relative already in a forensic database, populations overrepresented in databases owing to overrepresentation in criminal justice systems are likely to produce more identifications, potentially contributing to further overrepresentation (Greely et al., 2006; Chow-White and Duster, 2011; Rohlfs et al., 2013). Record-matching queries to biomedical, genealogical, or personal-genomic databases, however, will instead produce more identifications in different populations emphasized in genome-wide association and personal genomics (Chow-White and Duster, 2011; Popejoy and Fullerton, 2016; Landry et al., 2017).

Have You Been Shared?

And that’s part of the problem with relationships in an information society: networks can be described by mathematical objects known as graphs, where things (nodes) are connected by edges. So even if you don’t share information about your own edges, anyone who shares edges that include a link to you means you’ve been shared.
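
A minimal sketch of that idea, assuming a made-up set of contact lists (the names and the `build_graph` helper are hypothetical): “you” share nothing, but still end up in the graph with edges attached.

```python
from collections import defaultdict


def build_graph(shared_contacts):
    """Build an undirected graph from the contact lists people chose to share."""
    graph = defaultdict(set)
    for sharer, contacts in shared_contacts.items():
        for contact in contacts:
            graph[sharer].add(contact)
            graph[contact].add(sharer)  # the edge exists whether or not `contact` shared it
    return graph


# Hypothetical data: "you" never uploaded a contact list.
shared_contacts = {
    "alice": ["bob", "you"],
    "carol": ["you", "dave"],
}

graph = build_graph(shared_contacts)
print(sorted(graph["you"]))
```

Even though “you” appear nowhere as a sharer, the graph now records two of your relationships.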

Related: Sharing Goes Both Ways – No Secrets Social, and Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There, etc.

PS on making connections: two people (two nodes) in the same photograph (a shared location) define a connection / edge of “in the same place at the same time” between the people / nodes. Graph feedstocks are everywhere…
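
By way of a sketch (the photo data and the `co_occurrence_edges` helper are invented for illustration), every pair of people tagged in the same photo yields one “same place, same time” edge:

```python
from collections import Counter
from itertools import combinations


def co_occurrence_edges(photos):
    """Count how often each pair of people appears in the same photo."""
    edges = Counter()
    for people in photos:
        # Sort so that (ann, ben) and (ben, ann) are counted as the same edge.
        for a, b in combinations(sorted(set(people)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical tagged photos.
photos = [
    ["ann", "ben"],
    ["ben", "ann", "cat"],
    ["cat"],
]

edges = co_occurrence_edges(photos)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```

The counts act as edge weights: the more photos two people share, the stronger the inferred connection.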

When Will It Just Be Machines Talking to Machines?

Do you ever get the feeling that the machines are trying to influence the way you think, or take charge of your communicative acts?

One of the things I noticed for the first time today was that my WordPress editor seems to have started converting some pasted-in URLs to actual links, using the pasted URL as the href attribute value and with the link text pulled from the referenced page (WordPress Editor Generates Page Title Links From Pasted URLs). Thinking about it, this is an example of an autocompletion behaviour in which the machine has detected some pattern and “completed” it based on the assumption that I intend to “complete” the pattern by turning it from a URL to a web hyperlink.

That is, I paste in X but actually want to represent it as [Y](X) (a link represented in markdown, where Y is the link text and X the target URL) or <a href="X">Y</a> (an HTML link).
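
The “completion” step itself is trivial once you have a title for the page. A minimal sketch (these helper names are mine, not anything WordPress exposes):

```python
from html import escape


def markdown_link(text, url):
    """Render a link in markdown form: [Y](X)."""
    # Note: this sketch does not escape brackets or parentheses in the inputs.
    return f"[{text}]({url})"


def html_link(text, url):
    """Render a link in HTML form, escaping the text and the attribute value."""
    return f'<a href="{escape(url, quote=True)}">{escape(text)}</a>'


print(markdown_link("Y", "X"))
print(html_link("Y", "X"))
```

The interesting part is not the templating but the lookup that turns X into Y, which is the bit the editor does behind the scenes.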

I imagine most people are familiar with the notion that Google offers a range of autocompletion and autosuggestion terms when you start to type in a Google web search (I don’t think the voice search (yet?) starts to interrupt you when you ‘ok Google’ it (I don’t knowingly have any voice interfaces activated…))…

What I’ve also noticed over the last few days is that a Gmail update seems to have come along with a new, positively set default that opts me in to an autocomplete service there, at least when I’m replying to an email:

This service has been available since May, 2018, at least: SUBJECT: Write emails faster with Smart Compose in Gmail.

In look and feel, it’s very reminiscent of code autocompletion support in programming code editors. If you aren’t a programmer, know that computer programmes are essentially composed of fixed vocabulary terms (whether imposed by the language or defined within the programme itself), so code completion makes absolute sense to the people who built the Gmail software application and user interface. Why on earth wouldn’t you want it everywhere…
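
To see why completion is such a natural fit for code, here is a toy sketch of prefix completion over a fixed vocabulary: Python’s own reserved words plus a couple of made-up, programme-defined names. Real editors do something far more context-aware than this.

```python
import keyword


def complete(prefix, vocabulary):
    """Return every vocabulary term that starts with `prefix`, sorted."""
    return sorted(term for term in vocabulary if term.startswith(prefix))


# The language's fixed keywords, plus two hypothetical names defined in "our" programme.
vocabulary = set(keyword.kwlist) | {"my_function", "my_variable"}

print(complete("my_", vocabulary))
print(complete("whi", vocabulary))
```

With a small closed vocabulary the candidate list is short and usually right; free-text email is a rather different proposition.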

A couple of things concern me without even thinking about it:

  1. What could possibly go wrong…
  2. Does autocomplete change what people intend to write?

In the paper Responsible epistemic technologies: A social-epistemological analysis of autocompleted web search, Miller & Record write:

[U]sers’ exposure to autosuggestions is involuntary. Users cannot type a search without encountering autosuggestions. Once seen, they cannot “unsee” the results. …

Psychology suggests two likely consequences of involuntary exposure. First, initially disregarded associations sometimes transform into beliefs because humans are prone to source-monitoring errors: subjects mistake the original information source and may put more or less credence in the information than they would have given the correct source (e.g. Johnson et al., 1993). Someone might read [something] and initially disregard it, but later, having forgotten the source, recall having read it. This is supported by the sleeper effect, according to which when people receive a message with a discounting cue, they are less persuaded by it immediately than later in time (Kumkale and Albarracín, 2004). Second, involuntary exposure to certain autosuggestions may reinforce unwanted beliefs. Humans are bad at identifying and rooting out their implicit biases (Kenyon, 2014). Because exposure is involuntary, even subjects hygienic in their epistemic practices may be negatively affected.

[A]utosuggestions interactively affect a user’s inquiry, leading to paths she might not have pursued otherwise. Effectively, if a user looks at the screen, she can’t help but see the autosuggestions, and these impressions can affect her inquiry. Autosuggestions may seem to a user to delimit the possible options or represent what most people find relevant, either of which may change her search behavior. She may change her search terms for one of the suggestions, add or subtract additional terms to rule out or in suggested results. She may abandon her search altogether because the autosuggestions seem to provide the answer or indicate that there is no answer to be found; that is, she may assume that because nothing is being suggested, no results for the query exist. Furthermore, because the displayed information may be incomplete or out of context, she might reach a different conclusion on the basis of autosuggestions than if she actually visited the linked page.

Altering a user’s path of inquiry can have positive effects, as when he is exposed to relevant information he might not have encountered given his chosen search terms. But the effects may also be negative. … Such derails in inquiry may be deleterious… .

Finally, autosuggestions affect users’ belief formation process in a real-time interactive and responsive manner. “It helps to complete a thought,” as one user put this (Ward et al., 2012: 12). They may thus generate beliefs the user might not have had. Based on autosuggestions, I might erroneously believe [X]. Alternatively, I might come to believe that these things are possible, where before I held no beliefs about them, or I might give these propositions more credence than I would otherwise. Autocomplete is like talking with someone constantly cutting you off trying to finish your sentences. This can be annoying when the person is way off base or pleasant when he seems like your mind-reading soulmate. Either way, it has a distracting, attention-shifting effect that other interactive interface technologies lack.

As an aside, I also note that as well as offering autosuggestion possibilities that intrude on our personal communicative acts, the machine is also acting as a proxy that can buffer us from having to engage in those acts. Spam filtering is one example (I tend not to review my spam filter folders, so I’m not sure how many legitimate emails get passed through to them. Hmm, thinks, does a contemporary version of the OSS Simple Sabotage Field Manual (markdown) include suggestions to train corporate spam filters on legitimate administrative internal emails?)

A good example of creeping intermediation comes in the form of Google Duplex, a voice agent / assistant demoed earlier this year that can engage in certain phone-based, voice interactions on your behalf. It’s about to start appearing in the wild on Pixel phones (Pixel 3 and on-device AI: Putting superpowers in your pocket).

One of the on-device features that will be supported is a new Call Screen service:

You can see who’s calling and why before you answer a call with the help of your Google Assistant. …

  1. When someone calls, tap Screen call.
  2. The Google Assistant will … ask who’s calling and why. Then you’ll see a real-time transcript of how the caller responds.
  3. Once the caller responds, choose a suggested response or an action. Here are some responses and what the caller will hear:
    • Is it urgent? – “Do you need to get a hold of them urgently?”
    • Report as spam – “Please remove this number from your mailing and contact list. Thanks, and goodbye.”
    • I’ll call you back – “They can’t talk right now, but they’ll give you a call later. Thanks, and goodbye.”
    • I can’t understand – “It’s difficult to understand you at the moment. Could you repeat what you just said?”

But not actually “transfer” the call to the user so they can answer it?!

According to Buzzfeed (The Pixel 3: Everything You Need To Know About Google’s New Phone), the Call Screen bot will answer the phone for you and challenge the caller: “The person you’re calling is using a screening service and will get a copy of this conversation. Go ahead and say your name and why you’re calling.” This raises the interesting question of how another (Google) bot on the calling side might respond…

(By the by, thinks: phone receptionists – the automated voice assistants will be after your job…)

It’s probably also worth remembering that:

[s]ometimes Call Screen may not understand the caller. To ask the caller to repeat themselves, tap I can’t understand. The caller will hear, “It’s difficult to understand you at the moment. Could you repeat what you just said?”

So now rather than you spending a couple of seconds to answer the phone, realise it’s a spam caller, and hang up, you have to take even more time out waiting on Call Screen, reading the Call Screen messages and training it a bit further when it gets stuck? But I guess that’s how you pay for its freeness.

Anyway, as part of your #resistance defense toolkit, maybe add that phrase to your growing list of robot tells. (Is there a full list anywhere?)

As well as autocomplete and autosuggest, I note the ever engaging Pete Warden blogging recently on the question of Will Compression Be Machine Learning’s Killer App?:

One of the other reasons I think ML is such a good fit for compression is how many interesting results we’ve had recently with natural language. If you squint, you can see captioning as a way of radically compressing an image. One of the projects I’ve long wanted to create is a camera that runs captioning at one frame per second, and then writes each one out as a series of lines in a log file. That would create a very simplistic story of what the camera sees over time, I think of it as a narrative sensor.

The reason I think of this as compression is that you can then apply a generative neural network to each caption to recreate images. The images won’t be literal matches to the inputs, but they should carry the same meaning. If you want results that are closer to the originals, you can also look at stylization, for example to create a line drawing of each scene. What these techniques have in common is that they identify parts of the input that are most important to us as people, and ignore the rest.

Which is to say: compress the image by creating a description of it and then generating an image based on the description at the other end. A picture may save a thousand words, but if the thousand words compress smaller than the picture in terms of bits and bytes, that makes sense to the data storage and transmission folk, albeit at the trade-off of increased compute requirements on either side.

Hmm, this reminds me of a thinkses from over a decade ago on The Future of Music:

My expectation over the last 5 years or so was that CD singles/albums would start to include remix applications/software studios on that medium – but I’ve been tracking it as a download reality on and off for the last 6 months or so (though it’s been happening for longer).

That said – my expectation of getting the ‘src’ on the CD was predicated on the supply of the remix application on the CD too, rather than it being pre-installed on the users’ computer.

The next thing I’m looking out for is a ‘live by machine’ gig, where a club franchise has real hardware/synths being played at a distance by the band, who are maybe in another venue owned by that club chain?

For this, you have to imagine banks of synths receiving (MIDI) control signals over the net from the real musicians playing live elsewhere.

This is not so much online jamming (or here: eJamming) – where you mix realtime audio feeds from other musicians on the web with your own efforts – as real time creation of the music from audio generators…

It’s also interesting to note that the “reproducibility” requirement associated with shipping the software tooling required to let you make use of the data (“predicated on the supply of the remix application on the CD too”), as well as the data, was in my thinking even then…

Pete Warden goes on:

It’s not just images

There’s a similar trend in the speech world. Voice recognition is improving rapidly, and so is the ability to synthesize speech. Recognition can be seen as the process of compressing audio into natural language text, and synthesis as the reverse. You could imagine being able to highly compress conversations down to transmitting written representations rather than audio. I can’t imagine a need to go that far, but it does seem likely that we’ll be able to achieve much better quality and lower bandwidth by exploiting our new understanding of the patterns in speech.

I even see interesting possibilities for applying ML compression to text itself. Andrej Karpathy’s char-rnn shows how well neural networks can mimic styles given some examples, and that prediction is a similar problem to compression. If you think about how much redundancy is in a typical HTML page, it seems likely that there would be some decent opportunities for ML to improve on gzip. This is getting into speculation though, since I don’t have any ML text compression papers handy.

Ah ha…

(Tangentially related, ramblings on Google languaging: Translate to Google Statistical (“Google Standard”?!) English? and Google Translate Equilibrium Finder. FWIW, these aren’t machine generated “related” items: they’re old thoughts I remembered blogging about before…)

WordPress Editor Generates Page Title Links From Pasted URLs

Noting that if I paste a URL into my WordPress.com visual editor, behind the scenes it can look up the link, pull a page title back, and create a link using the title as link text and the link set to the URL I pasted in:

I’m not sure if this requires any particular metadata on the page referenced by the link? Certainly, it doesn’t seem to work for every URL? But then, Pete Warden’s blog – what do you expect?!;-)

Will Compression Be Machine Learning’s Killer App?

Here’s a closer look, watching the page traffic that’s returned using browser developer tools (View->Developer in Chrome):

This is what’s returned:

{"success":true,"data":{"body":"<a href=\"https:\/\/petewarden.com\/2018\/10\/16\/will-compression-be-machine-learnings-killer-app\/\">Will Compression Be Machine Learning’s Killer App?","attr":{"width":676,"height":1000}}}
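
For anyone wanting to poke at a response of this shape, here is a minimal sketch that pulls the link target and link text back out. It assumes the inner quotes around the href arrive backslash-escaped, so that the payload is valid JSON; the `LinkExtractor` class is my own helper, not part of any WordPress API.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of an <a> tag and any text found in the fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        self.text += data


# The captured response, assuming the quotes inside "body" are escaped.
raw = r'{"success":true,"data":{"body":"<a href=\"https:\/\/petewarden.com\/2018\/10\/16\/will-compression-be-machine-learnings-killer-app\/\">Will Compression Be Machine Learning’s Killer App?","attr":{"width":676,"height":1000}}}'

payload = json.loads(raw)          # json.loads turns \/ back into /
parser = LinkExtractor()
parser.feed(payload["data"]["body"])
parser.close()                     # flush any buffered text

print(payload["success"])
print(parser.href)
print(parser.text)
```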


And this is what was sent:

I wonder if the same mechanic is used to embed YouTube videos when you paste in a YouTube URL? Although that may be done in the web page itself (you can generate the YouTube embed code simply by extracting the video ID from the pasted URL and constructing the embed code using a template).
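
A sketch of that template approach, under a few assumptions: the URL patterns handled, the iframe attributes and the function names are my own choices for illustration, and I have no idea whether WordPress does anything like this. The video ID in the example is made up.

```python
from urllib.parse import parse_qs, urlparse

# Assumed template; the width/height values are arbitrary.
EMBED_TEMPLATE = (
    '<iframe width="560" height="315" '
    'src="https://www.youtube.com/embed/{video_id}" '
    'frameborder="0" allowfullscreen></iframe>'
)


def youtube_video_id(url):
    """Extract the video ID from a watch URL or a youtu.be short URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


def youtube_embed(url):
    """Build embed code from a pasted URL, or return None if it isn't recognised."""
    video_id = youtube_video_id(url)
    if video_id is None:
        return None
    # No validation of the ID is attempted in this sketch.
    return EMBED_TEMPLATE.format(video_id=video_id)


print(youtube_embed("https://www.youtube.com/watch?v=abc123XYZ_-"))
print(youtube_embed("https://example.com/not-a-video"))
```

No network request is needed for this, which is why it could happen entirely in the page, unlike the title lookup, which needs someone to fetch the linked page.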