You’ve Been Shared… And Your DNA Is Likely Out There…

Via Bruce Schneier (How DNA Databases Violate Everyone’s Privacy), a paper in Science by Erlich et al. (Identity inference of genomic data using long-range familial searches) and a related news article (Genome hackers show no-one’s DNA is anonymous anymore) show how your DNA is likely out there thanks to others sharing related DNA on… From the paper abstract:

Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement authorities have exploited some of these databases to identify suspects via distant familial relatives. Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power of this technique. We project that about 60% of the searches for individuals of European-descent will result in a third cousin or closer match, which can allow their identification using demographic identifiers.

Reminds me of a BBC Radio 4 play I caught a fragment of a week or so ago: a character was identified by the police through his DNA, not because his own DNA was on record, but because his son’s was. DNA plus the laws of genetics means that relationships can also be inferred.

From the news article, another paper, this time by Kim et al. (Statistical Detection of Relatives Typed with Disjoint Forensic and Biomedical Loci).

But first, to set the scene, an earlier paper referenced from that one by Edge et al [Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets]:

With the increasing abundance of genetic data, the usefulness of a genetic dataset now depends in part on the possibility of productively linking it with other datasets. … Such efforts magnify the value of genetic datasets without requiring coordinated genotyping.

One issue that arises in combining multiple datasets is the record-matching problem: the identification of dataset entries that, although labeled differently in different datasets, represent the same underlying entity (67). In a genetic context, record matching involves the identification of the same individual genome across multiple datasets when unique identifiers, such as participant names, are unavailable. This task is relatively simple when large numbers of SNPs are shared between marker sets: if records from different datasets match at enough of the shared SNPs, then they can be taken to represent the same individual.
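As an aside, the "simple" case described in that paragraph, where two marker sets share SNPs, can be sketched in a few lines of Python. This is only a toy illustration and not anything taken from the paper: the record names, marker ids, genotypes and the 0.99 threshold are all made up for the example.

```python
def match_fraction(record_a, record_b):
    """Fraction of shared markers at which two genotype records agree."""
    shared = record_a.keys() & record_b.keys()
    if not shared:
        return None  # no overlapping markers: this naive approach cannot help
    agree = sum(1 for marker in shared if record_a[marker] == record_b[marker])
    return agree / len(shared)


def best_match(query, candidates, threshold=0.99):
    """Name of the best-agreeing candidate at or above threshold, else None."""
    scores = {}
    for name, record in candidates.items():
        score = match_fraction(query, record)
        if score is not None and score >= threshold:
            scores[name] = score
    return max(scores, key=scores.get) if scores else None


# Hypothetical toy data: marker id -> genotype.
dataset_1 = {"person_x": {"rs1": "AA", "rs2": "AG", "rs3": "CC", "rs4": "CT"}}
dataset_2 = {
    "sample_17": {"rs1": "AA", "rs2": "AG", "rs3": "CC", "rs5": "GG"},
    "sample_42": {"rs1": "AG", "rs2": "GG", "rs3": "CC", "rs5": "GT"},
}

result = best_match(dataset_1["person_x"], dataset_2)
print(result)
```

With only three shared markers a perfect agreement proves very little; the quoted passage is talking about large numbers of shared SNPs, where chance agreement becomes vanishingly unlikely.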

What if no markers are shared between two genetic datasets? Can genotype records that rely on disjoint sets of markers be linked? Genetic record matching with no overlapping markers has many potential uses. Datasets could become cross-searchable even if no effort has been made to include shared markers in different marker sets. Record matching between new and old marker sets could determine whether an individual typed with a new set has appeared in earlier data, thereby facilitating deployment of new marker sets that are backward-compatible with past sets.

The presence of linkage disequilibrium (LD)—nonindependence of genotypes at distinct markers, primarily those that are proximate on the genome—can enable record matching without shared markers. As a result of LD between markers in different datasets, certain genotype pairs are more likely to co-occur, so that some potential record pairings are more likely than others.
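To get a feel for why LD makes "some potential record pairings more likely than others", here is a minimal Python sketch of my own. It is not the method used in the paper: it assumes a single pair of linked markers, one in each dataset, and a made-up joint genotype probability table (`JOINT`), then compares how likely a genotype pair is if both records come from the same person versus two independent draws from the population.

```python
import math

# Hypothetical joint genotype probabilities for ONE linked marker pair:
# (genotype at the marker in dataset A, genotype at the marker in dataset B).
# The numbers are invented for illustration and sum to 1.
JOINT = {
    ("AA", "11"): 0.40, ("AA", "12"): 0.05, ("AA", "22"): 0.05,
    ("AG", "11"): 0.05, ("AG", "12"): 0.20, ("AG", "22"): 0.05,
    ("GG", "11"): 0.02, ("GG", "12"): 0.03, ("GG", "22"): 0.15,
}


def marginal(index):
    """Marginal genotype probabilities for marker A (index 0) or B (index 1)."""
    totals = {}
    for key, p in JOINT.items():
        totals[key[index]] = totals.get(key[index], 0.0) + p
    return totals


P_A, P_B = marginal(0), marginal(1)


def log_likelihood_ratio(g_a, g_b):
    """log P(pair | same person) - log P(pair | independent individuals)."""
    return math.log(JOINT[(g_a, g_b)] / (P_A[g_a] * P_B[g_b]))


# Positive score: the pairing is more plausible as one person than as two.
print(log_likelihood_ratio("AA", "11"))
# Negative score: the pairing is less plausible as one person.
print(log_likelihood_ratio("AA", "22"))
```

One marker pair gives only a weak signal; the idea in the quoted passage relies on accumulating evidence like this across many markers.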

Now back to the Kim et al paper:

Forensic DNA testing sometimes seeks to identify unknown individuals through familial searching, or relatedness profiling. When no exact match of a query DNA profile to a database of profiles is found, investigators can potentially test for a partial match to determine whether the query profile might instead represent a close relative of a person whose profile appears in the database (Bieber et al., 2006; Gershaw et al., 2011; Butler, 2012). A positive test leads investigators to consider relatives of the person with the partial match as possible contributors of the query profile.

Familial searching expands the potential to identify unknown contributors beyond the level achieved when searching exclusively for exact database matches. The larger set of people accessible to investigators—database entrants, plus their relatives—can increase the probability that the true contributor of a query profile is identified (Bieber et al., 2006; Curran and Buckleton, 2008). However, the accessibility of relatives to investigators in database queries raises privacy and legal policy concerns, as considerations guiding appropriate inclusion of DNA profiles in databases and subsequent use of those profiles generally focus on the contributors of the profiles rather than on close relatives who are rendered accessible to investigators (Greely et al., 2006; Murphy, 2010). Concerns about privacy vary in magnitude across populations, as false-positive identifications of relatives might be substantially more likely to affect members of populations with lower genetic diversity, and hence a greater likelihood of chance partial matches (Rohlfs et al., 2012, 2013), or members of populations overrepresented in DNA databases (Greely et al., 2006; Chow-White and Duster, 2011).

…[Previously (see above…), w]e showed that records could be matched between databases with no overlapping genetic markers, provided that sufficiently strong linkage disequilibrium (LD) exists between markers appearing in the two databases (Edge et al., 2017). … [The approach] also uncovers privacy concerns, as an individual present in a SNP [single-nucleotide polymorphism] database —collected in a biomedical, genealogical, or personal genomics setting, for example — might be possible to link to a CODIS [Combined DNA Index System] profile, and vice versa, in a manner not intended in the context of either database examined in isolation. First, a SNP database entrant could become accessible to forensic investigation. Second, although in the United States, the use of forensic genetic markers given protections against unreasonable searches is based partly on a premise that these markers provide only the capacity for identification and do not expose phenotypic information (Greely and Kaye, 2013; Katsanis and Wagner, 2013; United States Supreme Court, 2013), phenotypes that are possible to predict from a SNP profile could potentially be predicted from a CODIS profile by connecting the CODIS profile to a SNP profile and then predicting phenotypes from the SNPs. Does cross-database record matching extend to relatives? In other words, is it possible to identify a genotype record with one set of genetic markers as originating from a relative of the contributor of a genotype record obtained with a distinct, nonoverlapping set of markers? If so, then new marker systems in the forensic context could permit relatedness profiling in a manner that is compatible with existing marker systems, as a profile from a new SNP or DNA sequence system could be tested for relationship matches to existing microsatellite profiles. 
However, a substantial privacy concern would also be raised, as inclusion in a biomedical, genealogical, or personal genomics dataset could expose relatives of the participant to forensic investigation; moreover, phenotypes of a relative could potentially be identifiable from a forensic profile.

[The result?]

We have found that not only can STR and SNP records be identified as belonging to the same individual, in many cases, STR and SNP profiles can be identified as belonging to close relatives—even though the profiles have no markers shared in common.

The possibility of performing familial searching of forensic profiles in SNP databases, while raising new concerns, also alters an existing concern, namely the unequal representation of populations in forensic databases. In profile queries to search for a relative already in a forensic database, populations overrepresented in databases owing to overrepresentation in criminal justice systems are likely to produce more identifications, potentially contributing to further overrepresentation (Greely et al., 2006; Chow-White and Duster, 2011; Rohlfs et al., 2013). Record-matching queries to biomedical, genealogical, or personal-genomic databases, however, will instead produce more identifications in different populations emphasized in genome-wide association and personal genomics (Chow-White and Duster, 2011; Popejoy and Fullerton, 2016; Landry et al., 2017).

Have You Been Shared?

And that’s part of the problem with relationships in an information society: networks can be modelled as mathematical objects known as graphs, in which things (nodes) are connected by edges. So even if you don’t share any information about your own edges, anyone who shares an edge that links to you has shared you too.
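A minimal Python sketch of that idea, using invented names and a hypothetical `shared_contacts` dictionary standing in for whatever other people have chosen to upload (address books, friend lists, DNA matches):

```python
from collections import defaultdict


def build_graph(shared_contacts):
    """Build an undirected graph from the contact lists people chose to share."""
    graph = defaultdict(set)
    for sharer, contacts in shared_contacts.items():
        for contact in contacts:
            graph[sharer].add(contact)
            graph[contact].add(sharer)  # the edge exposes both ends
    return graph


# Hypothetical data: only alice and carol have shared anything.
shared_contacts = {
    "alice": ["bob", "you"],
    "carol": ["you", "dave"],
}

graph = build_graph(shared_contacts)

# "you" shared nothing, but still appear in the graph with two known edges.
print(sorted(graph["you"]))
```

The node for "you" is built entirely from other people's declarations, which is the sense in which you have been shared.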

Related: Sharing Goes Both Ways – No Secrets Social; Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There; etc.

PS on making connections: two people (two nodes) in the same photograph (a shared location) define a connection / edge of “in the same place at the same time” between the people / nodes. Graph feedstocks are everywhere…
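As a rough Python sketch of how photographs could become a graph feedstock, assume a hypothetical `photos` dictionary mapping each photo to the people identified in it (the ids and names are invented):

```python
from itertools import combinations


def co_occurrence_edges(photos):
    """Return the set of person pairs who appear together in any photo."""
    edges = set()
    for people in photos.values():
        # Every pair of people in one photo was "in the same place at the same time".
        for a, b in combinations(sorted(set(people)), 2):
            edges.add((a, b))
    return edges


# Hypothetical data: photo id -> people tagged or recognised in it.
photos = {
    "photo_001": ["ann", "ben", "cat"],
    "photo_002": ["ben", "dan"],
}

edges = co_occurrence_edges(photos)
print(sorted(edges))
```

Sorting the names before pairing them keeps each edge in a single canonical order, so the same pair seen in two photos is only counted once.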

Author: Tony Hirst

I'm a lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
