Promises of Privacy and Anonymity
The consumer genetic testing market is expected to be worth $45 billion by 2024. The industry has grown rapidly, especially in the last decade. All it takes is mailing in a saliva sample, and in less than six weeks you can discover relatives you never knew existed, get an ethnicity estimate, find out whether you carry a genetic mutation that could predispose you to certain diseases or health complications, and much more. There is one huge drawback, though: genomic data is the ultimate individual identifier.
All of the large consumer genetic testing companies state in their privacy policies that customers' genetic data is de-identified. In most cases the company lets you choose whether to opt in to or out of genetic research with third parties. (Approximately 80% of 23andMe customers opt in to allow their de-identified genetic data to be used for research.) Drugmaker GlaxoSmithKline PLC recently invested $300 million in 23andMe in a deal that gives the drugmaker access to 23andMe's de-identified genetic database. In most cases your PII (Personally Identifiable Information) is stored completely separately from any genetic information. Your personal information is assigned a random customer identification number, while your genetic data is identified only by a barcode. Keeping PII and genetic data in physically separate computing environments is considered the industry standard for security.
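The separation described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not any company's actual system: the three dictionaries stand in for separate computing environments, and the function names, the `rs0001`-style marker IDs, and the sample customer are all made up for the example.

```python
import secrets

# Three separate stores standing in for physically separate environments.
# All names here are hypothetical, chosen only for illustration.
pii_store = {}      # customer_id -> personal details
genetic_store = {}  # barcode -> genotype data
link_table = {}     # customer_id -> barcode (the only bridge between the two)


def register_customer(name, email):
    """Store PII under a random customer ID and issue a random sample barcode."""
    customer_id = secrets.token_hex(8)
    barcode = secrets.token_hex(8)
    pii_store[customer_id] = {"name": name, "email": email}
    link_table[customer_id] = barcode
    return customer_id, barcode


def store_genotypes(barcode, genotypes):
    """The lab side sees only the barcode, never the name or email."""
    genetic_store[barcode] = genotypes


customer_id, barcode = register_customer("Jane Doe", "jane@example.com")
store_genotypes(barcode, {"rs0001": "AG", "rs0002": "CC"})

# The genetic record carries no name or email, only genotypes.
print(genetic_store[barcode])
```

Note what the sketch does not protect: the genotypes in `genetic_store` are still unique to the person, which is the weakness the rest of this post is about.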
Consumers' Top Concerns
The genetic testing companies' promise that our data is de-identified should make us feel better, right? Here's the problem: with the use of machine learning, de-identified data can now be re-identified. Genomic data is highly distinguishable. There are approximately 5 million SNPs (Single Nucleotide Polymorphisms) in a person's genome, and it has been reported that a set of only 30 to 80 SNPs is enough to uniquely identify an individual. Keep in mind that genetic variation from individual to individual is only about 0.5%. Most genetic testing technologies already rely on some form of machine learning or AI (Artificial Intelligence). Raw DNA sequence data is often unordered and unstructured; bioinformatics is what allows us to put it into an order that makes it usable. Population geneticists use one-hot encoding to transform the DNA alphabet (A, C, G, T) into binary vectors, which puts the data into a format that can be used with deep learning. The data size is also reduced by discarding the non-mutated locations in the genome, because they don't carry information that is useful to us. (The only useful data is the part of your DNA that is DIFFERENT from other people's.)
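To make the encoding step concrete, here is a minimal sketch of one-hot encoding combined with dropping positions that do not vary. It assumes short, already-aligned toy sequences that I made up for the example; real pipelines work on far larger data and use specialized tools, so treat the function names and data as hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as a 4-element 0/1 vector in A, C, G, T order."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]


def variant_positions(sequences):
    """Return the positions where the aligned sequences do not all agree."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


# Toy aligned sequences (made up); they differ only at positions 2 and 4.
samples = ["ACGTAC", "ACGTTC", "ACCTAC"]

keep = variant_positions(samples)
reduced = ["".join(seq[i] for i in keep) for seq in samples]

print(keep)                 # [2, 4]
print(reduced)              # ['GA', 'GT', 'CA']
print(one_hot(reduced[0]))  # [[0, 0, 1, 0], [1, 0, 0, 0]]
```

Six bases per person shrink to two, and those two are exactly the ones that tell the three samples apart. That is why the reduced data is both convenient for deep learning and so identifying.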
Examples of Re-identification
In 2008 James Watson (who, with Francis Crick, discovered the double-helix structure of DNA) released his sequenced genome to a public database; however, he chose to leave out his APOE gene. APOE (Apolipoprotein E) is a gene on chromosome 19 involved in making a protein that helps carry cholesterol and other types of fat in the bloodstream. The APOE E4 allele is the major known genetic risk factor for late-onset Alzheimer's disease. Researchers later showed that a statistical model could infer Watson's missing APOE status with a very high degree of confidence from the rest of his published genome.
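How can a withheld gene be inferred at all? One general idea is that nearby variants tend to be inherited together, so a visible marker can predict a hidden one. The sketch below is a toy illustration of that idea only; it is not the model used in the Watson case, and the reference panel, the marker alleles, and the function names are all invented for the example.

```python
from collections import Counter, defaultdict

# Made-up reference panel of people whose data is fully known:
# (allele at a nearby visible marker, allele at the hidden site).
reference_panel = [
    ("A", "e4"), ("A", "e4"), ("A", "e4"), ("A", "e3"),
    ("G", "e3"), ("G", "e3"), ("G", "e3"), ("G", "e4"),
]


def conditional_frequencies(panel):
    """Estimate P(hidden allele | marker allele) by counting the panel."""
    counts = defaultdict(Counter)
    for marker, hidden in panel:
        counts[marker][hidden] += 1
    return {
        marker: {hidden: n / sum(c.values()) for hidden, n in c.items()}
        for marker, c in counts.items()
    }


def infer_hidden(marker_allele, panel):
    """Return the most likely hidden allele and its estimated probability."""
    freqs = conditional_frequencies(panel)[marker_allele]
    best = max(freqs, key=freqs.get)
    return best, freqs[best]


print(infer_hidden("A", reference_panel))  # ('e4', 0.75)
```

With a single marker the guess is right only three times out of four in this toy panel. Combining many correlated markers is what can push the confidence much higher, which is why redacting one gene offers limited protection.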
Also, in a recent study the authors were able to infer the identities of 50 anonymous male participants whose Y-DNA had been sequenced for the 1000 Genomes Project. The researchers not only uncovered the identities of the anonymized men, they also worked out who their family members were using publicly available pedigrees.
In Part 2 I discuss genetic privacy laws and what can be done: https://www.bentleybiosec.com/blog/2019/genetictestingconcerns2