Read mapping is one of the most basic tasks in human genomic data analysis. In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. Once pre-processing has produced high-quality data, read mapping (also called “alignment”) takes place. If a reference genome is available, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome. If no reference genome is available, the reads need to be assembled first with de novo transcriptome assembly. Comparing every read against every position of the human genome with a full alignment algorithm is far too slow for the number of reads involved, which is why mapping tools rely on pre-computation (typically an index built from the reference) to be efficient.
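To make the pre-computing idea concrete, here is a minimal sketch of index-based mapping. It assumes a simple seed-and-extend scheme: index every k-mer of the reference once, look up short seeds from each read, and only compare the read against the few positions the seeds point to. The function names, the toy reference, and the mismatch threshold are my own illustrative choices, not how any particular mapping tool works.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Pre-compute a lookup table: k-mer -> list of start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for every placement within the mismatch limit."""
    # Seeding: cut the read into non-overlapping k-mers and look each one up.
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    # Extension: compare the whole read only at the candidate positions.
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy data (hypothetical): the read matches position 8 with one mismatch.
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
index = build_kmer_index(reference, k=4)
read = "TAGCCGATAGGA"
print(map_read(read, reference, index, k=4))  # [(8, 1)]
```

The index is built once and reused for every read, so the expensive comparison only runs at a handful of candidate positions instead of across the whole genome.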
Clouds are inexpensive resources that are useful for analyzing the massive amount of human genomic data. HOWEVER, outsourcing human genome computation to a public cloud comes along with a HUGE privacy concern. As I’ve mentioned in my other blogs, even a small amount of human genome sequence contains enough information to potentially identify an individual. A possible solution would be to use a secure algorithm that delegates most of the computation to the public cloud and performs the encryption and decryption on the private cloud. With this approach, the only things exposed on the public cloud would be keyed hash values of seeds and l-mers (short subsequences of length l).
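Here is a rough sketch of how that split could look. It assumes HMAC-SHA256 as the keyed hash (I haven't said which hash function the approach uses, so that is my choice), and the key, function names, and toy sequences are hypothetical. The private cloud holds the secret key and hashes the seeds; the public cloud only ever compares digests.

```python
import hashlib
import hmac
from collections import defaultdict


def keyed_hash(key, seed):
    """Keyed hash of one seed; without the key the digest cannot be recomputed."""
    return hmac.new(key, seed.encode("ascii"), hashlib.sha256).hexdigest()


def hash_reference_seeds(key, reference, seed_len):
    """Private cloud: hash every reference substring of length seed_len."""
    table = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        table[keyed_hash(key, reference[pos:pos + seed_len])].append(pos)
    return dict(table)


def hash_read_seeds(key, read, seed_len):
    """Private cloud: hash the non-overlapping seeds of one read."""
    return [
        (offset, keyed_hash(key, read[offset:offset + seed_len]))
        for offset in range(0, len(read) - seed_len + 1, seed_len)
    ]


def match_on_public_cloud(hashed_reference, hashed_seeds):
    """Public cloud: match digests only and return candidate start positions."""
    candidates = set()
    for offset, digest in hashed_seeds:
        for pos in hashed_reference.get(digest, []):
            if pos >= offset:
                candidates.add(pos - offset)
    return sorted(candidates)


# Hypothetical key and toy data; the key never leaves the private cloud.
secret_key = b"hypothetical-key-kept-on-private-cloud"
reference = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
read = "TAGCCGATAGGA"

hashed_reference = hash_reference_seeds(secret_key, reference, seed_len=4)
hashed_seeds = hash_read_seeds(secret_key, read, seed_len=4)
candidate_starts = match_on_public_cloud(hashed_reference, hashed_seeds)
print(candidate_starts)  # [8]
```

The public cloud does the bulk lookup work but never sees a raw base, and the private cloud would then check the few candidate positions against the actual read. A real deployment would need more care than this (for example, about what repeated digests reveal), so treat it as an illustration of the idea only.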