In 2023, you can get your personal genome sequenced for under 1000 Euros. If
you do, you will obtain the data (50 to 100 GB) on a USB stick or hard drive.
The data consists of many short strings (of length 100 or 150) over the famous
DNA alphabet {A,C,G,T}, for a total of 50 to 100 billion letters.
With this personal data, you may want to learn about your ancestry, or judge
your personal risk of getting various conditions, such as myocardial
infarction, stroke, breast cancer, and others. To do so, you must look for
known DNA variants in your individual genome that are associated with certain
population groups or diseases.
Because the raw data ("reads") consists of essentially random substrings of
the genome, the first step is to find the place of origin of each read in the
genome, an error-tolerant pattern search task.
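To make this concrete, the following toy sketch (Python, not our production
code) shows the classical seed-and-lookup idea behind locating a read: index
all k-mers of a reference and collect candidate origins from exact k-mer
matches of the read. Real read mappers add error tolerance and far more
engineering; all sequences and parameters below are made up for illustration.

    from collections import defaultdict

    def build_kmer_index(reference: str, k: int) -> dict:
        """Map every k-mer of the reference to all positions where it occurs."""
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(i)
        return index

    def candidate_origins(read: str, index: dict, k: int) -> set:
        """Collect candidate start positions of the read in the reference,
        suggested by exact matches of the read's k-mers (the "seeds")."""
        candidates = set()
        for j in range(len(read) - k + 1):
            for pos in index.get(read[j:j + k], []):
                candidates.add(pos - j)  # read start implied by this seed
        return candidates

    reference = "ACGTACGTTGCAACGTTGCA"   # toy "genome"
    index = build_kmer_index(reference, k=5)
    print(candidate_origins("CGTTGCA", index, k=5))   # {5, 13}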
In a medium-scale research study (say, on heart disease), we face similar
tasks for a few hundred individual patients and healthy controls, for a total
of roughly 30 to 50 TB of data, delivered on a few USB hard drives.
After primary analysis, the task is to find genetic variants (or, more
coarsely, genes) related to the disease, i.e., we have a pattern mining problem
with millions of features and a few hundred samples.
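As a simplified illustration of this setting (not the statistical methodology
of a real study), one could test each variant separately for association with
case/control status. The random data and sizes below are made up, and real
analyses must additionally correct for multiple testing, population structure,
and covariates.

    import numpy as np
    from scipy.stats import fisher_exact

    rng = np.random.default_rng(0)
    n_samples, n_variants = 300, 1_000   # real studies: millions of variants
    genotypes = rng.integers(0, 2, size=(n_samples, n_variants)).astype(bool)
    is_case = rng.integers(0, 2, size=n_samples).astype(bool)   # disease status

    p_values = []
    for v in range(n_variants):
        carrier = genotypes[:, v]
        # 2x2 contingency table: carrier/non-carrier vs. case/control
        table = [[int(np.sum(carrier & is_case)), int(np.sum(carrier & ~is_case))],
                 [int(np.sum(~carrier & is_case)), int(np.sum(~carrier & ~is_case))]]
        _, p = fisher_exact(table)
        p_values.append(p)

    print("smallest p-value:", min(p_values))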
The full workflow for such a study consists of more than 100,000 single steps,
including simple per-sample steps (e.g., removing low-quality reads) and
complex ones that involve statistical models across all samples for variant
calling. Particularly in a medical setting, each step must be fully
reproducible, and we need to trace data provenance and maintain a chain of
accountability.
In the past ten years, we have worked on and contributed to many aspects of
variant-calling workflows and realized that the strategy of attacking the
ever-growing data with ever-growing compute clusters and storage systems will
not scale well in the near future. Thus, our current work focuses on so-called
alignment-free methods, which have the potential to yield the same answers as
current state-of-the-art methods with 10 to 100 times less CPU work.
I will present our recent advances in laying better foundations for
alignment-free methods: engineered and optimized parallel hash tables for short
DNA pieces (k-mers), and the design of masks for gapped k-mers with optimal error
tolerance. These new methods will enable even small labs to analyze large
genomics datasets on a "good gaming PC", with an investment of less than 5000
Euros in computational hardware.
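To give a flavor of these two building blocks, here is a toy sketch (not our
engineered, parallel implementation): a 2-bit integer encoding of k-mers, as
one would use for keys in a specialized hash table, and extraction of gapped
k-mers via a binary mask. The specific sequence and mask are made up for
illustration; designing masks with optimal error tolerance is the hard part.

    ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def encode_kmer(kmer: str) -> int:
        """Pack a DNA string into an integer, 2 bits per letter."""
        code = 0
        for base in kmer:
            code = (code << 2) | ENCODE[base]
        return code

    def gapped_kmers(seq: str, mask: str):
        """Yield gapped k-mers: letters at '#' positions of the mask are kept,
        letters at '-' positions are ignored, which buys tolerance against
        errors at the ignored positions."""
        keep = [i for i, c in enumerate(mask) if c == "#"]
        for start in range(len(seq) - len(mask) + 1):
            window = seq[start:start + len(mask)]
            yield "".join(window[i] for i in keep)

    seq = "ACGTACGTTGCA"   # toy sequence
    mask = "##-#-##"       # hypothetical mask for illustration only
    for gkmer in gapped_kmers(seq, mask):
        print(gkmer, encode_kmer(gkmer))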
I will also advertise our workflow language and execution engine "Snakemake",
a combination of Make and Python that is now one of the most frequently used
bioinformatics workflow management tools, although it is not restricted to
bioinformatics research.
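As a taste of the syntax, here is a minimal Snakefile sketch (all sample names,
file paths, and command lines are illustrative, not part of a real pipeline):
each rule declares its inputs and outputs, from which Snakemake derives the
dependency graph, parallelizes independent jobs, and keeps the workflow
reproducible.

    # Toy Snakefile: map reads per sample, then call variants per sample.
    SAMPLES = ["patient1", "patient2"]

    rule all:
        input:
            expand("calls/{sample}.vcf", sample=SAMPLES)

    rule map_reads:
        input:
            reads="reads/{sample}.fastq.gz",
            ref="reference/genome.fa"
        output:
            "mapped/{sample}.bam"
        shell:
            "bwa mem {input.ref} {input.reads} | samtools sort -o {output}"

    rule call_variants:
        input:
            bam="mapped/{sample}.bam",
            ref="reference/genome.fa"
        output:
            "calls/{sample}.vcf"
        shell:
            "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv -o {output}"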