René Rahn, Max Planck Institute for Molecular Genetics, Department of Computational
Svenja Mehringer, Free University Berlin, Department of Mathematics and Computer
Science, Algorithmic Bioinformatics, Berlin
Marcel Ehrhardt, Free University Berlin, Department of Mathematics and Computer Science,
Algorithmic Bioinformatics, Berlin
In this full-day tutorial we are going to teach how to use modern C++ and efficient C++ libraries to rapidly develop tools and scripts that operate on and manipulate large-scale sequencing data. Further, we are going to develop a distributed search application that efficiently processes queries against large sequence databases.
The high variability and heterogeneity often observed in genomic data is challenging for many standard tools, for example those for read alignment and variant calling. These tools are often wrapped in complicated pre- and postprocessing data curation steps in order to obtain higher-quality results. However, these additional steps impose a high maintenance and performance burden on the established workflow and often do not scale to larger data sets. C++ is seldom considered the language of choice for these small processing steps, although it is the main language used in high-performance computing. We are going to show that programming in modern C++ can be as easy as using other modern high-level languages.
In addition, analysis pipelines are challenged by the massive amount of generated sequencing data for two reasons. First, almost every analysis pipeline requires some form of index over the data. Indexing works well for small databases such as a single human genome, but it fails for hundreds of gigabytes of data. Second, because sequencing times are short, reference indices quickly become outdated and must be recomputed.
This full-day tutorial is split into two parts. In the first part we are going to introduce fundamental concepts and principles of the C++ programming language. Further, we will teach how modern C++ features such as ranges and concepts can be used to rapidly develop high-quality C++ applications. This introduction to C++ is followed by a practical session in which participants will read typical files from sequencing experiments using the C++ library SeqAn and operate on the data with the principles taught. In the second part, after lunch, we are going to demonstrate a typical use case in many bioinformatics pipelines, namely aligning reads to a large sequence database using state-of-the-art tools. This demonstration is followed by a practical session in which we are going to implement a distributed search that efficiently handles large sequence databases using the SeqAn library and the SDSL. In the last 30 minutes of the day we are going to summarise the concepts learned and compare the developed methods to current approaches. Further, we are going to give a brief overview of tools and techniques important for high-quality and sustainable software engineering.
Students will develop
This tutorial is best suited for computational biologists and bioinformaticians whose research focuses on sequence analysis (e.g. genomics, metagenomics, proteomics, read alignment, variant detection). Fundamental knowledge of sequencing experiments and the data involved is required. We expect attendees to have intermediate programming experience in a high-level programming language, e.g. Python, Java or C++. Some basic C++ knowledge is helpful but not mandatory to complete the course successfully.
The first part of the tutorial targets beginner and intermediate C++ developers who want to learn more about modern C++ features such as ranges and concepts. The second part builds on the techniques and principles taught in the first part. Attendees with advanced C++ knowledge who are already acquainted with C++17/20 features can also join directly for the second part, which focuses on searching large sequence databases.
Attendees should bring their own laptop.
Software for the tutorial can be installed beforehand, but we will also dedicate some extra time for installing required software during the tutorial.