German Conference on Bioinformatics (GCB) 2020

14 - 17 September 2020,
Virtual Conference
 


WS4: BioC++ - solving daily bioinformatic tasks with C++ efficiently

Instructors:

René Rahn, Max Planck Institute for Molecular Genetics, Department of Computational
Biology, Berlin

Svenja Mehringer, Free University Berlin, Department of Mathematics and Computer
Science, Algorithmic Bioinformatics, Berlin

Marcel Ehrhardt, Free University Berlin, Department of Mathematics and Computer Science,
Algorithmic Bioinformatics, Berlin

Abstract:

In this full-day tutorial we are going to teach how to use modern C++ and utilise efficient C++ libraries to rapidly develop tools and scripts for operating on and manipulating large-scale sequencing data. Further, we are going to develop a distributed query search application that efficiently handles the processing of large sequence databases.

Motivation:
The high variability and heterogeneity often observed in genomic data is challenging for many standard tools, for example those for read alignment and variant calling. Often, these tools are wrapped in complicated pre- and postprocessing data curation steps in order to obtain higher-quality results. However, these additional steps impose a high maintenance and performance burden on the established work process and often do not scale to larger data sets. C++ is seldom considered as the language of choice for these small processes, although it is the main language used in high-performance computing. We are going to show that programming in modern C++ can be as easy as using other modern high-level languages.

In addition, analysis pipelines are challenged by the massive amount of generated sequencing data for two reasons. First, almost every analysis pipeline requires some form of indexing of the data, which works well for small databases such as a single human genome, but these pipelines are incapable of indexing hundreds of gigabytes of data. Second, because sequencing times are short, reference indices quickly become outdated and must be recomputed.

Course outline:
This tutorial is organised as a full-day tutorial split into two parts. In the first part we are going to introduce fundamental concepts and principles of the C++ programming language. Further, we will teach how modern C++ features such as ranges and concepts can be used to rapidly develop high-quality C++ applications. This introduction to C++ is followed by a practical session in which participants will read in typical files from sequencing experiments using the C++ library SeqAn and operate on the data with the taught principles. In the second part, after lunch, we are going to demonstrate a typical use case within many bioinformatic pipelines, namely aligning reads to a large sequence database using state-of-the-art tools. This demonstration is followed by a practical session in which we are going to implement a distributed search to efficiently handle large sequence databases using the SeqAn library and the SDSL. In the last 30 minutes of the day we are going to summarise the learned concepts and compare the developed methods to current approaches. Further, we are going to give a brief overview of tools and techniques important for high-quality and sustainable software engineering.

Learning Objectives:

Students will develop

  • skills in developing an application using the C++ programming language
  • skills in using modern C++ libraries to query large sequence databases (STL and SeqAn)
  • knowledge and understanding of modern C++ features, such as ranges and concepts
  • knowledge and understanding of modern and efficient data structures and algorithms crucial for large-scale genomic sequence analysis
  • knowledge and understanding of how to develop and sustain high-quality software

Intended audience and level:

This tutorial is best suited for computational biologists and bioinformaticians with a research focus on sequence analysis (e.g., genomics, metagenomics, proteomics, read alignment, variant detection). A fundamental knowledge of sequencing experiments and the data involved is required. We expect attendees to have an intermediate knowledge of programming in any high-level programming language, e.g. Python, Java or C++. Some basic C++ knowledge is helpful but not mandatory to successfully complete the course.

The first part of the tutorial targets beginner and intermediate C++ developers who want to learn more about modern C++ features like ranges and concepts. The second part of the tutorial builds upon the techniques and principles taught in the first part. Attendees with advanced C++ knowledge who are already acquainted with C++17/20 features can also join only the second part, which focuses on searching large sequence databases.

Requirements:

Attendees should bring their own laptop.
Software for the tutorial can be installed beforehand, but we will also dedicate some extra time for installing required software during the tutorial.

Prerequisites:

  • Git
  • g++ >= 7
  • SeqAn 3 (https://github.com/seqan/seqan3)
  • CMake >= 3.12

Alternatively, VirtualBox, if the attendee wishes to use the provided virtual image running Ubuntu.

DECHEMA e.V.

 
