Johannes Köster, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
Marcel Bargull, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
The typical data analyst must juggle multiple projects simultaneously, each with its own duration and software requirements. Because few analysts have formal training in structuring, or even writing, the code needed to perform an analysis, it is unsurprising that the iterative analytic process can produce a wide assortment of almost identically named files (e.g., “final_results.txt”, “final_results.version2.txt”, “final_results.really_final.txt”), all of unclear origin and produced by a hodgepodge of similarly poorly named scripts. The near impossibility of tracing a results file to the exact process that produced it creates untold difficulties both when publishing results and when planning subsequent experiments months or years later (after all, which of the “final_results” files was really the “right one”?). These issues are compounded by software paths and similar assumptions being hard-coded into scripts, which prevents easy replication of the analysis elsewhere. Performing analyses in a reproducible and traceable manner is clearly needed to combat such problems.
In this hands-on tutorial, we demonstrate how Conda can be used to deploy specific software versions easily, reproducibly, and without administrator credentials. Moreover, we demonstrate how Conda’s ability to create isolated software environments helps to avoid side effects between different analyses or between different steps of the same analysis. Attendees will also learn how to create Conda recipes themselves, so they can contribute new packages to projects such as Bioconda. We further demonstrate how Snakemake can be used in combination with Conda and containers to create reproducible analysis workflows and execute them on any platform, from workstations to clusters and the cloud.
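To make the idea of deploying specific software versions concrete, the following minimal sketch writes out the text of a Conda environment file with pinned versions. The helper name `render_environment`, the environment name `mapping`, and the package versions are illustrative assumptions rather than material from the tutorial; only the `conda-forge` and `bioconda` channel names and the `package=version` pin syntax are standard Conda conventions.

```python
def render_environment(name, channels, pinned):
    """Return the text of a Conda environment file with pinned versions.

    Hypothetical helper for illustration; not part of Conda or Snakemake.
    """
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    # "package=version" is Conda's syntax for requesting a specific version.
    # Sorting keeps the file stable across runs, which helps version control.
    lines += [f"  - {pkg}={version}" for pkg, version in sorted(pinned.items())]
    return "\n".join(lines) + "\n"


# Illustrative environment: the package versions are assumptions chosen for the example.
env_text = render_environment(
    name="mapping",
    channels=["conda-forge", "bioconda"],
    pinned={"samtools": "1.19", "bwa": "0.7.17"},
)
print(env_text)
```

Saved as `environment.yml`, such a file can be turned into an isolated environment with `conda env create -f environment.yml`. Keeping one file like this per analysis, or per analysis step, is one way to obtain the isolation between analyses described above.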
With over 6 million downloads, Bioconda is the leading platform for sustainable distribution of bioinformatics software. Averaging more than three new citations per week, Snakemake is one of the most widely used scientific workflow management systems.
Beginners, Intermediates, Core-Facility Staff
Attendees should have basic familiarity with Python, Git, and the command line.