MOOC

Reproducible Research II - Practices and tools for managing computations and data

How to improve your practices and your ability to manage and process larger amounts of data, complex computations, while controlling your software environment.

Open

5 May 2025

35 hours

🇬🇧

English

CC-BY-NC-SA

Course description

Following the success of the MOOC "Reproducible research: methodological principles for transparent science", the authors continue exploring reproducibility with a focus on massive data and complex calculations. These two MOOCs complement each other and offer a coherent training program on the subject.

In this second MOOC, you will learn how to manage large datasets and complex computations in controlled software environments, using formats such as JSON, FITS, and HDF5; platforms such as Zenodo and Software Heritage; and tools such as git-annex, Docker, Singularity, Guix, make, and Snakemake. These concepts are illustrated through a real-life use case on sunspot detection, demonstrating how to work in a reliable and reproducible way.

A new module added for this session offers exercises that illustrate how the tools and techniques we teach help in the daily practice of computational research. In interviews with experienced practitioners of reproducible research who have extensive expertise in data management and processing, we discuss related tools and techniques. These interviews may help you decide whether to invest in more elaborate tools, and which pitfalls you may encounter.

Course objectives

Manage research data:

  • understand the challenges posed by large volumes of data
  • archive code and data on well-known archives such as Software Heritage and Zenodo
  • integrate data into version control (git-annex)
  • use structured binary data formats (FITS, HDF5)

Use tools and techniques for controlling the software environment:

  • understand how software packages are built and managed
  • deploy software environments as containers (e.g. Docker)
  • manage software environments using a functional package manager (e.g. Guix)
  • work in controlled software environments on a daily basis

Automate long or complex computations using workflows:

  • understand the challenges of scaling up: long calculations, distributed calculations
  • choose a workflow tool adapted to your needs
  • automate a data analysis using make and Snakemake
  • control the software environments of a workflow

Who is this course for?

This MOOC is aimed at PhD students, researchers, Master's students, teachers and engineers from all disciplines who wish to learn about reliable publishing environments and tools.

Prerequisites

This course is for everyone who relies on a computer to perform data analysis. You should have some experience with running commands in a terminal, and a basic knowledge of Git (at the level of the first MOOC) and scientific Python.

Assessment and certification

Assessment is based on quizzes and practical exercises. An Open Badge for successful completion of the course will be issued on request to learners who obtain an overall score of at least 50% correct answers across all the quizzes and learning activities.

Format

This MOOC consists of four independent modules that combine video lectures, quizzes, practical sessions, written course materials, and many exercises for getting hands-on experience with the tools and methods presented.

Most of the exercises can be carried out in a JupyterLab environment made available to each MOOC learner. Some exercises require a Linux computer on which you can install system software.

Course outline

  • Welcome
  • Module Preparing for the MOOC mountain hike to reproducibility
  • Module 1: Managing data
  • Module 2: Managing software
  • Module 3: Managing computations
  • Module 4: Reproducibility in the large
  • Conclusion

Pedagogical team

Authors:

  • Arnaud Legrand, Computer science researcher, CNRS/LIG, Inria, UGA
  • Christophe Pouzat, Neurophysiologist, CNRS/MAP5 Univ. Paris Descartes
  • Konrad Hinsen, Biophysicist, CNRS, Centre de biophysique moléculaire, Soleil

Pedagogical support:

  • Laurence Farhi, learning engineer, Inria Learning Lab
  • Madeline Montigny, learning engineer, Inria Learning Lab
  • Tatiana Khomenko, learning engineer, Inria Learning Lab
  • Benoit Rospart, IT engineer, Inria Learning Lab
  • Nathan Viaud, IT engineer, Inria Learning Lab

Partners

With the support of Fonds national de la science ouverte
