6. Reproducible publications#

6.1. Introduction to reproducibility#

The concept of reproducibility is core to our scientific mission: a result (typically announced through a publication) is reproducible if another team can re-create that result, based on the description given in the publication.

Increasingly, funding bodies, research organisations (including the Max Planck Society) and publishers expect scientists to make their publications fully reproducible.

To make our results reproducible, it should in principle be enough to “just” record all the steps we have carried out to get there. The traditional logbook may thus be sufficient.

Increasingly, however, we use computer-assisted data processing and analysis of experimental data, or computer simulation to create the data in the first place.

As soon as such computation is involved, new challenges arise for reproducibility. For example: when using a post-processing script (perhaps written in Python), we need to archive the script with the raw data to make the computation of the post-processed data reproducible. Furthermore, it would be desirable to know which additional Python libraries (and which versions of them) need to be installed to execute the script. We also need to record which part of our (potentially large) data set the Python script was working on.

The quest to answer the question “How can we make (computational) science more easily reproducible?” is an active research topic in its own right. (Here are some slides (video) from an introductory seminar on reproducibility from December 2022.)

Here, we try to provide some guidance that has proven useful. We also note that making a publication more reproducible is better than not making it reproducible at all, even if the result is not perfect.

6.2. How to make publications reproducible?#

Recommendations:

  1. Track and keep all primary files: source code, configuration files, post-processing scripts, plotting scripts.

  2. Record the protocol: how the simulation was configured and called, how the data was processed and analysed, and how the results were plotted.

  3. Record the software environment: which software (and which versions) was used.

  4. For studies involving experimental data or HPC calculations that are hard to repeat: keep the important data.

  5. Publish all information and data together with the publication. [If publication of the data is impossible, archive the data.]

We address each of the points in more detail:

6.2.1. Track and keep primary files#

The recommendation is to use version control software to keep track of source code, config files, post-processing scripts, plotting scripts etc.

The most widely used tool at the moment is git. See also Version control.

6.2.2. Record the protocol#

The key to recording the protocol of (the computational part of) our scientific study is to automate the data creation/processing/plotting: if we can write a script (i.e. a computer program) that transforms the data into a figure, then that script encapsulates all the required information. (We could attempt the same through manual note keeping, say in a this-is-what-I-did-readme.txt file, but experience shows that it is hard for humans to record every important detail.)
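As an illustration, such a script could be as simple as the following sketch; the file and column names used here are hypothetical, and the details will differ for every study:

# create_figure1.py -- minimal sketch of an automated figure-creation script.
# File and column names (data/measurements.csv, "time", "signal") are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("data/measurements.csv")                    # read raw data
data["signal_norm"] = data["signal"] / data["signal"].max()    # post-processing step

fig, ax = plt.subplots()
ax.plot(data["time"], data["signal_norm"])
ax.set_xlabel("time (s)")
ax.set_ylabel("normalised signal")
fig.savefig("figure1.pdf")                                     # figure used in the publication

Re-running the script re-creates the figure from the raw data; anyone with the script, the data and the software environment can reproduce the result.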

A wide range of tools can be used for this automation of the data analysis: Python/Julia/Perl/Matlab/R/… programs, Makefiles and other dependency engines, which may in turn trigger commands to create plots (with Gnuplot, Xmgrace, Matplotlib, …).

The Jupyter notebook is a format that works well for some scientists to automate the creation of analysis and figures in publications: the notebook integrates (a sequence of) commands to execute (with optional free text annotation, explanation, interpretation) with the result of the commands (such as figures and tables).

Using a Jupyter notebook to create a figure file for the publication from the raw data, and driving the analysis from inside the notebook (before the figure is plotted), makes the figure creation automatic and thus more reproducible.

See Beg et al.: Using Jupyter for reproducible scientific workflows (2021) for a publication on the topic, and lang-m/2022-paper-multiple-bloch-points for an example GitHub repository hosting a number of notebooks that create central figures of the corresponding publication.

6.2.3. Record the software environment#

The software environment defines the software components that have been used in the computational part of our study.

For example: if we use matplotlib and Python to create a plot from a CSV file, then we should at least record the version of Python (such as 3.11) and the version of matplotlib (such as 3.6.3).
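For a Python-based analysis, these versions can be recorded from within the analysis script itself. A minimal sketch (the output file name is hypothetical):

# Minimal sketch: record the versions of Python and matplotlib next to the
# figure they were used to create. The output file name is hypothetical.
import platform
import matplotlib

with open("figure1-software-versions.txt", "w") as f:
    f.write(f"Python     {platform.python_version()}\n")
    f.write(f"matplotlib {matplotlib.__version__}\n")

For a more complete record of a Python environment, commands such as pip freeze or conda env export can list all installed packages together with their versions.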

In principle, it would also be good to record all the dependencies of matplotlib. This becomes even more important for quickly-changing research software. Consider the Octopus software as an example: it depends on other (quickly-changing) numerical libraries (such as libxc, fftw, etc.), and to be able to compile the same (or at least a similar) version of Octopus at a later point, we need to record the versions of all these dependencies as well.

If you are using a tool like Spack to install Octopus, you can run a dedicated command (spack spec octopus) to show the particular configuration, for example (only beginning of output with dependencies on fftw, gsl and libxc shown):

octopus@12.1%gcc@10.3.0~arpack~cgal~cuda~debug~elpa~libvdwxc~libyaml~likwid~metis+mpi~netcdf~nlopt~parmetis~pfft~python~scalapack build_system=autotools arch=linux-debian11-cascadelake
  ^fftw@3.3.10%gcc@10.3.0+mpi~openmp~pfft_patches build_system=autotools precision=double,float arch=linux-debian11-cascadelake
  ^gsl@2.7.1%gcc@10.3.0~external-cblas build_system=autotools arch=linux-debian11-cascadelake
  ^libxc@6.1.0%gcc@10.3.0~cuda+shared build_system=autotools arch=linux-debian11-cascadelake
  ...

If you use a pre-compiled version (on an HPC cluster, for example), you may need to ask your system administrators for help.

The best way of archiving all information about the software environment is to automate the creation of the software environment. Using software build tools such as Spack and EasyBuild can be part of that automation. Ideally, the software environment can be built in a container (such as Docker, Singularity, …): this way everything is specified (i.e. including the Linux distribution, and any compilers, build tools and additional libraries that may be installed through the Linux distribution before the actual research code is compiled).

6.2.4. Keep important data#

The preservation of the primary configuration files, the study protocol, and the software environment may be sufficient to make a study reproducible, for example for work based on computer simulation: if the computation is not demanding, the software can be built/installed automatically, and the necessary runs can be executed automatically and completed in reasonable time, then archiving (and publishing) the data files may not be necessary.

On the other hand, if experimentally gathered data is important, or data has been computed that is computationally expensive to re-compute, the data (or a meaningful subset) should probably be archived (and published).

6.2.5. Publish data with publication#

Ideally, all of the data gathered above is published together with the publication. (This includes the software.) The Max Planck Society provides the Edmond service to publish data sets (Edmond – open research data repository). For a data set deposited with Edmond, a Digital Object Identifier (DOI) is created, which should be cited as a reference in the publication.

If for some reason the data cannot be made public, then it is the responsibility of the authors to archive the data and preserve it (see Archival services). The authors must be able to make the data available on justified request (see for example Nature’s policy on availability of data, materials, code and protocols).

Publishing the data is much preferable to only archiving it: publication fulfils the archival requirements, it provides the greatest possible transparency and supports the reproducibility of the publication, it enables others to re-use the data, and overall it is a better use of the taxpayers' money that (probably) has funded the research.

6.3. MPSD Research data policy#

Note

To be added

6.3.1. Register published and archived research data sets at MPSD#

After a publication has been accepted, you should register the associated research data set(s) with the research data management team (data-management@mpsd.mpg.de) by providing the DOI (or archive location) of the relevant research data sets.

It is possible (and advisable) to also deposit a table of contents of the archive file for each data set with the research data management team.
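Such a table of contents can be generated automatically. A minimal sketch (file names are hypothetical) for a data set archived as a compressed tar file:

# Minimal sketch: write a table of contents (file sizes and paths) for a
# data set archived as a tar file. File names are hypothetical.
import tarfile

with tarfile.open("dataset-for-paper.tar.gz", "r:gz") as archive:
    with open("table-of-contents.txt", "w") as toc:
        for member in archive.getmembers():
            toc.write(f"{member.size:12d}  {member.name}\n")   # size in bytes, path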

6.4. FAQ#

6.4.1. What is the difference between publication and archival of research data?#

  • A published data set is publicly readable. There is generally a DOI provided for published data sets. Here is an example of a research data set https://doi.org/10.17617/3.BPNGXA which has been published on Edmond together with the journal paper.

  • An archived data set is not publicly readable. It is accessible only by the author (and possibly other authorised parties such as co-authors, the librarian of your institute, etc.). If the data is needed, the authorised parties need to retrieve it from the archive and make it available.

    Keeper offers an archival service that can provide a DOI and an associated web page showing some metadata about the archive. If such a DOI exists, it can and should be cited in the publication as a reference to the data. https://doi.org/10.17617/4.5a is an example DOI for a data set archived with Keeper.

6.4.2. What license should I use for my published research data?#

It is important to pick a license to allow others to make use of the data.

  • For data, Creative Commons licenses are often used. From the discussion of licenses from Edmond (https://edmond.mpdl.mpg.de/guides/help.html#Dataset-license, accessed August 2023):

    “Creative Commons licenses were originally created for the legally secure licensing of creative content, not primarily for data. But since version 4, this slightly different nature of data is kept in mind so that CC-BY 4.0 and CC-BY-SA 4.0 are also usable for research data. The advantage of CC licenses is their wide distribution and awareness.

    By licensing your research outputs under CC-BY, your research is openly available, but it is required that others have to give you credit, in the form of a citation, should they use or refer to your research object.”

  • For software, free and permissive licenses such as BSD and MIT encourage wide distribution and awareness. See further alternatives and discussion at https://the-turing-way.netlify.app/reproducible-research/licensing.html

6.5. Further reading#

See also