5. Research Data Management
5.1. What is research data management?
Research data management covers storing data so that its owner can actively work on it, sharing the data with collaborators, and the long-term archiving and publication of (important parts of) the data.
The organisation and documentation of the data is also part of research data management. For published data sets, which are typically published together with a manuscript (see Reproducible publications), the documentation should be of such a standard that readers of the manuscript can understand and re-use the data.
Research data include experimentally captured data sets, processed data, and data computed through computer simulation. The software involved in these steps and the metadata that describes the data are also research data.
Research data is created in virtually any research activity.
Read this section if you do not know how to store all the data you generate or need to analyse.
5.2. Research data life cycle
For most research projects, there are different stages of working with the relevant research data:
5.2.1. Stage 1: data capture
Initially, data is captured (for example through experiment or computer simulation) and iteratively refined: output from a first analysis of the data may guide the experiment or the choice of simulation parameters that lead to the next recorded data set.
5.2.2. Stage 2: data analysis towards publication
This data capture phase is followed by identification of the most interesting parts of the data, and then an in-depth data analysis of those data sets leading (in many cases) to a publication.
Stages 1 and 2 may merge into each other, for example if the experiment or simulation is available without time limitation and further data capture can be carried out following the analysis.
During stages 1 and 2, we need access to the data sets with low latency. Typically, such data is stored “on disk” (in particular, not on tape).
5.2.3. Stage 3: post publication and data archival
Ideally, the data set is published together with the publication (see Reproducible publications and Edmond – open research data repository), and such a data set publication automatically ensures that an archive of the data exists.
If the data set is not published for some reason, it needs to be archived nevertheless (see Archiving research data). The Rules of conduct for good scientific practice (Section 2.4) mention a retention period of 10 years for research data (as of February 2023).
For economic reasons, such archived data is commonly stored on tape. This means access to the data is slow: it may take minutes, hours or days to get the data back from tape into a system where it can be read and re-processed.
5.3. Research data associated with a publication
Please see Reproducible publications.
5.4. Dealing with data files
The MPDL (Max Planck Digital Library) summarises some guidance on naming and handling files. Small and important files should be kept under Version control.
During data capture and data analysis, it is desirable to have backups of the data files to be ready to respond to unexpected hardware failure or user error (such as accidentally deleting important data files). Depending on the file size, this may or may not be possible. For simulation studies on High Performance Computing installations, see the recommendations at the end of Storage and quotas.
To keep a backup of local data in the cloud, MPG researchers can use the GWDG ownCloud service for small data sets; by default, this provides 50 GB of storage. The Keeper tool provides 1 TB of storage space for MPG researchers and their collaborators and can also be used for cloud backups.
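Whichever backup location is used, it is worth checking that the copy actually matches the original. The following is a minimal sketch of one way to do this with SHA-256 checksums from the Python standard library. The function names (`sha256_of`, `build_manifest`, `compare`) are hypothetical and chosen for illustration only; the sketch assumes both the original and the backup are reachable as local directories.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(original: Path, backup: Path) -> list[str]:
    """List relative paths that are missing from, or differ in, the backup."""
    source = build_manifest(original)
    copy = build_manifest(backup)
    return [name for name, checksum in source.items() if copy.get(name) != checksum]
```

An empty list from `compare(original_dir, backup_dir)` suggests the backup is complete and unmodified. Files are read in chunks so that large data files do not need to fit into memory.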
5.4.1. Publishing of data sets
For the publication of any data set (see Reproducible publications), data sets of the order of terabytes in size can be deposited with Edmond (see Edmond – open research data repository), the Max Planck Society’s research data repository.
Note that data sets published with Edmond are automatically archived.
5.4.2. Archival of data sets
5.5. File formats
5.5.1. General guidance
The MPDL (Max Planck Digital Library) summarises some guidance on file formats.
A lot of thought and planning can go into choosing a suitable file format to keep the data of a research project. In the rare case of starting a research project where there are no existing standard or legacy file formats or conventions, one should choose file formats that are feasible and convenient to access in the future.
There are a number of metrics that can be optimised in choosing such a suitable file format. These include: simplicity, human-readability, compression, write speed, read speed, accessibility on different platforms, open documentation, widespread adoption in the community, number of files and open source. Please seek advice if desired.
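Some of these metrics pull in different directions. As a small illustration of the trade-off between human-readability and file size, the sketch below encodes the same values as plain text, as gzip-compressed text, and as raw binary, and reports the size of each. The function name `encoded_sizes` is hypothetical, the values are random numbers rather than real measurements, and the outcome will differ for other kinds of data.

```python
import gzip
import random
from array import array


def encoded_sizes(values: list[float]) -> dict[str, int]:
    """Return the size in bytes of the same values in three encodings."""
    # One value per line, readable in any text editor.
    text = "\n".join(repr(v) for v in values).encode("ascii")
    # 8 bytes per value as double precision floats, not human-readable.
    binary = array("d", values).tobytes()
    return {
        "text": len(text),
        "text+gzip": len(gzip.compress(text)),
        "binary": len(binary),
    }


random.seed(0)  # fixed seed so that repeated runs use the same values
samples = [random.random() for _ in range(10_000)]
for name, size in encoded_sizes(samples).items():
    print(f"{name:>10}: {size} bytes")
```

Size is only one of the listed metrics; a compact binary file is of little use in the future if its layout is undocumented.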
In addition to storing data in the files, we also need to document the meaning of the stored data (i.e. provide metadata to make the data interpretable).
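One simple way to do this, when no established format is available, is to store a small metadata file next to each data file. The sketch below writes a CSV file together with a JSON "sidecar" file that describes the columns and their units. The function name `save_with_metadata`, the `.meta.json` naming convention, the column names, and the values are all assumptions made up for this illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_with_metadata(path: Path, rows: list[dict], metadata: dict) -> Path:
    """Write rows to a CSV file and the metadata to '<name>.meta.json'."""
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    sidecar = path.with_suffix(".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar


# Made-up example values, for illustration only.
rows = [
    {"time_s": 0.0, "temperature_K": 293.1},
    {"time_s": 1.0, "temperature_K": 293.4},
]
metadata = {
    "description": "Sample temperature during warm-up (made-up example)",
    "columns": {
        "time_s": "time since start of measurement, in seconds",
        "temperature_K": "sample temperature, in kelvin",
    },
}

with tempfile.TemporaryDirectory() as tmp:
    sidecar = save_with_metadata(Path(tmp) / "run_001.csv", rows, metadata)
    print(sidecar.name)
```

The drawback of this approach is that data and metadata live in separate files and can become separated or inconsistent.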
5.5.2. Domain specific file formats
Some communities have created domain-specific data file formats that are self-describing and, through this design, embed the metadata automatically.
For example:
NeXus - common data format for neutron, x-ray and muon science (https://www.nexusformat.org)
openPMD - open standard for particle-mesh data files (https://www.hzdr.de/publications/Publ-27962, https://www.openpmd.org)
If you can use such a self-describing file format (or develop your own), this simplifies creating a meaningful archive of the data.
It also provides other benefits: all data sets using the same file format can be processed by data analysis tools that support the file format.
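To make the idea concrete, here is a toy sketch of a self-describing file: values and their metadata are stored together in one JSON file, and a generic tool can then process any file that follows the same convention. This is not NeXus or openPMD, and it does not use their libraries; the functions `write_dataset` and `summarise` and the field names are hypothetical stand-ins for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_dataset(path: Path, name: str, values: list[float],
                  units: str, description: str) -> None:
    """Store the values and their metadata together in one file."""
    record = {
        "name": name,
        "units": units,
        "description": description,
        "values": values,
    }
    path.write_text(json.dumps(record))


def summarise(path: Path) -> str:
    """Generic tool: works for any file that follows the convention above."""
    record = json.loads(path.read_text())
    values = record["values"]
    mean = sum(values) / len(values)
    return f"{record['name']}: mean {mean:g} {record['units']} over {len(values)} values"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "field.json"
    # Made-up values, for illustration only.
    write_dataset(target, "field", [1.0, 2.0, 3.0], "T", "applied magnetic field")
    print(summarise(target))
```

Because the units and description travel with the values, `summarise` needs no knowledge of the experiment that produced the file. Established community formats provide the same benefit with far richer structure and existing tool support.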