8. Archiving research data#
As part of Research Data Management, we may need or want to archive data for longer periods (5 years, 10 years or more).
We can archive data when we do not expect to work on the data in the near future, and if it is acceptable that accessing the data again may take some time (for example because the data needs to be read back from tape before we can access it again). Archiving data is cheaper than keeping it on spinning disks or SSD memory storage, and thus preferred when possible.
Archival of data generally means to create a private backup of the data that is kept in a known location so that you can get back to your data if needed.
In contrast, publication generally means to create a public archive, so that anybody on the planet has easy access to the data. (Typically, a DOI is assigned which acts like a URL at which the data is published and accessible.)
A common strategy (see Reproducible publications) is to publish relevant data along with each publication (for example, at the time of publication). If a publication (i.e. a public archive) is not possible, the data must at least be archived (in a private archive).
8.1. To archive or not to archive?#
Storage space is neither free nor cheap: we should only archive data that could be useful to us or others at some point in the future. On the other hand, we should aim to preserve all data that may be useful in the future. How to decide which data to keep? We propose the following approach:
8.1.1. First, archive data that needs to be archived#
For certain research activities, we may have to archive the data. The prime example is research data associated with publications.
It is also possible that research (grant) agreements require particular data sets to be archived beyond the run time of a (funded) project.
The Max Planck Society has particular expectations relating to archival of data for their students and staff.
8.1.2. Second, consider additional research data sets for archival#
Before we archive a data set that we are not required to keep for the reasons outlined above, we should consider whether it can truly be useful in the future (either to us or to others):
Is our documentation of the data sufficiently good that we could make sense of it in 2 years' time (for example)?
If we have not managed to analyse the data now and put it into a manuscript, why do we think we would have more time or capacity to do that later?
Would somebody not familiar with the project be able to benefit from the data? (In particular: have we explained what the data represents in sufficient detail?)
How unique or valuable is the data? Data that is rare, costly, or difficult to obtain or reproduce should be prioritized for archiving. Data that is easily available or replicable may not need to be archived.
Important and potentially useful data should be kept, but where it is very unlikely that (further) scientific advances can be made, deleting the data might be a reasonable way forward: this will free resources that can be used for other data sets that may have a higher chance of creating impact.
Long-term storage of data sets—be it for analysis or archival—requires staff time, electricity, hardware, hardware maintenance, refreshing of old tapes, etc. This creates cost, the amount of which is not always visible or known to the researcher.
It is not unusual for an experiment to acquire significant amounts of data, of which, for example, 20% are used in publications. The question raised in this section is: should we archive the other 80% just in case they contain useful data? There is no generic answer to this, but it might be useful to raise the question, and discuss it between scientists and infrastructure (storage) experts.
8.2. Archival services#
8.2.1. Data archival services for Max Planck researchers#
Max Planck Digital Library: data sets up to 500GB, using Keeper for archival (see Keeper). Recommended for data sets below 500GB.
GWDG Archival service: no space limit. Recommended for larger data sets. See the specific section in this document for further details.
Max Planck Compute & Data Facility (MPCDF): no space limit. Expected file sizes between 1GB and 1TB. Recommended for data that is already stored at the MPCDF.
If you want to create a (public) archive of data related to a publication, consider “archiving” it through “publishing” the data:
8.2.2. Data publication services: research data repositories#
Edmond – open research data repository; offers publication (and implicitly archival) of data sets (up to the order of TB), and provides a DOI. (Submission is restricted to Max Planck researchers.)
Zenodo offers publication (and implicitly archival) of small data sets (up to 50GB without special requests), and provides a DOI for such submissions.
The REgistry of REsearch data REpositories (re3data.org) is a registry of domain-specific research data repositories.
See also Reproducible publications.
8.3. Add meta data to document the data#
A significant challenge is to document the data.
This includes a description of the format, the meaning of the data, and any assumptions made in the capturing or processing of the data.
If software was used to create the data, or if software is required to read the data, then the software should be included in the archive, or at the very least a reference to the software repository and the version used must be included. Ideally, a computer-executable script (or at least human-readable instructions) is included that explains how to install the required software and its dependencies.
Any information that would be required to (re-)use the data in the future should be included: it should be possible for others to extract, inspect and use the data in the future, without having to consult you or your co-workers to request such information. Any assumptions made or limitations of the data should also be mentioned.
This type of information is part of the meta data. It is required to explain and understand the data.
The use of some Domain specific file formats can greatly simplify the documentation of data, as, in the ideal case, the metadata is embedded in the data file format automatically.
The metadata should be stored together with the actual data in the archive.
The top level directory of the data set is a place where one would typically place such documentation, for example in files such as readme.txt or documentation.pdf.
8.4. Practical aspects of storing data for long-term archival#
8.4.1. Data storage hardware#
A common model for archiving data is that the users transfer their files into their home directory or a special archive directory on a (Linux) archive host, where they are stored on hard disks (or solid state storage). From there, the files will be copied onto tape (at a point in time that the archival system chooses), and (at some time) after that the copy on disk may be removed.
If the user needs to access the archived data again, they can log on to that archive host and request the data. In the simplest case, this is possible by copying the data to another place, or just by attempting to read the data. At that point, a request will be queued for the data to be read back from the tape, and to be made available on the disk. It can take hours or longer for that request to be fulfilled.
8.4.2. Converting the data set into an archive file for archival#
We assume that the data set is gathered within a directory, which could be called dataset, for example. This subdirectory dataset may contain files and subdirectories, which in turn may contain more subdirectories and files.
We assume the data is documented, and that the documentation is part of the dataset subdirectory.
Before we can archive this data set, we need to convert the set of files into an archive file, such as archive.zip, archive.7z or archive.tar.gz.
This has two technical advantages: (i) the whole data set then appears as one file. Archival systems prefer few and large files over many small files. And (ii) our data sets can be compressed in the process. (Note, however, that some archival systems will by default compress any data they receive — in that case, we do not necessarily need the compression here.)
Suitable programs that can create such archive files include zip and tar (see Using zip to create archive files and below).
8.4.3. Table of contents for archive file#
Once the data is on tape, it may take a long time (could be up to days) before it can be played back from tape onto a (disk-based) system where it can be used and explored. It may not be possible to (effectively) extract/view or download just one small file: it may still be required to restore the whole archive first. There is thus potentially high latency in the data access.
For this reason, it is useful to keep a short table-of-contents file per archived data set in a separate location (and on disk) which can be easily accessed, and which has a link to the location of the archived data set. This table-of-contents file can then be easily (and in particular with no noticeable latency) opened and examined; for example to search all table-of-contents files for a particular file or data set.
Such a table-of-contents file should be easily readable in a format that is future-proof. Suggestions here are plain text files (i.e. not binary and not proprietary formats such as MS Word). See Checksums to check data consistency (after upload to archive).
A number of markup formats have emerged which can (but do not have to) be used to provide additional structure in such files. Examples are Markdown, reStructuredText, and Org mode (in increasing order of complexity and power).
The very minimum that the table-of-contents.txt file should contain is:
the list of files in the archive file (together with their file size)
As a recommendation for the table-of-contents file, one should also include
title of the data set
list of authors (including affiliations)
a preferred contact
description of the data (this may include pointers to more detailed documents in the archive, see Add meta data to document the data)
link to experiment (if appropriate)
publication(s) originating from the data set (a DOI per publication is useful)
link to further data sets related to this data (if appropriate)
list of funding bodies who should be acknowledged if the data is re-used (if appropriate)
if the data will be made public:
a publication reference that users of the data are asked to cite
a license for the use of the data by others
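As an illustration, here is a minimal sketch of how the top of such a table-of-contents.txt file could look before the file listing is appended; all names and values below are hypothetical placeholders:
Title: Example measurement series, project XYZ
Authors: A. Example (Institute A), B. Example (Institute B)
Contact: a.example@example.org
Description: raw and processed data; see documentation.pdf inside the archive for details
Publication: doi:10.xxxx/example
Archive location: transfer.gwdg.de:/usr/users/a/USERNAME/2021-physrevletters-90-12222.zip
List of files: (output of zipinfo, see Practical creation of the table-of-contents file list)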
8.4.4. Practical creation of the table-of-contents file list#
To list all files with their size in a given zip file, we can use
zipinfo archive.zip
Note that this will require reading the file, and if the file is on tape, it will have to be retrieved before the command can be completed. One should thus run this command before the data is archived, and keep the output of the command somewhere safe and accessible to provide a catalogue of archived data files.
If we are using tar, a corresponding command for a compressed archive with name archive.tar.gz would be
tar tfvz archive.tar.gz
With 7zip, we can use:
7z l archive.7z
If the data is not yet converted into an archive file, we can create the list of files in the subdirectory SUBDIRNAME using a command (on Linux and OSX) such as
ls -R -l SUBDIRNAME
The ls command lists files in a subdirectory. The additional option -R requests to list Recursively all files in all subdirectories. The option -l requests the Long file format, which includes the size of each file.
To convert the output from running these commands into a file, we can redirect it, for example
zipinfo archive.zip > list-of-files.txt
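The corresponding redirections for the other archive formats shown above would be, for example:
tar tfvz archive.tar.gz > list-of-files.txt
7z l archive.7z > list-of-files.txt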
8.4.5. Checksums to check data consistency (after upload to archive)#
Checksums are small blocks of data that are computed from a large set of data. Checksum calculation algorithms are designed so that a change in the large set of data will result in a different checksum. Checksums can thus be used to detect (accidental or malicious) changes in the data.
When we create an archive using zip, a checksum per file is created automatically, and stored with the archive. Using the command
unzip -t archive.zip
the checksums are re-computed, and compared with the checksums stored in the file. Any deviation will be highlighted.
It is recommended to check the checksums at the archival site after transferring large amounts of data: a random bit flip is unlikely to occur, but if the amount of data is significant, the probability for such a bit flip increases.
Good archival systems will do checksum tests internally and automatically once the data has arrived on their site, but it is the responsibility of the user to make sure the transmitted data arrives correctly.
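As a format-independent example of such a manual check, one can record a checksum before the transfer and verify it after the transfer. The following is a minimal sketch, assuming a Linux system where the sha256sum tool (GNU coreutils) is available; on OSX, shasum -a 256 can be used instead. On the local machine, before the transfer:
sha256sum archive.zip > archive.zip.sha256
Then transfer both archive.zip and archive.zip.sha256 to the archive host (for example with the rsync commands shown below). After the transfer, on the archive host, in the directory containing both files:
sha256sum -c archive.zip.sha256
This recomputes the checksum of the transferred archive.zip and reports OK if it matches the recorded value.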
8.4.6. Using zip to create archive files#
This section describes the most essential steps of creating (large) archive files using the zip tool.
To convert files in a subdirectory SUBDIRNAME into a zipped archive, we can use the command
zip -r archive.zip SUBDIRNAME
If we do not need or want compression to be used, we can use the -0 switch:
zip -r -0 archive.zip SUBDIRNAME
To unpack files from a given zip file, we can use
unzip archive.zip
To unpack one file (such as README.txt) from a given zip file, we can use
unzip archive.zip SUBDIRNAME/README.txt
To compute checksums (to detect corruption of the archive file):
unzip -t archive.zip
To create a list of files in the archive:
zipinfo archive.zip
8.4.7. Using 7zip to create archive files#
7-zip is another archiving program that can achieve better compression ratios than zip. It is an alternative to using zip.
p7zip is a port of 7-zip for POSIX systems (such as Linux and OSX).
Here are some examples for the (command line) usage:
To Add files to the archive file archive.7z from the subdirectory SUBDIRNAME which contains my data set:
7z a archive.7z SUBDIRNAME
(If the archive.7z file doesn't exist yet, it will be created. If it does exist, the SUBDIRNAME directory and its contents will be added to the file.)
Print a List of files in the archive:
7z l archive.7z
To extract all files from the archive:
7z x archive.7z
To extract a single file – for example README.txt – from the archive, use
7z x archive.7z SUBDIRNAME/README.txt
To Test the integrity of the archive, use
7z t archive.7z
One would expect that this operation extracts each file in the archive, computes a checksum for the file, and compares it with the checksum that was stored when the archive was created. We could not find a clear confirmation of this, although this seems to be the common view.
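If stronger compression is desired (at the cost of longer run time and higher memory use), 7z accepts a compression-level switch; for example, -mx=9 selects the maximum level:
7z a -mx=9 archive.7z SUBDIRNAME
(If the switch is omitted, the default level is used; consult the 7-zip documentation for the available levels.)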
8.4.8. Using tar to create archive files#
tar is a popular tool on Linux/Unix. It has no built-in mechanism to detect data corruption, and extracting a single file is not possible without reading through the whole archive. You may wish to consider using zip instead (see Using zip to create archive files).
To convert files in a subdirectory SUBDIRNAME into a tarred archive, we can use the command
tar cf archive.tar SUBDIRNAME
This does not compress the files. To apply compression, we can either gzip the tar file:
gzip archive.tar
which will convert the file into archive.tar.gz. (Other compression tools could be used, such as bzip2.)
If the data set is large, it is better to carry out the tarring and compression at the same time (to avoid having two uncompressed copies of the data on disk at the same time):
tar cfz archive.tar.gz SUBDIRNAME
To unpack files from a given tar.gz file, we can use
tar xfz archive.tar.gz
To create a list of files in the tar.gz archive:
tar tfvz archive.tar.gz
To check against file corruption, we can compute a hash manually (before we transfer the file, and afterwards), and check that the two hashes agree (see Checksums to check data consistency (after upload to archive)). If you don't know how to do this, it is best to use zip (see Using zip to create archive files).
8.4.9. Use of tmux/screen for long-running (copy) sessions#
Transferring data across the network can take a long time (relevant order of magnitude is Terabytes per day). See below for example rsync commands to facilitate this.
If you need to ssh to an HPC machine in order to copy your data from there to an archive, it is recommended to do this inside a tmux session (or the older screen program if you know screen already).
Assuming your data to be archived is located on machine X, we suggest:
ssh (from your laptop) to machine X
start tmux by typing tmux
enter the relevant copy command (for example rsync ..., see below) to copy the data from machine X to an archive machine
detach from and attach to the tmux session as needed (see below)
The advantage of this approach is that you can detach from the tmux session (Control-B d), and it will continue running. In particular, if your ssh connection (from your laptop to machine X) breaks, the copy command will continue to run inside the tmux session. You can ssh to machine X again, and reconnect (“attach”) to the tmux session (tmux attach) any time you like.
For more details on tmux, please check tmux/tmux
8.5. The GWDG Archival service#
8.5.1. General information#
Please study the up-to-date instructions on the home page: GWDG Archival service
Additional information:
Preferred size of archive.zip or archive.tar.gz is between 1 and 4 TB.
Transfer of data from MPSD to GWDG is expected to be possible at a rate of approximately 30 MByte/sec, corresponding to 2.5 TB per day. If the observed rate is well below this, it should be investigated.
The archive files will be moved to tape but the directory structure and filenames are online and can be browsed. (But no stub files are available, i.e. one cannot peek into a file.)
Archive files should be compressed when uploaded (i.e. the system does not attempt to compress files when moving to tape).
Two copies on tape are stored in separate locations.
8.5.2. Transfer of files to GWDG Archive from Linux/OSX#
Prerequisites: You need to have deposited your public ssh key at the GWDG.
We assume we have a file archive.zip on our local Linux or OSX machine, and want to transfer this to the archive service of the GWDG.
Step 1: find the archive location.
SSH to the machine recommended by the GWDG:
ssh USERNAME@transfer.gwdg.de
Once logged in, use
echo $AHOME
to display the location of your personal archive. Here is an example where USERNAME is replaced by hfangoh:
hfangoh@gwdu20:~> echo $AHOME
/usr/users/a/hfangoh
This means we need to copy our archive.zip file to the machine transfer.gwdg.de in the location /usr/users/a/USERNAME.
Step 2: copy our archive file to the GWDG archive
A good command to do this from the command line is rsync:
rsync --progress -e ssh --partial archive.zip USERNAME@transfer.gwdg.de:/usr/users/a/USERNAME
This will:
--progress: display a progress bar (optional)
-e ssh: use ssh (mandatory)
--partial: allow the transfer to be continued if it is interrupted for some reason; recommended for larger files
Step 3: check that the transfer has been successful
ssh USERNAME@transfer.gwdg.de
cd $AHOME
unzip -t archive.zip
In practice, one should choose a more descriptive name instead of archive.zip, for example 2021-physrevletters-90-12222.zip, to refer to a data set associated with a publication in Physical Review Letters, volume 90, page 12222, published in 2021. Over the years, many such archive files may accumulate in the same directory, and with the suggested naming convention (or a similar one), it will be easy to associate them with the publication.
It is possible to create subdirectories in the archive home $AHOME if you wish to structure your collection of archive files differently. However, the general guideline is to deposit few and large files (see above).
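As a hypothetical sketch combining both suggestions (the subdirectory name publications is made up):
ssh USERNAME@transfer.gwdg.de mkdir -p /usr/users/a/USERNAME/publications
rsync --progress -e ssh --partial 2021-physrevletters-90-12222.zip USERNAME@transfer.gwdg.de:/usr/users/a/USERNAME/publications/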
8.5.3. Transfer of files to GWDG Archive from Windows#
If you have rsync installed on your Windows machine, you should be able to use it as described in the section for Linux/OSX. (It can probably be installed via Windows Subsystem for Linux, Cygwin, Chocolatey, …)
An alternative is to use secure FTP (sftp); GUI-based sftp clients include PuTTY, WinSCP and Cyberduck. However, for large files, the rsync method is better: it can continue an interrupted transfer (because the network dropped, say), whereas sftp would have to restart the transfer.
8.6. The MPCDF Archival service#
For the MPCDF archive, the up-to-date and detailed instructions are at MPCDF Archival service. (If you need an introduction, the section The GWDG Archival service provides more details for the GWDG archival service than we provide here for the MPCDF archive service.)
Key information for MPCDF archive (as of early 2024):
archive server is archive.mpcdf.mpg.de
all data within users’ HOME directories on archive.mpcdf.mpg.de will automatically be archived to tape.
example command to copy archive.zip from the local machine to the MPCDF archive:
rsync --progress -e ssh --partial archive.zip USERNAME@archive.mpcdf.mpg.de:
to organise backups into subdirectories, you can ssh to archive.mpcdf.mpg.de and create the required directories, before using rsync to move the data there (see the sketch below).
as usual for archives, avoid small and very large files (1GB to 1TB is ideal); see MPCDF Archival service for details.
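A hypothetical sketch of the subdirectory approach (the directory name project-xyz is made up; USERNAME is a placeholder):
ssh USERNAME@archive.mpcdf.mpg.de mkdir -p project-xyz
rsync --progress -e ssh --partial archive.zip USERNAME@archive.mpcdf.mpg.de:project-xyz/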