Data management
In QUAST, the management of scientific data will meet the requirements of the DFG guidelines on good scientific practice and on the handling of research data. All involved institutions have access to large-scale data facilities that ensure reliable long-term data storage. We commit ourselves to storing all data related to scientific publications in publicly accessible data repositories. This includes not only the experimental data and the numerical output of theoretical simulations, but also the original code needed to reproduce the calculations. All codes developed by the QUAST consortium will be shared through Git repositories.
Standardized data formats pose a particular challenge since, for the data objects in our field, they are largely absent to date. Hence, central to QUAST is our effort to develop standardized data objects for the modelling of spatio-temporal electron correlations in real materials, in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable). Many of these ideas agree with the efforts of the National Research Data Infrastructure (NFDI) consortium initiative FAIRmat (FAIR Dateninfrastruktur für die Materialwissenschaften), led by Claudia Draxl (HU Berlin), Matthias Scheffler (Fritz Haber Institute of the MPG and FU Berlin) and Mark Greiner (MPI for Chemical Energy Conversion, Mülheim an der Ruhr). Several of the PIs within QUAST are members of the FAIRmat consortium, and if FAIRmat is funded we will coordinate our data management and code sharing with FAIRmat.
Defining standardized data structures that connect several research groups calculating the same material properties with different methods, implementations, or codes is by no means a trivial task. We can exemplify this for the time- and position-dependent generalized n-point Green’s function G(x₁,t₁;...;xₙ,tₙ), a central object of QUAST. These functions assign a real or complex number or matrix to each set of time and position coordinates. Depending on the specific implementation, one works in coordinate, Wannier, momentum or Bloch-wave space. One can further distinguish between calculations in the time or frequency domain, with either real or imaginary times or frequencies, and with either continuous or discrete variables. The transformation between these mathematically equivalent representations of the same object involves, in some cases, a mathematically ill-posed problem, for example the analytic continuation from imaginary to real frequencies. Due to the finite accuracy of the representation of our data, an otherwise exact transformation from one representation to another and back again can lead to relevant differences between the original and the doubly transformed data. To deal with these often crucial subtleties of the different methods, a multitude of data formats has emerged across research groups. The data are not only stored as a list of complex numbers on a grid: further information may be added (such as moments, or the high-frequency or long-time behaviour), the functions may be expanded on a basis set, or a singular value decomposition may be used in an intermediate representation to save memory. Each of these formats comes with its own merits and drawbacks.
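As a minimal sketch of why additional information such as high-frequency moments is stored alongside the bare numbers, consider the one-point function of a single non-interacting level, G(iωₙ) = 1/(iωₙ − ε). This model, the parameter values and all function names below are assumptions chosen for illustration only and are not part of any QUAST code. Transforming a truncated list of Matsubara frequencies to imaginary time converges slowly, whereas treating the known 1/(iωₙ) tail analytically recovers the exact result to much higher accuracy from the same stored data.

```python
import cmath
import math


def matsubara_frequency(n, beta):
    """Fermionic Matsubara frequency w_n = (2n + 1) * pi / beta."""
    return (2 * n + 1) * math.pi / beta


def g_iw(n, beta, eps):
    """Green's function of a single level at energy eps (illustrative model)."""
    return 1.0 / (1j * matsubara_frequency(n, beta) - eps)


def g_tau_exact(tau, beta, eps):
    """Exact imaginary-time result for 0 < tau < beta."""
    return -math.exp(-eps * tau) / (1.0 + math.exp(-beta * eps))


def g_tau_naive(tau, beta, eps, n_max):
    """Fourier sum over the stored frequencies only (symmetric cutoff)."""
    total = 0j
    for n in range(-n_max, n_max):
        w = matsubara_frequency(n, beta)
        total += cmath.exp(-1j * w * tau) * g_iw(n, beta, eps)
    return (total / beta).real


def g_tau_with_tail(tau, beta, eps, n_max):
    """Same sum, but the 1/(iw) tail is subtracted and transformed exactly.

    The transform of 1/(iw) is -1/2 for 0 < tau < beta, so only the
    rapidly decaying remainder is summed numerically.
    """
    total = 0j
    for n in range(-n_max, n_max):
        w = matsubara_frequency(n, beta)
        total += cmath.exp(-1j * w * tau) * (g_iw(n, beta, eps) - 1.0 / (1j * w))
    return (total / beta).real - 0.5


if __name__ == "__main__":
    beta, eps, n_max = 10.0, 0.5, 50  # assumed example parameters
    tau = beta / 2
    exact = g_tau_exact(tau, beta, eps)
    print("error without tail:", abs(g_tau_naive(tau, beta, eps, n_max) - exact))
    print("error with tail:   ", abs(g_tau_with_tail(tau, beta, eps, n_max) - exact))
```

A data format that records only the values on the frequency grid therefore carries less usable information than one that also records the leading moment, even though both describe the same function.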
Despite this heterogeneous landscape of data formats, our community has extensive experience in the exchange and reuse of data between different groups and codes. Many of the calculations for real materials use model Hamiltonians with parameters obtained from other calculations, typically density functional theory. To make this possible, a multitude of interfaces between density functional theory codes and many-body codes have been developed. Within QUAST, new codes will be developed to allow our research unit to model spatio-temporal electron correlations in real materials. To do this efficiently and cooperatively, we need common and well-documented data formats that are flexible enough to facilitate communication between different groups without hindering each group in using the local representation of the data objects it needs for efficient computations.
To facilitate the reusability of our generated data, we will create and publish, using Git, a code library under a Creative Commons Attribution License. Within this library we will define common data structures and methods implementing the transformations between these structures. Each research group can either contribute its own data structure to the library and add methods to transform it to the other shared representations, or use one of the common structures in its code. We will define common data formats for the physical quantities of interest, such as the n-point Green’s function. Every project is committed to developing, sharing, and keeping up to date conversion routines between the project’s data format and the common data format. This will give every other project and the general public access to every project’s data.
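The structure of such a library can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: the class GreensFunctionData, the functions register_converter and convert, and the representation labels "matsubara_index" and "matsubara_frequency" are all hypothetical names. The idea shown is that each group registers a conversion routine between its local representation and a shared one, and every other code requests the representation it needs.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class GreensFunctionData:
    """Hypothetical common container: a labelled grid with complex values."""
    representation: str
    grid: Tuple[float, ...]
    values: Tuple[complex, ...]
    metadata: Dict[str, float] = field(default_factory=dict)


Converter = Callable[[GreensFunctionData], GreensFunctionData]

# Registry of conversion routines, keyed by (source, target) representation.
_CONVERTERS: Dict[Tuple[str, str], Converter] = {}


def register_converter(source: str, target: str):
    """Decorator with which a group contributes a conversion routine."""
    def decorator(func: Converter) -> Converter:
        _CONVERTERS[(source, target)] = func
        return func
    return decorator


def convert(data: GreensFunctionData, target: str) -> GreensFunctionData:
    """Return the data in the requested representation, if a route is known."""
    if data.representation == target:
        return data
    key = (data.representation, target)
    if key not in _CONVERTERS:
        raise ValueError(f"no converter registered for {key[0]} -> {key[1]}")
    return _CONVERTERS[key](data)


@register_converter("matsubara_index", "matsubara_frequency")
def index_to_frequency(data: GreensFunctionData) -> GreensFunctionData:
    """Example routine: replace integer indices n by w_n = (2n + 1) * pi / beta."""
    beta = data.metadata["beta"]
    grid = tuple((2 * n + 1) * math.pi / beta for n in data.grid)
    return GreensFunctionData("matsubara_frequency", grid, data.values,
                              dict(data.metadata))


if __name__ == "__main__":
    local = GreensFunctionData("matsubara_index", (0, 1), (-0.5j, -0.2j),
                               {"beta": 2.0})
    print(convert(local, "matsubara_frequency").grid)
```

Keeping the converters in a registry rather than inside the container means a group can add its own representation without modifying the shared data structure.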
Within these file formats we will include sufficient metadata to ensure that the files remain readable in future settings as well. Several of the data objects used in different codes, together with methods to convert between them, have already been implemented in the code QUANTY, which is developed by PI Haverkort, who will act as QUAST’s data management coordinator. With minor modifications, these implementations can serve as the starting point for the shared library.
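What self-describing metadata could look like is sketched below. The choice of JSON, the format name "quast-greens-function", the version string and the metadata keys are assumptions made for this illustration; the actual file format and metadata schema remain to be defined within the consortium. The point is that the file states its own format, version and physical conventions, so a future reader does not need the originating code to interpret it.

```python
import json

FORMAT_NAME = "quast-greens-function"  # hypothetical format identifier
FORMAT_VERSION = "0.1"                 # hypothetical version string


def to_json(grid, values, metadata):
    """Serialize grid, complex values and metadata into a self-describing text."""
    document = {
        "format": FORMAT_NAME,
        "format_version": FORMAT_VERSION,
        "metadata": metadata,
        "grid": list(grid),
        # JSON has no complex type, so each value is stored as [real, imag].
        "values": [[complex(v).real, complex(v).imag] for v in values],
    }
    return json.dumps(document, indent=2, sort_keys=True)


def from_json(text):
    """Read the file back, refusing input that does not declare the format."""
    document = json.loads(text)
    if document.get("format") != FORMAT_NAME:
        raise ValueError("input does not declare the expected format")
    values = [complex(re, im) for re, im in document["values"]]
    return document["grid"], values, document["metadata"]


if __name__ == "__main__":
    metadata = {
        "quantity": "one-point Green's function",
        "representation": "matsubara_frequency",
        "energy_unit": "eV",
        "beta": 10.0,
    }
    text = to_json([0.314, 0.942], [0.5 - 0.25j, 0.1 - 0.7j], metadata)
    print(text)
```

Reading the text back with from_json returns the grid, the complex values and the metadata dictionary, and raises a ValueError for input that does not declare the expected format.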