The SIMCODE-DS project addresses the need for high-resolution simulations at the advent of what is known as the epoch of “Precision Cosmology”. This term refers to the major leap in the accuracy of observational data expected over the next decade (mostly through large galaxy surveys such as the European satellite mission Euclid), which will allow the cosmological model to be tested to percent precision. As a robust interpretation of such high-quality data will require a large number of cosmological simulations, the community will face a serious problem of big-data storage and sharing in the coming years.
The Scientific Challenge
Cosmological simulations are an essential ingredient for the success of the next decade of “Precision Cosmology” observations, which include large and costly space missions such as the Euclid satellite. Since the required precision and the need to test for statistical anomalies, astrophysical contamination, parameter degeneracies, etc. will demand a large number of such simulations, the community is about to face the issue of storing and sharing large amounts of simulated data across a Europe-wide collaboration. In fact, cosmological simulations are getting progressively cheaper as computing power increases, and even with the exquisite accuracy and huge dynamical range required for Precision Cosmology, the main limitation will be data handling rather than computational resources. Moreover, while large simulations can now be run in a relatively short time by taking advantage of highly optimised parallelisation strategies and top-ranked supercomputing facilities, fully exploiting their information content may require years of post-processing work. A typical example is the Millennium Simulation (Springel et al. 2005), which is now more than 10 years old but is still employed for scientific applications.
The present Pilot aims to test possible strategies for making large amounts of simulation data available to the whole cosmological community and for storing the data on a timescale comparable with the duration of collaborations such as Euclid (~10 years). The main idea behind the project is that various types of simulations (differing in size, dynamical range, physical models implemented, astrophysical recipes, etc.) can be safely stored in a central long-term repository and their content made easily available, through metadata and indexing procedures, to the community at large. That community can range from a small group of collaborators to the whole Euclid Consortium (> 1000 people), depending on the specific nature of the stored simulations.
Who benefits and how?
So far we have mostly worked on data production at various supercomputing facilities in Europe and on data ingestion into the dedicated storage space provided for the Pilot on the PICO machine at Cineca. About 50 TB of simulation data, coming from different simulation suites and different supercomputing centres, have been moved to PICO to date. In particular, we have collected data from the Hydra cluster at the RechenZentrum Garching, from the C2PAP cluster at the Leibniz RechenZentrum, from the Sciama cluster at the University of Portsmouth, from the CNAF computing centre in Bologna, and from the former Tier-0 machine Fermi at Cineca.
We are still running new simulations and will keep moving data onto the machine for the whole duration of the project. The current activity on data storage and sharing focuses on devising appropriate ways to pack the data into archive files of a manageable size, so as to allow direct access to the data, and on creating metadata for these archive files. Ideally, this would lead to a dedicated pipeline (written in a shell scripting language) that can be run on various simulation formats and produce the corresponding archive files in a flexible way. This pipeline is under development.
The main goal of the SIMCODE-DS project is to implement a pipeline capable of organising large amounts of cosmological simulation data in an indexed structure, so as to allow easy browsing of the data, a fast path to the relevant portion of the data and, most importantly, a dedicated platform to share and distribute the data to a large community of potential users. Such sharing is already happening in part: several collaborative projects make use of simulation data that are transferred to specific users in different countries for post-processing analysis. Nonetheless, at the moment this is done in the native data format, which means that individual files have to be selected and transferred (normally by the person who produced them). This process should become easier once the archiving pipeline is in place.
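As an illustration of the packing step described above, the following is a minimal sketch in Python rather than in the shell language planned for the actual pipeline. It is not the project's implementation. The function names (`plan_batches`, `pack_simulation`), the archive naming scheme and the layout of the `manifest.json` metadata file are all assumptions made for this example. The sketch groups the files of one simulation directory into tar archives whose total size stays below a chosen limit and records which files ended up in which archive.

```python
import json
import tarfile
from pathlib import Path


def plan_batches(files, max_bytes):
    """Greedily group (path, size) pairs into batches whose total size stays
    at or below max_bytes. A single file larger than max_bytes still gets a
    batch of its own rather than being dropped."""
    batches, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        batches.append(current)
    return batches


def pack_simulation(sim_dir, out_dir, max_bytes):
    """Write one tar archive per batch and a JSON manifest (hypothetical
    layout) listing the members of every archive. out_dir is assumed to
    lie outside sim_dir."""
    sim_dir, out_dir = Path(sim_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(
        (p, p.stat().st_size) for p in sim_dir.rglob("*") if p.is_file()
    )
    manifest = {"simulation": sim_dir.name, "archives": []}
    for i, batch in enumerate(plan_batches(files, max_bytes)):
        name = f"{sim_dir.name}_{i:04d}.tar"
        with tarfile.open(out_dir / name, "w") as tar:
            for path, _ in batch:
                # Store paths relative to the simulation directory.
                tar.add(path, arcname=path.relative_to(sim_dir).as_posix())
        manifest["archives"].append({
            "archive": name,
            "total_bytes": sum(size for _, size in batch),
            "members": [p.relative_to(sim_dir).as_posix() for p, _ in batch],
        })
    (out_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

A call such as `pack_simulation("run_A", "archives", 50 * 1024**3)` would, under these assumptions, produce archives of at most roughly 50 GB each. The greedy grouping keeps files in sorted order, so neighbouring snapshots tend to land in the same archive.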
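To make the idea of an indexed structure with "a fast path to the relevant portion of data" more concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not the project's pipeline. It assumes a hypothetical JSON manifest of the form `{"archives": [{"archive": "<tar name>", "members": ["<file>", ...]}, ...]}` stored next to the tar archives; the function names are likewise invented for this example. The index maps every file to the archive that holds it, so a user can retrieve one file without browsing or unpacking the other archives.

```python
import json
import shutil
import tarfile
from pathlib import Path


def load_manifest(manifest_path):
    """Read the (hypothetical) JSON manifest describing the archives."""
    return json.loads(Path(manifest_path).read_text())


def build_index(manifest):
    """Map each member path to the name of the archive that contains it."""
    return {
        member: entry["archive"]
        for entry in manifest["archives"]
        for member in entry["members"]
    }


def fetch_member(archive_dir, index, member, dest_dir):
    """Copy a single member out of the archive that the index points to."""
    archive = Path(archive_dir) / index[member]
    target = Path(dest_dir) / member
    target.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r") as tar:
        src = tar.extractfile(member)
        if src is None:
            raise ValueError(f"{member} is not a regular file in {archive}")
        with open(target, "wb") as dst:
            # Stream the content so large files are not held in memory.
            shutil.copyfileobj(src, dst)
    return target
```

With such an index, only the single archive containing the requested file has to be opened or transferred, instead of the person who produced the data selecting individual native files by hand.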
- Marco Baldi, Bologna University, marco.baldi5(a)unibo.it