Data services, technology & expertise: the community perspective

Interview with Alberto Michelini, Director of the National Earthquake Center of the National Institute of Geophysics and Volcanology (INGV), Italy

So Alberto, we know that EUDAT is providing a range of services for storing and managing research data. What kind of data and data producers and consumers are we talking about here?

EUDAT is primarily designed to provide data services for European researchers (although naturally we want to make the data available for world-wide research, wherever appropriate). We have to address the needs of both researchers and members of the general public who are producing or using very large data sets for research purposes. A typical example of this kind of data comes from the continuous series of observations recorded by our networks of seismic stations over time, or from waveform simulations of earthquakes. We also have researchers who generate or work with many small sets of data, such as those resulting from the analysis of ambient seismic noise cross-correlation. In both these cases, the researchers have quantities of data that are either too large to store on their local departmental computer facilities (or their personal computers and laptops) or that need to be moved across to specific high-performance computing facilities and services for analysis.

Right, so given that many researchers are producing or working with these vast quantities of data, what do their research communities actually need to manage all this data effectively?

Basically the research communities need places to store their data – and usually they want to make that data available to other researchers too, so we need good tools that make it possible for people to search for and find specific kinds of data. Many researchers also need ways to move or copy their data to and from the high-performance computing centres where it is processed. As I mentioned before, researchers themselves want to share their data so it can be (re)used by others (and ultimately contribute to society and help solve some of our grand challenges, such as the accurate prediction of seismic ground motions at high frequencies, which is very important for reducing seismic hazards). However, the researchers who make their data available to others and the researchers who use others’ data both need to be absolutely certain that the data is managed and stored in a secure, professional and persistent manner.

Thanks. Now that we know the research communities definitely need reliable “library”-type facilities for their data (storage, searching and transferring), wouldn’t it work if all the research institutions just provided their own data storage and management? Could we solve things that way?

Yes, in Europe we could adopt a model whereby individual institutions were responsible for their own data handling. However, it is important to understand the size of the investment (in terms of both time and money) that would be needed to provide suitable data management facilities. The kind of computer system that can adequately store such huge amounts of data is not cheap to install, and providing the right tools for managing the data takes time and requires a lot of resources. For example, one either needs sufficient personnel to handle the data management side of things, or the researchers need to take time from their research to do their own data management – which is certainly not an efficient use of their research time. In any case, we are now at a breakpoint since research communities are, on the one hand, definitely responsible for acquiring and quality-checking their own data, yet, on the other hand, they are having to rely on much more modern “tools” for data archiving, curation and preservation.

In the same way that we can have access to much faster high-performance computer systems by sharing resources (for example, through the PRACE project), we can also enjoy much better data management and storage facilities if we share and work together. And this is where EUDAT comes in. Our purpose is to pool resources across Europe to provide us all with much better research data “library” services than any of us could afford individually. In other words, EUDAT can be “the engine under the hood” to drive the vehicle for all our research communities’ data needs.

Let’s be honest here, Alberto, all of that really sounds great, but are you finding that the research communities leap up and embrace EUDAT immediately?

Well, the reality is that most researchers have far too much to do and too little time in which to do it, so they tend to stick with what they already have and know, as it doesn’t require any extra work on their part right now. You know how it feels when you’re really busy trying to do something, and because you’re focusing on what you’re trying to achieve, you don’t want to take time out to look for or learn a better way of doing it, simply because it would mean that, in the short-term, what you are trying to do now would take a bit longer. As humans, we tend not to look for long-term solutions unless we feel we have plenty of time – but that often catches up with us later. So, at EUDAT, we do need to help the research communities to overcome this very natural inertia and scepticism, and show that the gains to be had from a relatively small investment in EUDAT go far beyond what can be achieved by us all continuing to work as we have been doing on an individual basis. Having said that, one of the very positive signs is that some research communities in the solid Earth sciences have already recognised weaknesses in their data organisation and are in the process of bringing in community-wide standards. This shows that many, many people are starting to appreciate and work towards truly global research data. In any case, it is also important that the research communities develop their own “thin layers” that allow their data to comply with the EUDAT data framework, and thus be fully utilizable.

Wow! That is fantastic news. Even so, when you’ve shown the communities the benefits of joining EUDAT, don’t you still get some researchers who feel it would be better to provide their own data handling, so they know the service will last? After all, EUDAT is only a recent project that is set to finish at the end of this year – how do we know that the services it provides will be available after that?

Good question! And yes, people do indeed have very valid concerns about the long-term viability of the EUDAT data infrastructure – what use would data storage be if the data disappeared a year or two later? It is absolutely vital for the research communities to understand that a major facet of EUDAT’s mandate has always been to develop its data services in such a way that the services will last and continue to be available indefinitely. The EUDAT consortium is literally a network of trusted centres – data centres, HPC centres, and research institutions – all closely connected across Europe to offer communities reliability, sustainability and confidence in the longevity of the services. Many of our European universities have existed for nearly a thousand years now, so I think we can all be confident that they will still be here for a long, long time to come!

Ok Alberto, now you’ve convinced me that these EUDAT data services will provide more storage than my poor wee laptop, and that I’m still going to be able to access my data in another ten years, why should I bother to change how my data is organized to fit with EUDAT?

Ah, I’m glad you brought up that issue. Firstly, it is important to understand that your data may be organised in a relatively ad-hoc manner. Few researchers have had the time to sit down and evaluate suitable data organisation strategies, since they are busy doing their actual research, either producing more data or analysing existing data. As I said earlier, many research communities have recognised this shortfall and are working on establishing standards for their data. All of this means that there are often better forms of data organization that could be adopted than what you may have on your laptop! It makes practical sense if a small number of people in EUDAT work on efficient ways to organize data, and provide the results to all the research communities – rather than having every research community and institution doubling up on this work. Although this seems quite rational and logical, it is somewhat difficult to apply in practice since there are legacy data formats that have been adopted globally and must be preserved. This is where the need for APIs that ease the mapping of these legacy standards into those used by EUDAT comes into play. The great advantage of using these APIs is that EUDAT does not require or force people to substantially reorganize their data. Instead, we have developed our services in such a way as to minimise the effort of integrating data into the EUDAT infrastructure. Essentially there is simply a thin layer to be developed or implemented in order to glue the data organization for a particular community into EUDAT.
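
To make the idea of a "thin layer" a little more concrete, here is a minimal sketch in Python of what such a mapping might look like for a seismology community. Everything in it is an assumption made for illustration: the CommonRecord fields, the map_legacy_waveform function and the identifier layout are hypothetical and do not describe an actual EUDAT schema or API. The point is only that the legacy file stays as it is, and a small piece of code translates its description into a shared form.

```python
from dataclasses import dataclass


# Hypothetical common record: the field names are illustrative assumptions,
# not an actual EUDAT metadata schema.
@dataclass(frozen=True)
class CommonRecord:
    identifier: str
    community: str
    title: str
    start_time: str
    data_format: str


def map_legacy_waveform(legacy: dict) -> CommonRecord:
    """Translate a legacy waveform description into the common record.

    Only the description is mapped; the legacy data file is left untouched.
    """
    # Join network, station, location and channel codes into one stream name.
    stream = ".".join([legacy["net"], legacy["sta"], legacy["loc"], legacy["cha"]])
    return CommonRecord(
        identifier=f"seismology:{stream}:{legacy['start']}",
        community="seismology",
        title=f"Continuous waveform {stream}",
        start_time=legacy["start"],
        data_format=legacy.get("format", "unknown"),
    )


# Illustrative legacy entry; the values are example inputs, not real holdings.
legacy_entry = {
    "net": "IV", "sta": "AQU", "loc": "", "cha": "HHZ",
    "start": "2009-04-06T00:00:00Z", "format": "miniSEED",
}
record = map_legacy_waveform(legacy_entry)
print(record.identifier)
```

In a sketch like this, each community would write and maintain only its own mapping function, while the search and storage services work against the shared record.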

Hmmm, that sounds like it will make it easy to integrate with EUDAT. But I guess it will be a lot of work to install all the EUDAT services and get them working in my institution or community?

On the contrary! The whole point of EUDAT is that we are designed “for the researchers, by the researchers”, so to speak – with help from our data experts, of course. None of us want to waste precious time on getting new services to work, so EUDAT’s services are deliberately designed to be easy to install. We also offer a range of training opportunities to help people use the services, including user meetings and our annual conference. (Please check our website for further details.)

Ok, but what happens if one of your data services doesn’t really do what my research community needs?

Since EUDAT has many researchers from different disciplines and backgrounds, we are well aware that there are times when the needs of a particular research area cannot be met by a one-size-fits-all data solution. Consequently we have designed our services so that it is relatively simple to tweak one of EUDAT’s services to make it work for the specific needs of a particular community and its data services. In general, and as I mentioned previously, the availability of APIs and suitable community web services can be a great help in the process of integrating community data into EUDAT.

Right, and now you’ve really convinced me that EUDAT can help my research community, what do we do next to participate?

Get engaged and step aboard the EUDAT ocean liner to avoid the coming data tsunami! Seriously, no matter how convinced or unconvinced you are, EUDAT warmly invites all research communities to contact us – we’re more than happy to answer questions, and to discuss how EUDAT’s data services can help your researchers handle their data more effectively and efficiently, both now and in the future. Also, remember that I said EUDAT is “for the researchers, by the researchers”? That means we definitely want people to tell us if there are other data services that are needed to manage European research data.

Thanks Alberto! So, to summarise what you’ve been telling me, when all is said and done, European research communities can enjoy a smooth and simple transition into using EUDAT services and becoming part of the global research data community if they follow the steps below.

1. Understand the benefits of the services EUDAT is offering to the research communities.
2. Understand how much each community would have to invest individually (in terms of time and resources) to obtain similar benefits.
3. Overcome natural human inertia and scepticism – go beyond the tendency to keep doing things the same way just to save time in the short-term.
4. Understand that EUDAT services really will last.
5. Understand that there are often better ways to organize data than those currently being used within the communities.
6. Be clear that it is easy to integrate existing data into the EUDAT infrastructure.
7. Be clear that it is easy to install the EUDAT services.
8. Be clear that there are easy ways to interact with the EUDAT services, to tweak specific EUDAT services to match community services, and to create thin layers that map community data into the EUDAT data framework.
9. Contact EUDAT and step aboard!

