Submitted by Peter.Wittenburg on Fri, 06/14/2013 - 13:35

In 2012 three experts from EUDAT visited a number of American sites that are involved in data research and development projects: Datanet Indianapolis, RENCI, Oak Ridge National Laboratory, and the Johns Hopkins University Baltimore. The aim of the trip to the eastern part of the United States was twofold: to gain a deeper understanding of the challenges that scientists face when working with large quantities of data, and to find concrete opportunities for international collaboration. We continued this mission in 2013 with a group of seven EUDAT experts (Damien Lecarpentier, Michael Lautenschlager, Daan Broeder, Mark van der Sanden, Johannes Reetz, Rob Baxter, and Peter Wittenburg) travelling to the USA, this time visiting institutions and data projects in the South-West: Los Alamos National Lab (LANL), the University of New Mexico Albuquerque (DataONE), and the San Diego Supercomputer Center (SDSC).

Los Alamos National Lab (LANL)

At LANL the EUDAT experts enjoyed an intensive exchange with members of Herbert van de Sompel’s group, who are well known for their major contributions to the specification of web standards and services. For example, the group has been extensively involved with the ORE data and metadata packaging standard, the Open Annotation Format, the Memento Service (adding a time dimension to web archiving) and, most recently, the ResourceSync framework (which allows the content of web archives to be synchronized).

In these discussions, we indicated that EUDAT is extremely interested in collaborating on the Open Annotation Format (OAF). This is particularly relevant as EUDAT is currently designing and developing a “common” tool for a specific form of semantic annotation, motivated mainly by the crowd-sourcing methods used in some disciplines and research communities such as biodiversity (LifeWatch) and languages (CLARIN). To achieve a high degree of interoperability, EUDAT should use OAF as the underlying format for all its annotation services, and LANL’s involvement will ensure that EUDAT interacts closely with OAF.

The new ResourceSync development is also of great interest to EUDAT because it specifies an interface for replicating, or even synchronizing (bi-directional), data objects, and data replication is one of the core services that EUDAT provides. EUDAT actually offers different flavours of data replication (safe replication, light replication, and tailored replication), but each of these must adhere to the same kind of protocol. To enable us to collaborate with LANL on this, we agreed to begin by matching EUDAT’s present practices against the current ResourceSync specifications, after which we will give feedback to LANL. We envisage that further collaboration will then follow.
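To make the replication idea more concrete, here is a minimal Python sketch of how a replica might decide what to fetch from a ResourceSync-style resource list. It is an illustration only, not EUDAT's replication service or LANL's reference implementation. The sample document follows the Sitemap-based layout of the published ResourceSync specification, which may differ in detail from the draft that was current at the time of the visit. The example.org URIs, the hash values, and the helper names parse_resource_list and plan_sync are all hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by Sitemap documents and the ResourceSync extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical resource list published by a source repository.
RESOURCE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2013-06-01T00:00:00Z"/>
  <url>
    <loc>http://example.org/data/a.nc</loc>
    <rs:md hash="md5:aaa111" length="1024"/>
  </url>
  <url>
    <loc>http://example.org/data/b.nc</loc>
    <rs:md hash="md5:bbb222" length="2048"/>
  </url>
</urlset>"""


def parse_resource_list(xml_text):
    """Return a dict mapping each resource URI to its advertised hash (or None)."""
    root = ET.fromstring(xml_text)
    resources = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        resources[loc] = md.get("hash") if md is not None else None
    return resources


def plan_sync(source, replica):
    """Compare source and replica inventories (URI -> hash).

    Returns (to_fetch, to_delete): URIs that are new, changed, or have no
    advertised hash at the source, and URIs the source no longer lists.
    """
    to_fetch = sorted(
        uri for uri, digest in source.items()
        if uri not in replica or digest is None or replica[uri] != digest
    )
    to_delete = sorted(uri for uri in replica if uri not in source)
    return to_fetch, to_delete


if __name__ == "__main__":
    # Hypothetical state of the replica: a.nc is current, b.nc is stale,
    # and c.nc has been withdrawn at the source.
    replica_inventory = {
        "http://example.org/data/a.nc": "md5:aaa111",
        "http://example.org/data/b.nc": "md5:old000",
        "http://example.org/data/c.nc": "md5:ccc333",
    }
    fetch, delete = plan_sync(parse_resource_list(RESOURCE_LIST), replica_inventory)
    print("fetch:", fetch)
    print("delete:", delete)
```

The sketch covers only the comparison step in one direction, from source to replica. Transfer, integrity checking after download, and the policies that distinguish the different replication flavours would sit on top of it.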

We also attended a special workshop organized at LANL where different research teams presented their latest data solutions. These ranged from IT-based offerings (such as computers with special architectures for performing research on big data) to special analytical tools (required for data-intensive scientific analysis). The work on a process database was particularly interesting, since it was based on the assumption that many scientific workflows include recurring processing steps. If this assumption turns out to be correct, EUDAT could identify such recurring workflows and offer them as default services.

Several speakers stressed the need for global collaboration at several levels (for example, on data-intensive solutions and on standards).

DataONE

At the University of New Mexico, we were fortunate to be able to hold very informative in-depth discussions with researchers involved in the Data Observation Network for Earth (DataONE) project. This project has been running for much longer than EUDAT and focuses solely on the earth observation domain, in contrast to EUDAT’s cross-disciplinary activities. Despite this difference in scope, we identified many areas of overlap in our approaches, and unsurprisingly found a few major differences too. The core aims of both projects centre on making data more visible, improving data sharing and re-use (by means of appropriate infrastructure), and providing education. During the discussions it became obvious that DataONE and EUDAT share many of the building blocks of data management and access infrastructures, such as creating a domain of registered objects (where persistent identifiers and metadata are crucial), ensuring persistence of data by replication in trusted repository networks, and offering easy entry points for depositing research data.

One of the major differences that we highlighted was the organisational structure of the projects. DataONE is run by full-time experts with an “information science” background. Major core discussions about functions and services are carried out in working groups that integrate a number of domain, topic and IT experts. IT is organized as a pool of experts who provide services for designing and implementing technologies that are pushed forward by the working groups or other community interactions. A lot of effort (equivalent to about 30% of the project’s funding) is devoted to broad community outreach in a wide range of formats, such as training courses, hands-on sessions, summer schools, technology testing sessions, and educational materials. The working group participants seem to be highly committed, which is doubtless due to DataONE’s inspiring framework, and is also encouraged by the funding they receive and the fact that participants can write scientific papers about the work done in the groups.

EUDAT is organized somewhat differently from DataONE. This is partly due to the way that European projects need to be structured (which includes some restrictions on how money can be spent). In EUDAT’s case, the user communities are mainly organized under the auspices of the European Strategy Forum on Research Infrastructures (ESFRI) or other similar initiatives. This is important when it comes to defining the services that will be offered and specifying their requirements, so that the resulting services have a high probability of being widely used by the communities. For EUDAT, it has been very important to focus on the early delivery of the first services (six services are currently being developed and four are now ready to be offered). Delivering the services early on is vital to show that these common services are not just theory, but have practical relevance. One of the limitations on EUDAT is that we do not have the funds to undertake extensive community building, and, even if sufficient funding were available, community building would still be a lot more complicated due to the truly cross-disciplinary nature of the project.

We certainly learned a lot from the interaction with our DataONE colleagues and will adapt our strategies in some areas within the limits imposed by the EUDAT project structure. For example, we will intensify training and education activities, and create concept notes on the principles behind relevant issues that can inspire broad discussions. We are also looking to organize focused workshops to engage experts from other communities that might be interested in collaborating within the EUDAT framework (which will always have a focus on offering concrete data services).

Another benefit of the discussions with DataONE was that we found potential areas for close collaboration between the two projects. One such area is our common interest in making the Research Data Alliance a success. It could also be beneficial to exchange educational material or metadata records, and to set up federation nodes for testing replication technologies.

San Diego Supercomputer Center (SDSC)

SDSC aims to establish a strong position in the new area of science on Big Data by providing appropriate computational facilities (in particular with large memory and a cloud service) and by participating actively in the XSEDE project (to improve networking and interfacing for Big Data). SDSC is also active in improving interoperability at various levels, ranging from fast networking (as in the XSEDE project) to semantic interoperability solutions in global collaborations. Another area of effort for SDSC is providing professional frameworks for building and executing scientific workflows, backed by strong expertise and a selection of scientific services that can be included in those workflows. Some of these projects address issues such as maintaining provenance information, coping with large dynamic databases, dealing with complex data formats, and making it possible to use ontologies in multi-disciplinary settings. SDSC is also very much involved in global neuroscience initiatives, such as INCF. In these contexts SDSC is not only gathering metadata about data sets and developing tools, but is also working towards a persistent network of collaborating centres.

SDSC is also participating in the EarthCube project, which is looking into building blocks for future data infrastructures. EarthCube is a project in the area of earth sciences that focuses on data discovery, mining and integration. In this project, the SDSC group has special responsibility for the discovery work. User requirements extracted from use cases form the basis of a reference architecture. Specifications for proper data management and stewardship, together with prototyping activities, will lead to components that support facilities such as data citation and metadata searching in a federated scenario.

In our discussions with SDSC, we found many areas for possible collaboration, since, for example, offering metadata in federated scenarios (including different semantic domains) and offering advanced workflow services on stored data are issues that EUDAT is working on.

In summary, this visit was of great relevance for EUDAT as it showed us other ways of organizing complex work on data infrastructures, and gave us a deeper understanding of the kind of services and technologies that important groups in the US are working on, as well as enabling us to identify further areas suitable for future collaboration.

References

LANL http://www.lanl.gov/index.php
DataONE http://www.dataone.org/
SDSC http://www.sdsc.edu/
ORE http://www.openarchives.org/ore/
Open Annotation http://www.openannotation.org/
Memento Service http://www.mementoweb.org/
ResourceSync http://www.dlib.org/dlib/september12/vandesompel/09vandesompel.html
Research Data Alliance http://rd-alliance.org
EarthCube http://earthcube.ning.com/
XSEDE https://www.xsede.org/web/xsede12/welcome
INCF http://www.incf.org/