34 No. 4
Many high-level activities now focus on the management of research data sets, their archiving, and their re-use: in effect, their publication. On 5 March 2012 ICSTI, the International Council for Scientific and Technical Information, convened a one-day workshop, held at ICSU headquarters in Paris, on "Delivering Data in Science" to survey some of the most pressing issues.
The session on Data and the Policy Makers opened with an account by Ray Harris of the SCCID, the ICSU Strategic Coordinating Committee on Information and Data, which he chaired between 2009 and 2011. As ICSU is an interdisciplinary and international body, the committee's recommendations are important because they represent the priorities for science worldwide. Its six main recommendations related to scientific data were as follows:
- ICSU National and Union Members should adopt a guide to best practice (presented in the SCCID report) covering aspects of data policy, governance, planning and organization, standards and tools, management and stewardship, and data access. This should help to foster a common view of the significance of these issues across all domains.
- ICSU Members should explore and agree on the terms used under the umbrella of "Open Access" to clarify a very muddled terminology and in consequence help to distinguish and prioritise factors leading to universal and equitable access to publications (guided by ICSTI and INASP) and data (CODATA).
- ICSU Members should improve the whole process of creating data as a publication, with increased academic recognition, appropriate behavior modification, and a possible role for legal deposit libraries in providing infrastructure or services.
- ICSU should use its affiliated organizations, CODATA and the World Data System, more actively in managing large-scale data activities.
- Practical help needs to be given to less economically developed countries, again using the existing networks of ICSU and its affiliated bodies.
- There should be greater interaction with the private sector to use commercial expertise and resources for mutual benefit.
One potential weakness of SCCID's analysis of data management is that it does not consider separate strategies for "raw" versus "processed" data. In part, this is a philosophical decision—many of the technical challenges of handling electronic information do not depend on the nature of that information within the scientific experiment/publication life cycle. Nevertheless, several later presentations did demonstrate how different strategies needed to be applied in different scientific fields to data that had undergone various stages of processing. In crystallography, IUCr Journals and Commissions have long promoted the exemplary position of requiring coordinates and structure factor amplitudes (our "processed data") to be deposited. The IUCr's Diffraction Data Deposition Working Group is now wrestling with the possible next step of archiving the "raw data."
Four succeeding presentations at the workshop gave a survey of policy and funding support available from national and regional funding organizations, which will have a key role to play in realizing the vision laid out by ICSU.
Stefan Winkler-Nees discussed the recommendations on data of the Alliance of German Science Organizations. Research funding in Germany is shaped by a division of responsibilities between the federal government and the regional Länder, and in part by German research institutions' traditionally strong relationships with private industry. Nevertheless, common principles for archiving and free access to publicly funded research data have emerged that are similar to those of other countries, and there is significant funding to help German science organizations realize these principles. The speaker referred to a frequent perception among some authors that suitable data archiving was to "stuff a CD or DVD in a desk drawer", which is clearly not easily accessible to a wide readership and depends on the author remaining alive during the lifetime of the publication. The working group learned that universities, at least in the UK, are beginning to wake up to the need to provide their staff with centralized archives not only as good practice but also to avoid inadvertent research malpractice.
Carlos Morais-Pires of the European Commission described the preparations for the next European framework program for research, development, and innovation (Horizon 2020) and emphasized the positive commitment to research and development, mirrored by a likely increase in funding of 40–45 percent over the coming seven-year period. Much of the commission's emphasis will be on Open Science, in the belief that open content, open infrastructures, and an open culture will work together to create optimal sharing of research results and tools. The impetus for data management strategy in this program comes from the influential Riding the Wave report of October 2010.
Rob Pennington of the U.S. National Science Foundation described the cyberinfrastructure for the 21st Century program (CIF21), which will make some USD 200 million available for data infrastructure investment. NSF has been very keen to assess how best to distribute funding within the new and still poorly understood paradigm of data-driven science. Pennington described the detailed consultation and review processes that have informed CIF21. It is built around several grand challenges, but is seeking to provide multidisciplinary and multiscale integration to draw real and useful science out of the sea of data. While NSF feels that it has been "behind the curve" in this area, it is moving forward with a very strong and focused commitment. Already, funded projects are required to provide a data archiving plan in grant proposals and to account for themselves in their annual reports to the NSF and at the end of their grant awards. This "policing" of its own policy is an important step in itself.
Runda Liu, a replacement for Peng Jie of the Institute of Scientific and Technical Information of China (ISTIC), described the Chinese Scientific Data Sharing Project, in which ISTIC is an active participant. China wishes to follow the western model of data sharing and reuse across research institutes and end users, and has been building up a national distributed network that now includes 10 data centers and over 100 branches and nodes, covering over 3000 databases. China is an active participant in CODATA activities and is enthusiastic about participation in the World Data System. ISTIC works with the Wanfang Data Agency to provide digital object identifier (DOI) registration for scientific data sets in China. There is also significant investment in developing scientific data classification and navigation systems and in building an Internet platform for scientific data resource information.
A session on "Data in Practice" brought some ground-level perspectives to these high-minded policy objectives. Todd Vision described DRYAD, a system within the life sciences that allows authors to deposit their supporting data sets at the same time as they submit a research article for publication. Currently there are 25 journals with deposition/submission integration in this field; each deposited data set has a unique DataCite-supplied DOI. The philosophy behind the system has been to make it easy and low-cost for authors to deposit their data, and this strategy is broadly working. The downside, however, is that the deposited data sets are described by limited metadata. There is some encouraging evidence of re-use of deposited data sets by other researchers, but there is also some concern that providing too easy a route for deposition might hurt existing curated data centers by diverting material away from them.
If DRYAD handles "long-tail" scientific data, the opposite extreme is faced at the particle physics facility CERN, as described by Tim Smith. In 2012, over 22 petabytes of data were recorded on the Large Hadron Collider (LHC), although this is only a fraction of the amount that can be generated by the experiments. CERN must invest heavily in data filtering procedures to trap only the small fraction of the experimental results that may be of interest to specific experiments. Even then, the large volume of data (most of which is reduced and analyzed in research institutions outside of CERN) requires very large data storage facilities distributed around the world, and dedicated high-bandwidth optical private networks to transport the data between nodes. An interesting feature of particle-physics "information" was that, as one moved along the data pyramid from large volumes of raw data through smaller volumes of processed data to the relatively small volume of published results, the proliferation of multiple copies of more highly processed information actually amplified the data management problem. It was estimated that the 22 petabytes of raw data collected in a year gave rise to a total of 70 petabytes of duplicated and derivative data that needed to be tracked, verified, and reconciled. One beneficial aspect of the data explosion was that for each generation, the archiving of previous generations' output (including content migration to new-generation media) became progressively less burdensome. Another feature of the LHC's data output is that the data is not really digestible by many other research workers; it is almost as if all those that could digest the data are already co-authors on the publication(s)!
Toby Green of OECD demonstrated the visualization and access gateways to data sets published by the Organisation for Economic Cooperation and Development. Where there is a significant holding of well-characterized and homogeneous data, it becomes cost-effective to develop tools to make it easier for end-users to access and visualize those data. For the OECD data sets, simple web-based applications allowed the extraction and combination of data sets in many ways. Very granular dataset DOIs facilitated linking statistical tables to publications, and the potentially difficult issues of tracking time-variable data sets were being tackled initially by detailed versioning.
The after-lunch session on Global Initiatives began with a description by Michael Diepenbroek of the ICSU World Data System, the federation of data centers largely in the earth sciences. Much of the impetus behind this system is the establishment of common norms of quality and interoperability across a very diverse spread of activities, and early attention is focusing on organizational aspects, including the establishment in Japan of a coordinating International Program Office. Among the technical aspects of the new system are the orderly registration of DOIs and linking to associated publications in a way that will give due credit to those collecting and curating the data.
Jan Brase, Managing Agent of DataCite, and overall Chair of the Workshop, explained how DataCite acted to register data DOIs across the sciences.
Geoffrey Boulton previewed the forthcoming Royal Society policy report "Science as an open enterprise", which would discuss the major policy issues surrounding research data management, drawing on recent cases such as the "Climategate" affair and on the perception that the data deluge offers both challenges, in terms of handling vast quantities of data, and opportunities to involve a wider research community, and indeed the citizen scientist.
Françoise Genova closed this session with an account of the Astronomical Virtual Observatory, a good example of a discipline-wide and international approach to data handling and linking to publications.
In the final session, Publishers and Data, three academic publishers gave their perspectives on the integration of data management and archiving with the much longer-established business of learned journals.
Eefke Smit of the International Association of STM Publishers (representing over 100 publisher members) described some individual journal initiatives to enhance scientific articles through linking to associated data sets, and spoke also of the PARSE-Insight survey ("Permanent Access to the Records of Science in Europe") that had identified the current patchy distribution of scientific data archived in orderly and accessible ways. Fred Dylla of the American Institute of Physics (AIP) preferred to emphasise the traditional added value of the publishing enterprise and see integration with supporting data as a simple extension of the existing paradigm. Alicia Wise of Elsevier described some initiatives within the Elsevier stable of journals to enhance linking between articles and data sets.
Overall, this workshop provided a helpful snapshot of the state of play in making research data available within the framework of the record of science. There are encouraging signs that public policy is well informed and is moving towards encouraging orderly curation of data across many disciplines. Within this framework, public funding is available for well-defined data management activities, and this may provide some resources for individual disciplines to address any needs they have that cannot be met by existing academic funding. There is, of course, huge disparity both in the types of data across different disciplines, and in the sophistication of different communities with the management of their data. This does provide a continuing challenge to publishers, especially large organisations publishing across many different scientific fields. As yet, the ability of publishers to take advantage of specific data handling opportunities seems rather limited. Initiatives such as DOI, which now provides persistent unique identifiers in both the publishing and data worlds, do of course facilitate linking and citation, which are important first steps. But there is still a great deal to do before there is routine validation, visualisation and reuse of data across the whole field of science. It is very beneficial that organisations such as CODATA and ICSTI are both aware of the problems, and well placed to work together with the many relevant stakeholders to bring this vision closer to reality. Full report and workshop presentations are available online.
last modified 10 July 2012.
Copyright © 2003-2012 International Union of Pure and Applied Chemistry.