Vol. 28 No. 1
Data: The Quest for a Universal Format
by Robert Lancashire and Tony Davies
To quote a recent report from the International Council for Science (ICSU)1:
“Because of the critical importance of data and information in the global scientific enterprise, the international research community must address a series of new challenges if it is to take full advantage of the data and information resources available for research today. Equally, if not more important than its own data and information needs, today’s research community must also assume responsibility for building a robust data and information infrastructure for the future.”
IUPAC has been intimately involved with this challenge for many years, particularly with respect to getting instrument vendor consensus while creating and maintaining internationally recognized vendor-neutral scientific data formats suitable for interchange between analytical instruments, laboratories, reference data collectors, and archives. With the additional importance now being placed on this work—especially by ICSU—it is worthwhile to review the situation with respect to scientific data and to examine some of the prospects for the development of a universal spectroscopic data format.
IUPAC Scientific Data Standards
Many of the ideas that predominated when standard file formats were originally designed in the 1980s are perhaps no longer appropriate, given the rapid technological changes that have made file storage less expensive and more reliable and the interconnection of equipment so much faster. New industrial regulations may require exact copies of original data to be made available electronically for inspection for many years after the actual experiments are performed. This has been especially relevant in the pharmaceutical industry, where the U.S. Food and Drug Administration in 2000 brought out a set of guidelines (21 CFR Part 11) outlining the steps needed to make electronic records legally equivalent to paper records. These rules were initially accompanied by explanatory guidelines that interpreted the rules in an extremely strict manner, one that could not actually be met by any of the scientific computing equipment then available. At that time, one of the few data migration solutions that could even come close to meeting the regulation’s requirements was the IUPAC JCAMP-DX series of data standards. Although the standards provided the essential framework to satisfy the regulators, the commercially available implementations had to be enhanced beyond the published IUPAC standard definitions to be fully compliant. More recently, the U.S. Environmental Protection Agency brought out similar requirements on long-term electronic data storage.
Lawmaking aside, one point remains clear—if you need to produce and work with electronic analytical data, it should be stored in a standard, well-documented, vendor-neutral format. Even ignoring hardware compatibility issues, experience has shown that many instrument vendors themselves cannot produce the correct documentation or converters for their older legacy data formats.2
Display of an IR spectrum on a web page using a browser plug-in. Interpretation of the IR of acetophenone.
State of the Art
So what is the current state of affairs? A survey of spectroscopic data types in common use suggests that more than 100 different formats are being used to store essentially similar information types. This is a substantial increase from 1997, when a review of the common formats then in use and the applications available for their manipulation was published.3 However, just listing applications and file extensions misses one of the more fundamental problems involved with tackling the long-term availability of analytical data stored in electronic form. Behind each file format often lies a series of different formats or versions in which each instrument software release has slightly changed the format, despite retaining the old file extension. Instrument manufacturers often attempt to maintain backward compatibility within their own software, but doing so makes the life of the archivists very difficult. In one notorious case, a pharmaceutical company was generating files with a particular file extension from an analytical spectrometer in a completely different binary format than that documented by the manufacturer. Upon closer investigation, it turned out that a development prototype software version had been installed at the customer’s site without the knowledge of the main company, just because doing so solved a few technical problems for the installation engineer. The development prototype had never made it through the software validation cycle, and the development direction stopped. But the pharmaceutical customer was not told that the instrument software they were relying on was an unvalidated, unreleased prototype with essentially no support. One more reason to move to independent, vendor-neutral standard formats for long-term archiving!
If other scientific data formats are included, it is particularly surprising that many of them (at least 50) are for molecular graphics files. Given that the purpose of many of these formats is simply to record x, y, and z coordinates, this proliferation is unexpected, although de facto industry standards such as the CIF, PDB, and MOL file formats probably account for more than 85% of overall acceptance. The JCAMP-CS protocol supported by IUPAC,4 one of the initial attempts by an international standards body in this arena, has not been adopted by software developers to any great extent.
With respect to spectroscopic instruments and data, the situation does seem to be improving. Although the number of proprietary data formats is large, it appears to be relatively stable; furthermore, with instrument company mergers taking place, the number may well be decreasing. A measure of the success of an IUPAC project on data protocols is the uptake by both instrument vendors and software developers. The IUPAC/JCAMP-DX standards project should therefore be considered a success: Almost all spectroscopic instrument manufacturers include an export option to JCAMP-DX, and more than 30 different software packages use JCAMP-DX for both import and export of data files.5 Most commercial spectroscopic database packages incorporate data entry through files in JCAMP-DX format, as do the chemometrics packages found in the analytical field. (For more information on this project or to receive copies of the published protocols and latest drafts of new versions, go to <www.jcamp.org>.)
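To illustrate why a text-based, vendor-neutral format is straightforward to import, the following is a minimal sketch of reading a JCAMP-DX file. It assumes only the basic layout of the format: labeled data records that start with `##`, a label terminated by `=`, `$$` introducing a comment, and labels compared without regard to case, spaces, hyphens, slashes, or underscores. The sample file content and the function names are hypothetical, and the sketch handles only uncompressed, space-separated `(X++(Y..Y))` data, not the compressed encodings that real files often use.

```python
import re

# Hypothetical sample file content, invented for this sketch.
SAMPLE = """\
##TITLE=Hypothetical IR spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##ORIGIN=Hypothetical lab $$ inline comment
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##FIRSTX=4000
##LASTX=3998
##YFACTOR=1.0
##NPOINTS=3
##XYDATA=(X++(Y..Y))
4000 0.91 0.90 0.88
##END=
"""


def normalize_label(label):
    """Upper-case a label and drop spaces, hyphens, slashes, and underscores."""
    return re.sub(r"[\s\-/_]", "", label).upper()


def parse_jcamp_header(text):
    """Collect labeled data records (##LABEL=value) into a dictionary.

    Lines that do not start with ## are treated as a continuation of the
    previous record, which is how the tabular data after ##XYDATA is kept.
    """
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("$$", 1)[0].rstrip()  # drop inline comments
        if line.startswith("##"):
            label, sep, value = line[2:].partition("=")
            if not sep:  # malformed record: ignore it
                current = None
                continue
            current = normalize_label(label)
            if current == "END":
                break
            records[current] = value.strip()
        elif current is not None and line.strip():
            records[current] += "\n" + line.strip()
    return records


def decode_xydata(records):
    """Return (x, y) pairs from an uncompressed (X++(Y..Y)) table.

    X values are regenerated from FIRSTX, LASTX, and NPOINTS; the leading
    X value on each data line is skipped rather than checked.
    """
    first_x = float(records["FIRSTX"])
    last_x = float(records["LASTX"])
    n_points = int(records["NPOINTS"])
    y_factor = float(records.get("YFACTOR", 1.0))
    step = (last_x - first_x) / (n_points - 1) if n_points > 1 else 0.0
    data_lines = records["XYDATA"].splitlines()[1:]  # skip the form descriptor
    ys = [float(tok) * y_factor for ln in data_lines for tok in ln.split()[1:]]
    return [(first_x + i * step, y) for i, y in enumerate(ys)]


if __name__ == "__main__":
    header = parse_jcamp_header(SAMPLE)
    print(header["TITLE"], header["XUNITS"], header["YUNITS"])
    print(decode_xydata(header))
```

Because every record is self-describing text, a reader written years later can still recover the title, units, and data points without access to the originating instrument software.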
The new International Spectroscopic Data Bank, which went live in 2003, has adopted the IUPAC JCAMP-DX standards for spectroscopic data deposition and presentation via the Internet (see <www.is-db.org> for more information).
From an educational perspective, the availability of spectral data that can be displayed and manipulated in a Web browser via plug-ins or Java applets has opened up numerous possibilities, which have been taken up as both teaching aids and learning tools. The release in 1997 of the MDL CHIME plug-in, with JCAMP-DX spectral support, served as a catalyst in this development. It is estimated that there have been more than 2 million downloads of the free version of MDL CHIME. Examples of its use include linking IR spectra to vibrational mode animations (see the figure) and linking from NMR spectra to highlight H or C atoms.
XML and Joint Developments
XML has become a buzzword in recent years, and many efforts have been made to generate scientific data storage formats based on this language.
IUPAC started to worry about the proliferation of standard formats and organized a meeting on the subject during the 2001 IUPAC General Assembly in Brisbane, Australia; IUPAC later reported on the meeting in Chemistry International.6 In addition, the Committee on Printed and Electronic Publications (CPEP) Subcommittee on Electronic Data Standards is working with ASTM International Committee E13.15 on an XML standard for analytical data—the Analytical Information Markup Language, or AnIML for short.
The development team is working to make AnIML:
- able to accommodate any type of data
- able to offer an audit trail
- able to utilize digital signatures
Learning from the mistakes of previous standardization efforts, the team has adopted a flexible approach, allowing vendors and organizations to include their own specific fields while standardizing the data dictionaries at a core and technique level. This initiative is taking account of the new world in which we find ourselves—one in which science is facing increasing legal and regulatory scrutiny.
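The goals listed above can be made concrete with a small sketch. The example below builds an XML record that holds data points, a vendor-specific extension block, and an audit-trail entry carrying a SHA-256 digest of the data. Every element and attribute name here (`AnalyticalRecord`, `VendorFields`, `AuditTrail`, and so on) is hypothetical and is not taken from the AnIML schema. Note also that a plain digest only provides tamper evidence; a true digital signature would additionally sign that digest with a private key, which is not shown.

```python
import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def _data_digest(data_element):
    """SHA-256 over the serialized data element (hypothetical scheme)."""
    return hashlib.sha256(ET.tostring(data_element, encoding="utf-8")).hexdigest()


def build_record(technique, xs, ys, author, action):
    """Build a record with data, a vendor extension, and one audit entry.

    All tag and attribute names are invented for illustration only.
    """
    root = ET.Element("AnalyticalRecord", {"technique": technique})

    # Core data: the same structure could hold points from any technique.
    data = ET.SubElement(root, "Data")
    for x, y in zip(xs, ys):
        ET.SubElement(data, "Point", {"x": repr(x), "y": repr(y)})

    # Vendor-specific fields live in their own block, outside the core data.
    vendor = ET.SubElement(root, "VendorFields", {"vendor": "ExampleVendor"})
    ET.SubElement(vendor, "Field", {"name": "detectorGain"}).text = "2"

    # Audit trail: who did what, when, and a digest of the data at that time.
    trail = ET.SubElement(root, "AuditTrail")
    ET.SubElement(trail, "Entry", {
        "author": author,
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataSha256": _data_digest(data),
    })
    return root


def data_is_unchanged(root):
    """Recompute the digest and compare it with the latest audit entry."""
    entries = root.findall("AuditTrail/Entry")
    return _data_digest(root.find("Data")) == entries[-1].get("dataSha256")


if __name__ == "__main__":
    record = build_record("IR", [4000.0, 3999.0], [0.91, 0.90],
                          "analyst1", "acquired")
    print(ET.tostring(record, encoding="unicode"))
    print(data_is_unchanged(record))
```

Keeping vendor fields in a separate block means that changing them does not alter the digest of the core data, which mirrors the idea of standardizing the core while leaving room for extensions.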
To avoid “reinventing the wheel,” the IUPAC JCAMP-DX and ASTM ANDI ontologies form the basis of the AnIML draft standard, which is currently being developed and reviewed by various experts.
Development has been split into three main phases:
- phase 1: UV/VIS, MS, NMR, IR, IMS, chromatography
- phase 2: EPR/ESR, NIR, crystallography
- phase 3: chemometrics
For more information, go to the AnIML project Web site hosted by SourceForge at <animl.sourceforge.net>.
Another XML-based IUPAC standardization project focused on analytical data is being led by Michael Frenkel. This standard, ThermoML, is being designed for experimental and critically evaluated thermodynamic property data and is in its final stages (see provisional recommendation).
IUPAC can be proud of playing a leading role in the increasingly important field of the digital preservation of scientific records. In this area, only an independent international standardization body can hope to make headway and navigate the politics of differing instrument vendor positions.
1. International Council for Science. December 2004. ICSU Report of the CSPR Assessment Panel on Scientific Data and Information. ISBN 0-930357-60-4.
2. For more information on the digital preservation of scientific records, see “Cometh A Digital Dark Age?” by Tony Davies, Chem. Int. Nov-Dec 2002, p. 7 or <www.iupac.org/publications/ci/2002/2406/darkage.html>.
3. R.J. Lancashire, “An Introduction to Data Formats,” Spectroscopy Europe, 1997, 9(2), 22–24.
4. J. Gasteiger, B.M.P. Hendriks, P. Hoever, C. Jochum, and H. Somberg, “JCAMP-CS: A Standard Exchange Format for Chemical Structure Information in Computer-Readable Form,” Applied Spectroscopy, 1991, 45(1), 4–11.
5. For a list of some spectroscopic applications using JCAMP-DX and other data format types [link]
6. A. N. Davies, “XML in Chemistry,” Chem. Int. Jul-Aug 2002, p. 3 or <www.iupac.org/publications/ci/2002/2404/XML.html>.
Robert Lancashire is a professor at the University of the West Indies, Mona Campus, Kingston, Jamaica, and is a member of the IUPAC CPEP Subcommittee on Electronic Data Standards. Tony Davies works at Waters Informatics in Frechen, Germany, and is chairman of the CPEP Subcommittee on Electronic Data Standards.
last modified 6 January 2006.