Chemistry International -- Newsmagazine for IUPAC


Vol. 30 No. 1
January-February 2008

Internet Connection

Providing brief overviews of helpful chemistry resources on the Web.

ChemSpider and Its Expanding Web:
Building a Structure-Centric Community for Chemists

by Antony Williams

To declare that the worldwide web has changed our lives is an understatement. Its impact on commerce, information exchange, social networking, and access to a myriad of other forms of interaction and data is breathtaking and continues to expand at an astounding pace. In the domain of chemistry, scientists have long had access to text-based searching through the primary search engines such as Google and Yahoo. When a provider such as Google focuses those searches on patents¹ and literature articles,² chemists can probe those domains of information directly. In a similar way, chemists today have access to tens of thousands of chemistry articles via searches on platforms including PubMed,³ Google Scholar,⁴ and ChemRefer.⁵

While text-based searches provide a familiar environment for chemists to search and review their results, chemists’ natural affinity for communicating via chemical structures creates a need to perform searches in their own “natural language.” Ask chemists how they prefer to search chemistry databases and most will say they prefer structure-based searching. There are certainly commercial solutions that provide chemical-structure-based searches of literature and patent data (CAS,⁶ Infochem,⁷ and Elsevier MDL⁸ to name a few) as well as a myriad of solutions for managing in-house organizational data collections.

In a world changed forever by the dominance of web-based searching and the freedom that blogging now offers to scientists in terms of creativity, criticism, and connectivity, a focus on the management of only highly curated, peer-reviewed data is leaving untouched the information deposited and exchanged across the web on a daily basis. With daily updates of Open Access⁹ articles, with online theses now exposing research,¹⁰ and Open Notebook Science¹¹ starting to grab the attention of scientists, chemistry that was previously “lost”¹² might be available for all to review, if only it could be found. The challenge is finding chemistry, specifically chemical structures, across the web. Maybe ChemSpider can help?

ChemSpider¹³ was initially developed as a hobby project by a small group of dedicated cheminformatics specialists. The intention was to aggregate and index available sources of chemical structures and their associated information into a single searchable repository that would be available to everybody, at no charge. The success of the PubChem project¹⁴ had demonstrated both the value and attractiveness of an online structure database for facilitating the connections between structures and associated data. While PubChem delivered on its mission to host and disseminate data associated with the Molecular Libraries Roadmap Initiative,¹⁵ the plethora of possible extensions to such an approach to provide value to the chemistry community remained attractive to the ChemSpider team.

ChemSpider, unveiled to the public on 24 March 2007 in time for the Spring ACS meeting, delivered on one of the initial concepts. There are tens if not hundreds of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data, and more, and there is no single way to search across them. Many of these databases are for profit, and there is no easy way to determine what information is available within these commercial or open access databases. One of the initial concepts for ChemSpider was to aggregate into a single database all chemical structures available within open access and commercial databases and to provide the necessary pointers from the ChemSpider search engine to the information of interest. When the system first went online, only the PubChem data sources, around 10 million structures, were hosted as a proof of concept. As of this writing, the ChemSpider database has indexed over 8 million additional unique chemical compounds. The data hosted today come from over 80 different data sources, including chemical vendors, chemistry database vendors, online chemistry resources, aggregated data sets from the literature, virtual libraries, and user submissions.
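The aggregation idea described above can be illustrated with a small sketch. It is not ChemSpider's actual implementation; it assumes, purely for illustration, that each source record carries an InChI string that can serve as the merge key, and the names SourceRecord, build_index, and the source and record labels are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One entry as supplied by a data source (all fields are hypothetical)."""
    source: str      # name of the contributing database or vendor
    record_id: str   # the source's own identifier for the compound
    inchi: str       # structure identifier assumed to act as the merge key


def build_index(records):
    """Group records by structure so each unique structure lists its sources."""
    index = defaultdict(list)
    for rec in records:
        # Keep a pointer back to the originating source rather than copying its data.
        index[rec.inchi].append((rec.source, rec.record_id))
    return dict(index)


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANE = "InChI=1S/CH4/h1H4"

records = [
    SourceRecord("VendorCatalogA", "A-001", ETHANOL),
    SourceRecord("ToxicityDB", "TX-42", ETHANOL),   # same structure, second source
    SourceRecord("VendorCatalogA", "A-002", METHANE),
]

index = build_index(records)
unique_structures = len(index)      # structures counted once, however many sources list them
ethanol_sources = index[ETHANOL]    # pointers back to every source holding ethanol
```

Keying on a structure identifier rather than on a name is what lets records from different sources collapse into one entry, while the stored pointers preserve the route back to each original database.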

Rather than provide here a detailed examination of the functions and capabilities associated with ChemSpider, the reader is pointed to the online overview¹³ of the capabilities of the system. The capabilities include flexible text and structure-based searching of the database to facilitate structure identification, the text-based searching of over 50 000 open access chemistry articles, structure/substructure-based searching of U.S. and European patents, structure and spectra deposition to share data across the chemistry community, and the prediction of chemical properties using software provided by a series of collaborators.

ChemSpider continues to grow in its reach into the chemistry community with a number of specific missions:

Improving the quality of available information. A number of blog postings¹⁶ have raised concerns about the quality of information available in online databases. With millions of indexed compounds, ChemSpider has enabled a community-based curating process¹⁷ to help improve the association between a chemical compound and a set of identifiers (systematic names, trade names, synonyms, registry numbers).
Increasing access to chemistry-related information. There are many types of data and information that can be associated with chemical compounds and made available to the benefit of the chemistry community. As examples, the association of analytical data¹⁸ and the integration with patent searches¹⁹ have been demonstrated, and the integration with QSAR-based modeling²⁰ is presently in progress. These efforts will continue.
Providing access to online tools and services. ChemSpider already offers online prediction of certain chemical properties, and a number of software algorithms provided by collaborators will be added to the system. Web services such as the recently released InChI²¹ and OpenBabel services will continue to be made available to the community.²²

ChemSpider is proving to be a success based on a number of measures. On average over 1 200 unique visitors frequent the site every day.²³ Tens of thousands of transactions are initiated monthly. The community continues to expand as more and more people register to become data depositors and curators. The real success comes from the acknowledgment that real-world problems are being solved and that information is being found in a facile manner, and at no charge to the user, to allow them to make decisions and move on.

The intention for ChemSpider remains true to its initial vision—to build a structure-centric community for chemists. The manner by which we get there is changing with experience and available tools, but hopefully we will be part of the overall team of passionate individuals working to make the worldwide web searchable by chemical structures, improving accessibility to scientific information, and speeding the process of discovery. ChemSpider will continue to demonstrate the potential of the semantic web.²⁴

Acknowledgments
I would like to acknowledge the development team of ChemSpider. They continue to amaze me with their passion, energy level, and commitment to making a difference. We have used a number of Open Source tools on ChemSpider, but I would especially like to thank the OpenBabel team, the InChI team, and the JSpecView team. My acknowledgements to ACD/Labs, Igor Tetko, and ChemAxon for providing prediction algorithms for the system. My personal thanks to many of my fellow bloggers who keep the conversations entertaining, especially Joerg Wegner, Egon Willighagen, Jean-Claude Bradley, Peter Murray-Rust, and Rich Apodaca. My thanks to the advisory group of over 20 people from across the industry—you help make it all possible.

References
1. Google Patent Search, Beta: http://www.google.com/googlepatents/about.html
2. Google Scholar, http://scholar.google.com/intl/en/scholar/about.html
3. PubMed Central, http://www.pubmedcentral.nih.gov/about/intro.html
4. Google Scholar, http://scholar.google.com/intl/en/scholar/about.html
5. ChemRefer, http://www.chemrefer.com/chemistry_search.php?page_title=about&submit
6. Chemical Abstracts Service, http://www.cas.org/index.html
7. Infochem, http://infochem.de/
8. Elsevier-MDL, http://www.mdl.com/
9. Directory of Open Access Journals, Chemistry, http://www.doaj.org/doaj?func=subject&cpid=60
10. P. Murray-Rust, The Power of the Scientific eThesis, http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=366
11. J.C. Bradley, Open Notebook Science, http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html
12. R.C. Willis, Raiders of the Lost Chemistry, http://www.drugdiscoverynews.com/index.php?newsarticle=1167
13. Antony Williams, An Introduction to ChemSpider, http://www.chemspider.com/docs/ChemSpider_Overview_SLides_August_2007.pdf
14. PubChem: http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_Overview
15. Molecular Libraries Roadmap Initiative: http://nihroadmap.nih.gov/
16. Example Blog Postings about Quality Issues in the content of online databases: http://www.chemspider.com/blog/?p=137 and http://www.chemspider.com/blog/?p=64
17. Curating data on ChemSpider: http://www.chemspider.com/blog/?p=15
18. Uploading analytical data to ChemSpider: http://www.chemspider.com/docs/Uploading_Spectra_onto_ChemSpider.htm
19. Structure/substructure searching patents on ChemSpider: http://www.chemspider.com/docs/Structure_Searching_of_Patents_Using_ChemSpider.htm
20. Meshing ChemModLab and ChemSpider to Provide an Optimized Workflow for QSAR Modeling and Identification of Bioactive Compounds, http://www.chemspider.com/blog/?p=120
21. InChI, http://en.wikipedia.org/wiki/International_Chemical_Identifier
22. Access to ChemSpider Web Services Starts - Initial Exposure of InChI Related Services, http://www.chemspider.com/blog/?p=135
23. ChemSpider Usage Continues to Grow In a Linear Fashion, http://www.chemspider.com/blog/?p=127
24. The Semantic Web, http://en.wikipedia.org/wiki/Semantic_web

Antony Williams is the host of ChemSpider. He has spent over a decade in the commercial scientific software business as chief science officer for Advanced Chemistry Development. He is an NMR spectroscopist by training with over 100 peer-reviewed publications. He has taken his passion for providing access to chemistry related information and software services to the masses and is now applying his time to hosting ChemSpider, working alongside the intellect and innovation making up its development team and immersing himself in the experience of blogging. He can be contacted at <[email protected]>.