March 19, 2011

[2b2k] Melting points: a model for open data?

Jean-Claude Bradley at Useful Chemistry announced a few weeks ago that the international chemical company Alfa Aesar has agreed to open source its melting point data. This matters not just because Alfa Aesar is one of the most important sources of that information, but also because it provides a model that could work outside of chemistry and science.

The data will be useful to the Open Notebook Science solubility project, and because Alfa has agreed to Open Data access, it can be useful far beyond that. In return, the Open Notebook folks cleaned up Alfa’s data, putting it into a clean database format, providing unique IDs (ChemSpiderIDs), and linking back to the Alfa Aesar catalog page.

Open Notebook then merged the cleaned-up data set with several others. The result was a set of 13,436 Open Data melting point values.
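For the curious, here is a rough sketch of what merging data sets on a shared ID might look like. It is not the Open Notebook folks' actual code. The function name, the field names (csid, mp_celsius), and the records are all made up for illustration, and the numbers are placeholders rather than real measurements.

```python
from collections import defaultdict


def merge_melting_points(*sources):
    """Merge (source_name, records) pairs into one dict keyed by compound ID.

    Every value keeps a note of which source it came from, so that
    disagreements between sources stay visible instead of being averaged away.
    """
    merged = defaultdict(list)
    for source_name, records in sources:
        for record in records:
            merged[record["csid"]].append(
                {"source": source_name, "mp_celsius": record["mp_celsius"]}
            )
    return dict(merged)


# Hypothetical placeholder records, NOT real melting points.
alfa = [{"csid": 1, "mp_celsius": 10.0}, {"csid": 2, "mp_celsius": 55.5}]
other = [{"csid": 2, "mp_celsius": 56.0}, {"csid": 3, "mp_celsius": 120.0}]

merged = merge_melting_points(("alfa_aesar", alfa), ("other_set", other))

# Count every melting point value across all compounds and sources.
total_values = sum(len(values) for values in merged.values())
```

Keeping the source name next to each value is a deliberate choice in this sketch: when two sources disagree about the same compound, you can see that they do and trace each number back to where it came from.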

They then created a Web tool for exploring the merged dataset.

Why stop with melting points? Why stop with chemistry? Open data for, say, books could lead readers to libraries, publishers, bookstores, courses, other readers…


August 28, 2010

[2b2k] Scientific transparency vs. trust

Last January, Jean-Claude Bradley, an associate professor of chemistry at Drexel, posted about an assignment he gave his students: He asked them to find five different sources for the properties of a chemical of their choosing. The results were sobering.

For example, in one case a paper that had spent five months undergoing peer review before being accepted by Biotechnology and Bioprocess Engineering got the water solubility of the chemical extract of green tea (EGCG) wrong. The source of the information had it right (caffeine is 21.7 grams per liter and EGCG is 5 grams per liter), but, likely through a transcription error, the number for caffeine got appended to the number for EGCG, resulting in EGCG’s solubility being reported in the paper not as 5 but as 521.7. That number is off by two orders of magnitude, and is so high that you’d think one of the peer reviewers or editors would have caught it. The chain of data in this case goes back through several more sources to a published experiment that, unfortunately, does not contain enough information to enable us (well, chemists like Jean-Claude) to fully judge its accuracy.
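The arithmetic of that error is simple enough to sketch. The snippet below uses only the three numbers from the example above. The function name and the "flag anything a factor of ten or more away" threshold are my own assumptions, not anything Jean-Claude or the journal uses.

```python
import math


def orders_of_magnitude_apart(reported, reference):
    """Return how many powers of ten separate two positive values."""
    return abs(math.log10(reported / reference))


# Values from the example above, in grams per liter.
egcg_source = "5"
caffeine_source = "21.7"

# Appending the caffeine figure to the EGCG figure reproduces the bad number.
egcg_reported = float(egcg_source + caffeine_source)

gap = orders_of_magnitude_apart(egcg_reported, float(egcg_source))

# Assumed rule of thumb: a gap of one order of magnitude or more deserves a second look.
suspicious = gap >= 1.0
```

A check like this cannot say which number is right. It can only point out that two sources disagree by enough that someone should go back up the chain.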

Jean-Claude’s point is not that all scientific data is wrong. Rather, it is that “trust should have no part in science.” Instead we should be able to check the sources of data, preferably all the way back to the lab notebooks and the raw instrument readings. That’s the impetus behind Jean-Claude’s open notebook science initiative.

Note that in this case, the correction to the published error is likely to come via a blog, but our ecology does not have an obvious or routine way in which good bloggy information can drive out bad published data. But, no nostalgia here, please! As Jean-Claude’s post shows, for all its peer reviewers and expert editors, the old ecology gave errors a stubborn rootedness.

If you accept that humans are more fallible than we’d like, then you build systems that accommodate change. Paper is not very accommodating in this regard. Worse, its fixity has contributed to our false confidence that we can get things right and know when we’ve done so.


