NOTE on May 23: OCLC has posted corrected numbers. I’ve corrected them in the post below; the changes are mainly fractional. So you can ignore the note immediately below.
NOTE a couple of hours later: OCLC has discovered a problem with the analysis. So please ignore the following post until further notice. Apologies from the management.
Ever since the 1960s, publishers have used ISBN numbers as identifiers of editions of books. Since the world needs unique ways to refer to unique books, you would think that ISBN would be a splendid solution. Sometimes and in some instances it is. But there are problems, highlighted in the latest analysis run by OCLC on its database of almost 300 million records.
Number of ISBNs | Percentage of the records
So, 78% of OCLC’s humongous collection of book records have no ISBN, and only 1.6% have the single ISBN that God intended.
As Roy Tennant [twitter: royTennant] of OCLC points out (and thanks to Roy for providing these numbers), many works in this collection of records pre-date the 1960s. Even so, the books with multiple ISBNs reflect the weakness of ISBNs as unique identifiers. ISBNs are essentially SKUs to identify a product. The assigning of ISBNs is left up to publishers, and they assign a new one whenever they need to track a book as an inventory item. This does not always match how the public thinks about books. When you want to refer to, say, Moby-Dick, you probably aren’t distinguishing between one with illustrations, a large-print edition, and one with an introduction by the Deadliest Catch guys. But publishers need to make those distinctions, and that’s who ISBN is intended to serve.
This reflects the more general problem that books are complex objects, and we don’t have settled ways of sorting out all the varieties allowed within the concept of the “same book.” Same book? I doubt it!
Still, these numbers from OCLC exhibit more confusion within the ISBN number space than I’d expected.
MINUTES LATER: Folks on a mailing list are wondering if the very high percentage of records with two ISBNs is due to the introduction of 13-digit ISBNs to supplement the initial 10-digit ones.
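If the mailing-list guess is right, a record with two ISBNs may simply carry both forms of one identifier, since every 10-digit ISBN has a 13-digit equivalent. Here is a minimal Python sketch of the standard conversion: prefix 978, keep the first nine digits, recompute the check digit. The function names are my own, and the sample number is a commonly cited textbook example, not something drawn from the OCLC data.

```python
def _clean(isbn: str) -> str:
    """Strip hyphens and spaces and upper-case a trailing 'x'."""
    return isbn.replace("-", "").replace(" ", "").upper()


def isbn10_is_valid(isbn10: str) -> bool:
    """Validate an ISBN-10 with its weighted mod-11 check digit."""
    chars = _clean(isbn10)
    if len(chars) != 10:
        return False
    if not all(c in "0123456789" for c in chars[:9]):
        return False
    if chars[9] not in "0123456789X":
        return False
    values = [int(c) for c in chars[:9]]
    values.append(10 if chars[9] == "X" else int(chars[9]))
    # Weights run 10, 9, ..., 1 across the ten positions.
    total = sum(w * v for w, v in zip(range(10, 0, -1), values))
    return total % 11 == 0


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a valid ISBN-10 to its ISBN-13 equivalent."""
    chars = _clean(isbn10)
    if not isbn10_is_valid(chars):
        raise ValueError(f"not a valid ISBN-10: {isbn10!r}")
    body = "978" + chars[:9]
    # ISBN-13 weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)


def same_identifier(isbn_a: str, isbn_b: str) -> bool:
    """True if one ISBN is just the 13-digit form of the other."""
    a, b = _clean(isbn_a), _clean(isbn_b)
    if len(a) == 10 and len(b) == 13:
        return isbn10_is_valid(a) and isbn10_to_isbn13(a) == b
    if len(a) == 13 and len(b) == 10:
        return same_identifier(b, a)
    return a == b


print(isbn10_to_isbn13("0-306-40615-2"))
```

Running something like `same_identifier` over the two-ISBN records would separate the pairs that are really one identifier written two ways from the pairs that point at different products.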
At the Future Forum conference in Dresden, I had the opportunity to hang out with Ranga Yogeshwar, a well-known television science journalist in Germany. We were deep into conversation at the speakers’ dinner when I mentioned that I work in a library, and he mentioned that his grandfather had been an early library scientist. It turns out that his grandfather was none other than S.R. Ranganathan, the father of library science. Among other things, Ranganathan invented the “Colon Classification System” (worst name ever) that uses facets to enable multiple simultaneous classifications, an idea that really needed computers to be fulfilled. Way ahead of his time.
So, the next day I took the opportunity to stick my phone in Ranga’s face and ask him some intrusive, personal questions about his grandfather:
The curator starts by presenting the engine with a basic set of keywords. CIThread scours the Web for relevant content, much like a search engine does. Then the curator combs through the results to make decisions about what to publish, what to promote and what to throw away.
As those decisions are made, the engine analyzes the content to identify patterns. It then applies that learning to delivering a better quality of source content. Connections to popular content management systems make it possible to automatically publish content to a website and even syndicate it to Twitter and Facebook without leaving the CIThread dashboard.
There’s intelligence on the front end, too. CIThread can also tie in to Web analytics engines to fold audience behavior into its decision-making. For example, it can analyze content that generates a lot of views or clicks and deliver more source material just like it to the curator. All of these factors can be weighted and varied via a dashboard.
I like the idea of providing automated assistance to human curators…
I’m embarrassed to say that I just read Randall Munroe’s fabulous color survey from early May. Readers were asked to supply names for colors. It’s a rich experiment: Naming and discrimination, gender differences, hacking, tagging, spamming, hilariousness. The results also seem to support prototype theory’s idea that we agree on what the “real” (prototypical) colors are, at least within a culture: This is blue, but that one is a variant that needs a modifier in front of it (“light blue”) or for which we use a variant name (“teal”).
Randall writes the webcomic XKCD, of course, which is the Doonesbury of his generation, except while you can imagine Garry Trudeau writing a satiric HBO series, you can’t imagine him running and analyzing a color survey.
(I heard about Randall’s color survey via the Mainstream: Christopher Shea at the Boston Globe blog. Christopher also points to Stephen von Worley’s color map. BTW, that post by Christopher also has a great note about iPad censoring a graphic version of the oft-banned James Joyce’s Ulysses. Anyway, I’ve really got to do a better job keeping up with XKCD.)
Data.gov has announced that it’s making some data sets available as RDF triples so Semantic Webbers can start playing with them. There’s an index of data here. The site says that even though only a relative handful of datasets have been RDF’ed, there are 6.4 billion triples available. They’ve got some examples of RDF-enabled visualizations here and here, and some more as well.
Data.gov also says they’re working with RPI to come up with a proposal for “a new encoding of datasets converted from CSV (and other formats) to RDF” to be presented for worldwide consideration: “We’re looking forward to a design discussion to determine the best scheme for persistent and dereferenceable government URI naming with the international community and the World Wide Web Consortium to promote international standards for persistent government data (and metadata) on the World Wide Web.” This is very cool. A Uniform Resource Identifier points to a resource; it is dereferenceable if there is some protocol for getting information about that resource. So, Data.gov and RPI are putting together a proposal for how government data can be given stable Web addresses that will predictably yield useful information about that data.
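To make the CSV-to-RDF idea concrete, here is a rough Python sketch that turns each cell of a CSV table into one triple in N-Triples syntax. Everything specific in it is an assumption for illustration: the base URI, the `/row/` and `/column/` naming pattern, and the sample values are all made up, and none of it is the encoding Data.gov and RPI are proposing.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI, chosen only for this illustration.
BASE = "http://example.org/dataset/"


def escape_literal(value: str) -> str:
    """Escape the characters N-Triples requires escaping in a string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str, dataset: str, key_column: str) -> list:
    """Emit one triple per non-empty cell: <row> <column> "value" ."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row in reader:
        # The key column names the row, so it becomes part of the subject URI.
        subject = f"<{BASE}{dataset}/row/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            if column is None or column == key_column or not value:
                continue
            predicate = f"<{BASE}{dataset}/column/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample data; the empty cell in row 2 produces no triple.
sample = "id,state,note\n1,Ohio,first row\n2,Utah,\n"
for line in csv_to_ntriples(sample, "demo", "id"):
    print(line)
```

The URIs this produces identify rows and columns, but they only become dereferenceable if a server actually answers at those addresses with information about the data, which is the part the naming proposal is meant to settle.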
When Oprah and Lifehacker, two of the most respected names in Life Advice, both recommend not labeling things “miscellaneous,” a lonely-yet-proud voice must speak up. (I mean me, by the way.)
First, telling us to “banish” the miscellaneous is silly. We have miscellaneous drawers and bins because classification schemes are imperfect. Go ahead and try to follow Oprah’s and Lifehacker’s advice on your kitchen’s miscellaneous drawer. You’ll either have to create so many ridiculous sub-divisions that you won’t remember them, or you’ll have to force objects into categories so thinly related that you can’t remember where you put them. That’s why you have a miscellaneous drawer in the first place.
Second, their complaint is about the use of the miscellaneous category for physical objects. In the digital world, giving objects multiple categorizations and allowing multiple classification schemes (what I mean by the miscellaneous in that book I wrote a couple of years ago) makes more sense than trying to come up with a single, perfect, fits-all-needs, univocal classification system.
Clay Shirky’s masterful talk at the Web 2.0 Expo in NYC last September, “It’s not information overload. It’s filter failure,” makes crucial points and makes them beautifully. [Clay explains in greater detail in this two-part CJR interview: 1, 2]
So I’ve been writing about information overload in the context of our traditional strategy for knowing. Clay traces information overload to the 15th century, but others have taken it back earlier than that, and there’s even a quotation from Seneca (born around 4 BCE) that can be pressed into service: “What is the point of having countless books and libraries whose titles the owner could scarcely read through in his whole lifetime? That mass of books burdens the student without instructing…” I’m sure Clay would agree that if we take “information overload” as meaning the sense that there’s too much for any one individual to know, we can push the date back even further.
The little research I’ve done on the origins of the phrase “information overload” supports Clay’s closing point: Info overload isn’t a problem so much as the water we fishes swim in. When the term was popularized by Alvin Toffler in 1970’s Future Shock, Toffler talked about it as a psychological syndrome that could lead to madness (on a par with sensory overload, which is where the term came from). By the time we hit the late 1980s and early 1990s, people aren’t writing about info overload as a psychological syndrome, but as a cultural fact that we have to deal with. The question became not how we can avoid over-stimulating our informational organs but how we can manage to find the right information in the torrent. So, I think Clay is absolutely spot on.
I do want to push on one of the edges of Clay’s idea, though. Knowledge traditionally has responded to the fact that what-is-to-be-known outstrips our puny brains with the strategy of reducing the size of what has to be known. We divide the world into manageable topics, or we skim the surface. We build canons of what needs to be known. We keep the circle of knowledge quite small, at least relative to all the pretenders to knowledge. All of this of course reflects the limitations of the paper medium we traditionally used for the preservation and communication of knowledge.
The hypothesis of “Too Big to Know” is that in the face of the new technology and the exponentially exponential amount of information it makes available to us, knowledge is adopting a new strategy. Rather than merely filtering (“merely” because we will of course continue to filter), we are also including as much as possible. The new sort of filtering that we do is not always and not merely reductive.
A traditional filter in its strongest sense removes materials: It filters out the penny dreadful novels so that they don’t make it onto the shelves of your local library, or it filters out the crazy letters written in crayon so they don’t make it into your local newspaper. Filtering now does not remove materials. Everything is still a few clicks away. The new filtering reduces the number of clicks for some pages, while leaving everything else the same number of clicks away. Granted, that is an overly optimistic way of putting it: Being the millionth result listed by a Google search makes it many millions of times harder to find that page than the ones that make it onto Google’s front page. Nevertheless, it’s still much, much easier to access that millionth-listed page than it is to access a book that didn’t make it through the publishing system’s editorial filters.
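The difference between the two kinds of filter is easy to see in a toy Python sketch. The item names and quality scores below are invented for illustration, and list position stands in, very crudely, for “number of clicks away.”

```python
def traditional_filter(items, passes):
    """Reductive filter: whatever fails the test is simply gone."""
    return [item for item in items if passes(item)]


def ranking_filter(items, score):
    """Non-reductive filter: everything is kept, only the order changes."""
    return sorted(items, key=score, reverse=True)


# Invented items and scores, purely for illustration.
pages = [
    {"title": "penny dreadful novel", "quality": 2},
    {"title": "careful history", "quality": 9},
    {"title": "letter in crayon", "quality": 1},
    {"title": "solid reference work", "quality": 7},
]

shelf = traditional_filter(pages, lambda p: p["quality"] >= 5)
results = ranking_filter(pages, lambda p: p["quality"])

print([p["title"] for p in shelf])    # the rejected items are unreachable
print([p["title"] for p in results])  # the rejected items are merely last
```

In the first list the crayon letter no longer exists as far as the reader is concerned. In the second it is still there at the bottom, harder to reach but not gone.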
But there’s another crucial sense in which the new filtering technology is not purely reductive. Filters now are often simultaneously additive. For example, blogs act as filters, recommending other pages. But blogs don’t merely sift through the Web and present you with what they find, the way someone curating a collection of books puts the books on a shelf. Blogs contextualize the places they point to, sometimes at great length. That contextualization is a type of filter that adds a great deal of rich information. Further, in many instances, we can see why the filter was applied the way it was. For blogs and other human-written pieces, this is often explained in the contextualization. At Wikipedia, it takes place in the “Talk” pages where people explain why they have removed some information and added other information. And the point of the Top 100 lists and Top Ten Lists that are so popular these days is to generate reams and reams of online controversy.
Thus, many of our new filters reflect the basic change in our knowledge strategy. We are moving from managing the perpetual overload Clay talks about by reducing the amount we have to deal with, to reducing it in ways that simultaneously add to the overload. Merely filtering is not enough, and filtering is no longer a merely reductive activity. The filters themselves are information that is then discussed, shared, and argued about. When we swim through information overload, we’re not swimming in little buckets that result from filters; we are swimming in a sea made bigger by the loquacious filters that are guiding us.
From Chris Csikszentmihalyi, Director of the MIT Center for Future Civic Media:
CALL TO NEWS ORGANIZATIONS
In the response to the earthquake in Haiti, many organizations worked to create sites where people could find one another, or at least information about their loved ones. This excellent idea has been undermined by its success: within 24 hours it became clear that there were too many places where people were putting information, and each site is a silo. The site Haitianquake.com began scraping – mechanically aggregating – the most popular such sites, like http://koneksyon.com and American Red Cross Family Links. As people within the IT community recognized the danger of too many unconnected sites, and Google became interested in helping, they turned their work over to Google, which is now running an embeddable application at: http://haiticrisis.appspot.com/
We recognize that many newspapers have put precious resources into developing a people-finder system. We nonetheless urge them to make their data available to the Google project, and standardize on the Google widget. Doing so will greatly increase the number of successful reunions. Data from the Google site is currently available as dumps in the standard PFIF format on this page, and an API is being developed and licensed through Creative Commons. I am not affiliated with Google – indeed, this is a volunteer initiative by some of their engineers – but this is one case where their reach and capacity can help the most people.
Please feel free to contact me if you have any questions about the reasoning behind this request. Any questions about the widget or its functionality or features are best directed to Google.
Christopher P. Csikszentmihalyi. Director, MIT Center for Future Civic Media