NOTE on May 23: OCLC has posted corrected numbers. I’ve corrected them in the post below; the changes are mainly fractional. So you can ignore the note immediately below.
NOTE a couple of hours later: OCLC has discovered a problem with the analysis. So please ignore the following post until further notice. Apologies from the management.
Ever since the 1960s, publishers have used ISBNs to identify editions of books. Since the world needs unique ways to refer to unique books, you would think the ISBN would be a splendid solution. Sometimes it is. But there are problems, highlighted in the latest analysis run by OCLC on its database of almost 300 million records.
Number of ISBNs    Percentage of the records
0                  77.71%
2                  18.77%
1                  1.25%
4                  1.44%
3                  0.21%
6                  0.14%
8                  0.04%
5                  0.02%
10                 0.02%
12                 0.01%
So, 78% of OCLC’s humongous collection of book records have no ISBN, and only 1.25% have the single ISBN that God intended.
As Roy Tennant [twitter: royTennant] of OCLC points out (and thanks to Roy for providing these numbers), many works in this collection of records pre-date the 1960s. Even so, the books with multiple ISBNs reflect the weakness of ISBNs as unique identifiers. ISBNs are essentially SKUs to identify a product. The assigning of ISBNs is left up to publishers, and they assign a new one whenever they need to track a book as an inventory item. This does not always match how the public thinks about books. When you want to refer to, say, Moby-Dick, you probably aren’t distinguishing between one with illustrations, a large-print edition, and one with an introduction by the Deadliest Catch guys. But publishers need to make those distinctions, and that’s who ISBN is intended to serve.
This reflects the more general problem that books are complex objects, and we don’t have settled ways of sorting out all the varieties allowed within the concept of the “same book.” Same book? I doubt it!
Still, these numbers from OCLC exhibit more confusion within the ISBN number space than I’d expected.
MINUTES LATER: Folks on a mailing list are wondering if the very high percentage of records with two ISBNs is due to the introduction of 13-digit ISBNs to supplement the initial 10-digit ones.
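That explanation would make sense mechanically: every ISBN-10 maps to exactly one ISBN-13 (the digits prefixed with 978 and a recomputed check digit), so a record that lists both forms of the same identifier shows up as having two ISBNs. A minimal sketch of the conversion, for illustration only:

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to its ISBN-13 equivalent: prefix 978, recompute the check digit."""
    chars = [c for c in isbn10 if c.isdigit() or c in "xX"]
    if len(chars) != 10:
        raise ValueError("expected 10 ISBN characters")
    core = "978" + "".join(chars[:9])        # drop the old ISBN-10 check digit
    weights = [1, 3] * 6                     # alternating 1,3 weights over the 12 digits
    total = sum(int(d) * w for d, w in zip(core, weights))
    check = (10 - total % 10) % 10
    return core + str(check)

# Example: the same edition under both identifier schemes
print(isbn10_to_isbn13("0-306-40615-2"))  # -> 9780306406157
```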
I had a lovely time at the University of Toronto Faculty of Information yesterday afternoon. About twenty of us talked for two hours about library innovation. It reminded me how much I like hanging out with librarians; how eager people are to invent, collaborate, and play; how lucky I am to work in an open space for innovation (the Harvard Library Innovation Lab) with such a talented, creative group; and how much I love Toronto.
Categories: libraries Tagged with: libraries • personal • toronto Date: May 15th, 2013 dw
I’m very proud to announce that the Harvard Library Innovation Lab (which I co-direct) has launched what we think is a useful and appealing way to browse books at scale. This is timed to coincide with the launch today of the Digital Public Library of America. (Congrats, DPLA!!!)
StackLife (nee ShelfLife) shows you a visualization of books on a scrollable shelf, which we turn sideways so you can read the spines. It always shows you books in a context, on the ground that no book stands alone. You can shift the context instantly, so that you can (for example) see a work on a shelf with all the other books classified under any of the categories professional cataloguers have assigned to it.
We also heatmap the books according to various usage metrics (“StackScore”), so you can get a sense of the work’s community relevance.
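To make the idea concrete (this is purely illustrative and is not the actual StackScore formula), a heatmap like this only needs some blend of usage signals squashed into a bounded score, which then maps to a color bucket on the shelf:

```python
def toy_stack_score(checkouts: int, holdings: int, course_reserves: int) -> int:
    """A made-up blend of usage signals scaled to 0-100; not StackLife's real metric."""
    raw = 1.0 * checkouts + 2.0 * holdings + 5.0 * course_reserves
    return min(100, int(raw))  # cap so a few blockbuster titles don't wash out the rest

def heat_bucket(score: int) -> str:
    """Map a score to one of a few color buckets for the shelf view."""
    if score >= 75:
        return "darkest"
    elif score >= 50:
        return "dark"
    elif score >= 25:
        return "light"
    return "lightest"

print(heat_bucket(toy_stack_score(checkouts=12, holdings=30, course_reserves=2)))
```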
There are lots more features, and lots more to come.
We’ve released two versions today.
StackLife DPLA mashes up the books in the Digital Public Library of America’s collection (from the Biodiversity Heritage Library) with books from The Internet Archive’s Open Library and the Hathi Trust. These are all online, accessible books, so you can just click and read them. There are 1.7M books in the StackLife DPLA metacollection. (Development was funded in part by a Sprint grant from the DPLA. Thank you, DPLA!)
StackLife Harvard lets you browse the 12.3M books and other items in the Harvard Library system’s 73 libraries and off-campus repository. This is much less about reading online (unfortunately) than about researching what’s available.
Here are some links:
StackLife DPLA: http://stacklife-dpla.law.harvard.edu
StackLife Harvard: http://stacklife.law.harvard.edu
The DPLA press release: http://library.harvard.edu/stacklife-browse-read-digital
The DPLA version FAQ: http://stacklife-dpla.law.harvard.edu/#faq/
The StackLife team has worked long and hard on this. We’re pretty durn proud:
Annie Cain
Paul Deschner
Kim Dulin
Jeff Goldenson
Matthew Phillips
Caleb Troughton
Neel Smith of Holy Cross is talking about the Homer Multitext project, a “long term project to represent the transmission of the Iliad in digital form.”
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.
He shows the oldest extant manuscript of the Iliad, which includes 10th-century notes. “The medieval scribes create a wonderful hypermedia” work.
“Scholarly annotation starts with citation.” He says we have a good standard: URNs, which can point to, for example, an ISBN. His project uses URNs to refer to texts in a FRBR-like hierarchy [works at various levels of abstraction]. These are semantically rich and machine-actionable. You can Google a URN and get the object. You can put a URN into a URL for direct Web access. You can embed an image into a Web page via its URN [using a service, I believe].
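As a rough illustration of what “machine-actionable” means here, a CTS-style URN of the kind the Homer Multitext uses (urn:cts:namespace:group.work.version:passage) can be split into its FRBR-like levels by any program. The specific URN and field names below are my reading of the scheme, offered as a sketch rather than the project’s documented API:

```python
from typing import NamedTuple

class CtsUrn(NamedTuple):
    namespace: str   # e.g., "greekLit"
    text_group: str  # e.g., "tlg0012" (Homer)
    work: str        # e.g., "tlg001" (the Iliad)
    version: str     # a particular edition/manuscript, empty at the work level
    passage: str     # e.g., "1.1" for Book 1, line 1

def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its hierarchical levels. Assumes the urn:cts: scheme."""
    scheme, cts, namespace, work_part, passage = urn.split(":")
    assert (scheme, cts) == ("urn", "cts")
    pieces = work_part.split(".")
    group, work = pieces[0], pieces[1]
    version = pieces[2] if len(pieces) > 2 else ""
    return CtsUrn(namespace, group, work, version, passage)

print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1"))
```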
An annotation is an association. In a scholarly annotation, it’s associated with a citable entity. [He shows some great examples of the possibilities of cross-linking and associating.]
The metadata is expressed as RDF triples. Within the Homer project, they’re inductively building up a schema of the complete graph [network of connections]. For end users, this means you can see everything associated with a particular URN. Building a facsimile browser, for example, becomes straightforward, mainly requiring the application of XSL and CSS to style it.
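Here is a minimal sketch of what such a triple store could look like, using the rdflib Python library. The namespace, predicate names, and CITE-style URNs are invented for illustration and are not the project’s actual vocabulary:

```python
from rdflib import Graph, Literal, Namespace, URIRef

g = Graph()
HMT = Namespace("http://example.org/hmt/")  # hypothetical vocabulary, not the project's

passage = URIRef("urn:cts:greekLit:tlg0012.tlg001.msA:1.1")  # a citable text passage
image = URIRef("urn:cite:example:images.img123")             # made-up image URN
note = URIRef("urn:cite:example:scholia.note1")              # made-up scholion URN

# Annotations as associations between citable entities
g.add((note, HMT.commentsOn, passage))
g.add((passage, HMT.appearsOn, image))
g.add((note, HMT.text, Literal("A marginal comment on Iliad 1.1")))

# "See everything associated with a particular URN"
for s, p, o in g.triples((None, None, passage)):
    print(s, p)
for s, p, o in g.triples((passage, None, None)):
    print(p, o)
```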
Another example: Mise en page: automated layout analysis. This in-progress project analyzes the layout of annotation info on the Homeric pages.
The Digital Public Library of America is looking for an executive director. This is an incredible opportunity to make a difference.
I think it’d be fantastic if this person were to come out of the large, community-based Web collaboration space, but there are many other ways for the DPLA to go right. The search committee is pretty fabulous, so I have confidence that this is going to be an amazing hire.
The DPLA team gave a presentation at Berkman yesterday, and has been showing some initial work, including a collaboration with Europeana and wireframes of a front page. It’s looking very good for the April launch date.
Our little group, the Harvard Library Innovation Lab, is working on a visual browser for books within the DPLA collection, so we’re pretty excited.
Categories: dpla, libraries Tagged with: dpla • libraries Date: December 19th, 2012 dw
Paul Deschner and I had a fascinating conversation yesterday with Jeffrey Wallman, head of the Tibetan Buddhist Resource Center, about perhaps getting his group’s metadata to interoperate with the library metadata we’ve been gathering. The TBRC has a fantastic collection of Tibetan books. So we were talking about the schemas we use — a schema being the set of slots you create for the data you capture. For example, if you’re gathering information about books, you’d have a schema that has slots for title, author, date, publisher, etc. Depending on your needs, you might also include slots for whether there are color illustrations, whether the original cover is still on it, and whether anyone has underlined any passages. It turns out that the Tibetan concept of a book is quite a bit different from the West’s, which raises interesting questions about how to capture and express that data in ways that can be usefully mashed up.
But it was when we moved on to talking about our author schemas that Jeffrey listed one type of metadata that I would never, ever have thought to include in a schema: reincarnation. It is important for Tibetans to know that Author A is a reincarnation of Author B. And I can see why that would be a crucial bit of information.
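For illustration, here is a toy author schema with that extra slot. The field names are mine, invented for the example, and are not TBRC’s schema or ours:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AuthorRecord:
    """A toy author schema; the slots are illustrative, not an actual catalog schema."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    works: List[str] = field(default_factory=list)
    # The slot I would never have thought to include: a pointer to the author
    # this person is held to be a reincarnation of.
    reincarnation_of: Optional[str] = None  # e.g., an identifier for Author B

# Author A recorded as a reincarnation of Author B
author_b = AuthorRecord(name="Author B")
author_a = AuthorRecord(name="Author A", reincarnation_of="Author B")
```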
So, let this be a lesson: anyone who tries to anticipate all metadata needs is destined to be surprised, sometimes delightfully.
Library Journal just posted my article “Library as Platform.” It’s likely to show up in their print version in October.
It argues that libraries ought to think of themselves not as portals but as open platforms that give access to all the information and metadata they can, through human-readable and computer-readable forms.
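As a minimal sketch of what “human-readable and computer-readable forms” can mean in practice, the same catalog record can be served both ways so that people can read it and other apps can build on it. The record fields here are invented for illustration:

```python
import json

record = {
    "id": "example:12345",  # invented identifier
    "title": "Moby-Dick",
    "creator": "Herman Melville",
    "subjects": ["Whaling", "Sea stories"],
}

def as_html(rec: dict) -> str:
    """Human-readable view of a record."""
    return f"<h1>{rec['title']}</h1><p>by {rec['creator']}</p>"

def as_json(rec: dict) -> str:
    """Machine-readable view of the same record, for other applications to reuse."""
    return json.dumps(rec, indent=2)

print(as_html(record))
print(as_json(record))
```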
Categories: libraries, too big to know Tagged with: 2b2k • libraries Date: September 5th, 2012 dw
[This article is also posted at Digital Scholarship@Harvard.]
Marc Parry has an excellent article at the Chronicle of Higher Ed about using crowdsourcing to make archives more digitally useful:
Many people have taken part in crowdsourced science research, volunteering to classify galaxies, fold proteins, or transcribe old weather information from wartime ship logs for use in climate modeling. These days humanists are increasingly throwing open the digital gates, too. Civil War-era diaries, historical menus, the papers of the English philosopher Jeremy Bentham—all have been made available to volunteer transcribers in recent years. In January the National Archives released its own cache of documents to the crowd via its Citizen Archivist Dashboard, a collection that includes letters to a Civil War spy, suffrage petitions, and fugitive-slave case files.
Marc cites an article [full text] in Literary & Linguistic Computing that found that team members could have completed the transcription of works by Jeremy Bentham faster if they had devoted themselves to that task instead of managing the crowd of volunteer transcribers. Here are some more details about the project and its negative finding, based on the article in L&LC.
The project was supported by a grant of £262,673 from the Arts and Humanities Research Council, for 12 months, which included the cost of digitizing the material and creating the transcription tools. The end result was text marked up with TEI-compliant XML that can be easily interpreted and rendered by other apps.
During a six-month period, 1,207 volunteers registered, who together transcribed 1,009 manuscripts. 21% of those registered users actually did some transcribing. 2.7% of the transcribers produced 70% of all the transcribed manuscripts. (These numbers refer to the period before the New York Times publicized the project.)
Of the manuscripts transcribed, 56% were “deemed to be complete.” But the team was quite happy with the progress the volunteers made:
Over the testing period as a whole, volunteers transcribed an average of thirty-five manuscripts each week; if this rate were to be maintained, then 1,820 transcripts would be produced every twelve months. Taking Bentham’s difficult handwriting, the complexity and length of the manuscripts, and the text-encoding into consideration, the volume of work carried out by Transcribe Bentham volunteers is quite remarkable
Still, as Marc points out, two Research Associates spent considerable time moderating the volunteers and providing the quality control required before certifying a document as done. The L&LC article estimates that RAs could have transcribed 400 transcripts per month, 2.5x faster than the pace of the volunteers. But the volunteers got better as they gained experience, and improvements to the transcription software might make quality control less of an issue.
The L&LC article suggests two additional reasons why the project might be considered a success. First, it generated lots of publicity about the Bentham collection. Second, “no funding body would ever provide a grant for mere transcription alone.” But both of these reasons depend upon crowdsourcing being a novelty. At some point, it will not be.
Based on the Bentham project’s experience, it seems to me there are a few plausible possibilities for crowdsourcing transcription to become practical: First, as the article notes, if the project had continued, the volunteers might have gotten substantially more productive and more accurate. Second, better software might drive down the need for extensive moderation, as the article suggests. Third, there may be a better way to structure the crowd’s participation. For example, it might be practical to use Amazon Mechanical Turk to pay the crowd to do two or three independent passes over the content, which can then be compared for accuracy. Fourth, algorithmic transcription might get good enough that there’s less for humans to do. Fifth, someone might invent something incredibly clever that increases the accuracy of the crowdsourced transcriptions. In fact, someone already has: reCAPTCHA transcribes tens of millions of words every day. So you never know what our clever species will come up with.
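The third possibility (independent passes compared against each other) is easy to sketch: a similarity ratio between two passes can flag pages that disagree enough to need a human moderator. This is an illustration of the idea, not the Bentham project’s workflow:

```python
from difflib import SequenceMatcher

def agreement(pass_a: str, pass_b: str) -> float:
    """Similarity ratio between two independent transcriptions of the same page."""
    return SequenceMatcher(None, pass_a, pass_b).ratio()

def needs_review(pass_a: str, pass_b: str, threshold: float = 0.95) -> bool:
    """Flag a page for a human moderator when the passes diverge too much."""
    return agreement(pass_a, pass_b) < threshold

a = "the greatest happiness of the greatest number"
b = "the greatest happines of the greatest number"  # one transcriber's slip
print(agreement(a, b), needs_review(a, b))
```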
For now, though, the results of the Bentham project cannot be encouraging for those looking for a pragmatic way to generate high-quality transcriptions rapidly.
Categories: libraries, too big to know Tagged with: 2b2k • crowdsourcing • transcription Date: September 4th, 2012 dw
I suspect there’s a lot of truth in Richard MacManus’ post at ReadWriteWeb about where Web publishing is going. In particular, I think the growth of topic streams is close to inevitable, whether it occurs via Branch + Medium (and coming from Ev Williams, I suspect that at the very least they’ll give Web culture a very heavy nudge) and/or through other implementations.
Richard cites two sites for this insight: Anil Dash and Joshua Benton at the Nieman Journalism Lab. Excellent posts. But I want to throw in a structural reason why topics are on the rise: authors don’t scale.
It is certainly the case that the Web has removed the hold the old regime had over who got to publish. To a lesser but still hugely significant extent, the Web has loosened the hold the old regime had on who among the published gets attention; traditional publishers can still drive views via traditional marketing channels, but tons more authors/creators are coming to light outside of those channels. Further, the busting up of mass culture into self-forming networks of interest means that a far wider range of authors can be known to groups that care about them and their topics. Nevertheless, there is a limit within any one social network — and within any one human brain — to how many authors one can be emotionally committed to.
There will always be authors who are read because readers have bonded with them through the authors’ work. And the Web has enlarged that pool of authors by enabling social groups to find their own set, even if many authors’ fame is localized within particular groups. But there are only so many authors you can love, and only so many blogs you can visit in a day.
Topics, on the other hand, are a natural way to handle the newly scaled web of creators. Topics are defined as the ideas we’re interested in, so, yes, we’re interested in them! They also provide a very useful way of faceting through the aggregated web of creators — slicing through the universe of authors to pull in what’s interesting and relevant to the topic. There may be only so many topics you can be interested in (at least when topics get formalized, because there’s no limit to the things our curiosity pulls us toward), but within a topic, you can pull in many more authors, many of whom will be previously unknown and most of whose names will go by unnoticed.
I would guess that we will forever see a dialectic between topics and authors, in which a topic brings an author to our attention to whom we then commit, and an author introduces a topic to which we then subscribe. But we’ve spent the past 15 years scaling authorship. We’re not done yet, but it’s certainly past time for progress in scaling topics.
Categories: blogs, culture, libraries Tagged with: blogs • fame • writing Date: August 16th, 2012 dw
The Berkman Center’s David O’Brien, Urs Gasser, and John Palfrey have just posted a 29-page “briefing paper” on the various models and licenses by which libraries are providing access to e-books.
It’s not just facts ‘n’ stats by any means, but here are some anyway:
“According to the 2011 Library Journal E-Book Survey, 82% of libraries currently offer access to e-books, which reflects an increase of 10 percentage points from 2010. … Libraries maintain an average of 4,350 e-book copies in a collection.”
“[T]he publisher-to-library market across all formats and all libraries (e.g., private, public, governmental, academic, research, etc.) is approximately $1.9B; of this, the market for public libraries is approximately $850M”
92% of libraries use OverDrive as their e-book dealer
Of the major publishers, only Random House allows unrestricted lending of e-books.
I found the section on business models to be particularly clarifying.
Categories: copyright, libraries Tagged with: copyright • e-books • ebooks • libraries Date: July 30th, 2012 dw