Joho the Blog
An Entry from the Archives

October 23, 2007

Berkman lunch: Aaron Swartz on Open Library

Aaron Swartz is giving a Berkman talk on the Open Library project. [As always, I'm typing quickly, missing stuff, getting things wrong. You can hear the whole thing at Media Berkman.]

The basic idea is to give each book a Web page that collects all the information about that book. Books have never had "a first class place on the web." They've been distributed across publishers' Web sites, etc.

The book pages are a "structured wiki." Wikipedia lacks the structure required to let computers access it. So, the OL wiki page has separate fields for all of the metadata about it. E.g., click on the author's name and you get a list of all the books the author has written.
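To make the "structured wiki" idea concrete, here is a minimal sketch of my own, not OL's actual schema or code. It assumes a page is a record with named fields, so that a program can build the "click on the author" listing directly. The names `BookPage` and `index_by_author`, and the sample titles and authors, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BookPage:
    """One page per book, with metadata held in separate fields (hypothetical schema)."""
    title: str
    authors: list[str]
    # Free-form extra fields, standing in for the rest of the metadata.
    extra: dict[str, str] = field(default_factory=dict)


def index_by_author(pages):
    """Map each author name to the titles of every page that lists that author."""
    index = defaultdict(list)
    for page in pages:
        for author in page.authors:
            index[author].append(page.title)
    return dict(index)


# Placeholder data, not real catalog records.
pages = [
    BookPage("Example Book One", ["A. Author"]),
    BookPage("Example Book Two", ["A. Author", "B. Writer"]),
]
books_by_author = index_by_author(pages)
```

Because the author is its own field rather than a phrase buried in free text, the lookup needs no parsing of the page body.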

It has to be really open, Aaron says. "This is something that has to be a collaboration among a lot of different people." They've brought in publishers, reviews, authors, etc. It's all available for free, for download or reuse. Anyone can use it.

When books are out of copyright, the OL brings in the full text, when available. But that raises issues about how people want to read books online, he says.

OL also wants to be able to point people to libraries that have copies of books. There are "Buy, borrow or download" options for every book (when possible).

Readers can review books on the site.

The first thing librarians argued about when they saw OL was what subject classification system to use. "We don't have to choose on the Internet. We can store all the category systems and let people choose which ones they want." Likewise with all the different identifiers, e.g., ISBN, OCLC numbers, OL identifiers. ("We have to make our own identifier system because we're going to have more books.")
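A rough sketch of what "we don't have to choose" could look like in data terms, assuming each entry simply keeps a mapping from scheme name to value. This is an illustration, not OL's design. The scheme keys (`isbn`, `oclc`, `ol`, `lcsh`, `tags`), the class and function names, and all the identifier values are placeholders I made up.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    title: str
    # Identifier scheme name -> identifier value; any number of schemes can coexist.
    identifiers: dict[str, str] = field(default_factory=dict)
    # Classification system name -> subject headings assigned under that system.
    subjects: dict[str, list[str]] = field(default_factory=dict)


def find_by_identifier(entries, scheme, value):
    """Return the first entry whose identifier under `scheme` equals `value`, else None."""
    for entry in entries:
        if entry.identifiers.get(scheme) == value:
            return entry
    return None


# Placeholder values only; none of these are real identifiers.
entry = CatalogEntry(
    title="Example Book",
    identifiers={"isbn": "0000000000", "oclc": "000000", "ol": "OL-EXAMPLE-1"},
    subjects={"lcsh": ["Libraries"], "tags": ["books", "metadata"]},
)
found = find_by_identifier([entry], "oclc", "000000")
```

Adding another category system or identifier scheme is then just another key, and readers can pick whichever one they prefer to browse by.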

FRBRization means connecting physical books to all the different abstractions, e.g., print runs, editions, translations, etc. The library world has focused primarily on the physical books on the shelves. "We're going to have to come up with new ways of expressing the relationships," including allowing people to create new relationships, e.g., this book is based on that one, this book refutes that one, this one replaces that one.
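One way to picture open-ended relationships is as labeled links between books, where the label is ordinary data that anyone can coin. The sketch below assumes exactly that and nothing more; it is not how OL stores relationships. `RelationGraph`, the relation names, and the "Book A" style titles are all hypothetical.

```python
from collections import defaultdict


class RelationGraph:
    """Stores (source, relation) -> set of targets; relation names are free-form strings."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self._edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Return the targets linked from `source` by `relation`, sorted for stable output."""
        return sorted(self._edges.get((source, relation), set()))


graph = RelationGraph()
# Library-style relationships.
graph.relate("Book A (2nd edition)", "edition_of", "Book A")
graph.relate("Book A (French)", "translation_of", "Book A")
# A relationship type a reader might invent later.
graph.relate("Book B", "refutes", "Book A")
```

Since the relation is just a string, a new kind such as "replaces" needs no schema change. The cost is that nothing stops two people from naming the same relationship differently.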

They'd like to be able to do print on demand, and mail you a physical copy. Also scan on demand: You pay some money and someone goes and scans it.

Amazon is doing something similar to OL. But Amazon is trying to sell you stuff and doesn't have good info about books that are out of print. Google Books has very few community features. And there's WorldCat from OCLC, but their business model depends on selling information. OL wants to be a public group available to everyone.

Q: English language only?
A: Right now we're English only but internationalization is a huge part of this. We want to get summaries in multiple languages as well as

Q: (terry martin - law school librarian) Journals?
A: Serials are the next task after this. Serials are more complex. They're in vast sets over long periods of time.

Q: (wendy) Fuzzy connections? Is West Side Story an adaptation of Romeo and Juliet?
A: Library systems are generally binary. We have lots of ways of connecting books but we haven't really done anything fuzzy.

Q: User-generated categories?
A: Sure. Tagging.

Q: (jpalfrey) We'd love to hear what you say about how a huge library, such as Harvard Law School Library could contribute...

Aaron now talks about the current status of the project. The software is working well, he says. They worried about it because it combines a database and a wiki in a new way. They have about 10 million catalog records, including 6M from the Library of Congress and 5M from U of NC. They have about 400,000 full text copies, mainly from the Internet Archive. Publishers have been good about providing info. They're looking for collections of reviews. Publishing on-demand works well; they have machines that print and assemble books in about 5 mins. They're going to repopulate the New Orleans public library with the 400,000 books the OL has. OL wants more data. Also, they need more programmers. "If you love books, we'd love your help soon curating and annotating them."

Q: (sj klein) Interlibrary loan for books in copyright?
A: We want to do digital interlibrary loans. We scan a copy and send you the pdf. Some publishers seem ok with it. Some are going to go ahead with it, with us as their partner, for books you can't get in a bookstore but not yet out of copyright.

Q: (gene koo) The publishers are ok with it but the non-profit book association has problems with it?
A: For publishers, it's another way of promoting their books. They have ONIX feeds in XML that promote their books. Libraries have been much more difficult, primarily because of the complicated bureaucracy and concerns about legal issues. It's been a long hard slog to persuade them to give us their records. Can any librarians here give us advice?

Q: International?
A: We're working on several countries. We know people in India. We're looking all the time for people who can help us with it.

Q: Are you working with delicious library, etc., to see if they can contribute?
A: We've been working mainly with LibraryThing.com. Delicious Library, etc., generally aggregate existing library records.

Q: What are you doing to reach the social tipping point?
A: The plan is to do it in two phases. First, get the data into the right format. Second, we need to bring people in, getting them to contribute. We think that a lot will be pulled in through Google.

Q: (oliver goodenough) Money?
A: Mainly funded by the Internet Archive. We have a grant from California. We hope that long-term it will be funded through affiliate fees and some scanning on demand fees.

Q: What is the glue? I don't see a unique ID...
A: Working on it.

Q: (me) FRBR is pretty structured. But the number of ways we might want to connect things is open ended. How are you going to figure out the right way to have structured vs unstructured?
A: We'll start with something. We'll pick the ones we like. Then we hope the user community will emerge and figure out the right ways to categorize and connect.

Q: (tim spalding - librarything) Tagging allows for multiple categorizations and relationships. E.g., at librarything we got pressure to include more choices under gender. How to resolve?
A: Tough problem.
A: (terry martin) Some data is unambiguous. Author names should be unambiguous.
A: (aaron) It'd be good to have a shared point of view, as at Wikipedia.

Q: (sj) Are you hotlinking to any databases? I.e., not importing but doing calls.
A: When you have 10M records, you have to do the import. For price records, we'll do live queries.

Q: Frequently, Wikipedia will put in a note to clarify ambiguous categorizations, e.g., a gender categorization that isn't right. But OL is more constrained.
A: From the beginning we've faced the tension between reusable data and flexibility. Our compromise is that things are structured but can be changed on the fly for an individual entry or class of entries. The hope is that people don't change the names of the fields so the database remains reliable.

Q: (terry martin) Greg Crane, 25 yrs ago you did something like this for a closed domain. Would you do it this way now?
A: (Greg) People don't care about books. They care about a poem or a chapter. Most of the world's expertise is distributed. How to take advantage of the distributed labor. Tricky question. Not just a means but an end. Wikipedia is the dog and the academy is the tail. How do you integrate the two? And it's not books, it's objects. E.g., we're dealing with the European museum classification system. The general issue is how you add more structure within the book.
A: (aaron) That's the hope. And it certainly comes up with journal articles, and songs where you want to point to a song within an album.
A: (greg) The important thing about what you're doing is that it's open.

Q: (sj) What about unpublished works?
A: You can scan them and upload the metadata. There's a bit of question about what belongs in the OL library, but we're not in a position to kick things out. Maybe we'll have metadata indicating that it's not a "real" book.
A: (oliver) This could become a self-publishing system.

Q: (me) And then doesn't it get spammed as people link their self-published book to existing books?
A: It's the Internet. Everything is spammed. If it happens, there will be spam fighters.

Q: Why won't OCLC give you the data?
A: We'd take it in any form. We'd be willing to pay. Getting through the library bureaucracy is difficult...
A: (terry) You need to find the right person at OCLC.
A: We've talked with them at a high level and they won't give us any information. Too bad since they're a non-profit. Library records are not copyrightable. OCLC contractually binds libraries.

Q: (tim) The greatest thing about OL is that it's an OCLC killer. Libraries shouldn't pay for it. Why not just explicitly say that the enormous value is that libraries won't have to pay for cataloging records.
A: (librarian) Who's going to create the records?
A: They're created already. We just need to get a couple of libraries to provide their collections.

Q: (sj) OCLC culls and curates. OL will need this.
A: I'd love to talk about this with the OCLC more. Their mission is the same as ours, but they have this enormous revenue stream from the records. They've gotten more open maybe partially in response to us.

Q: Why not just give OL the records?
A: (terry) Because we have them from OCLC and we're contractually bound.
A: There's an exemption for providing them to non-profits.
A: (terry) Hmm. Maybe. It includes lots of journal records. But where does it take us? Do you have out of copyright books? I'm not particularly interested in promoting in-print commercial books.
A: Yes. Publishers are happy to hand over in-print data. The struggle is getting out of print books. Everyone at the project is more interested in out of print books. We want to pull people from the latest, hottest thing to the older and more interesting books. We're happy to link to already scanned collections.

Q: Even if contracts allow you to distribute your records, wouldn't that annoy OCLC?
A: (terry) Nah.

Q: (sjklein) What happened to Wikicat?
A: It seems kind of dead.

Q: How do you plan on promoting it once you open it up?
A: We want to get ranked highly in Google. We're also talking about a partnership with Wikipedia. Right now, citing a book in Wikipedia is complex. We're working on letting you just search at OL and it populates the record.

Q: You will have solved the age old problem of where the ISBN number points to.

Q: (me) What do you need to succeed?
A: More data. More people contributing. More book lovers, like at LibraryThing.com. And a few more programmers. [Tags: everything_is_miscellaneous libraries taxonomy books categorization_metadata oclc isbn ]

Posted by D. Weinberger at October 23, 2007 03:26 PM


Comments

Good meeting f2f, David. I'm pretty excited by Open Library, from an end-user perspective, anyway. The librarians at the meeting seemed a little leery.

Posted by: Josh Glenn | October 23, 2007 08:16 PM


awesome.

hoping they get connected with the Distributed Proofreaders project, which is another source of data and metadata and scans.

Posted by: Edward Vielmetti | October 24, 2007 01:45 AM


Awesome project. I've thought for a while that there needs to be an equivalent project for audio and video. Something that documents, collates and provides download links for every bit of audio and video that's ever been recorded. Both areas have content that is on media that is self destructing, that is out of print and which is falling out of copyright. It needs to be preserved and made available again.

Posted by: Julian Bond | October 24, 2007 02:47 AM


I won't argue either way about the closed-ness of OCLC's records as an aggregate set, but individually, libraries can share their records. see:

http://www.oclc.org/support/documentation/worldcat/records/guidelines/default.htm

Posted by: K.G. Schneider | October 24, 2007 09:17 PM


Oct. 3, 2007:

Me: Aaron, we should talk. OCLC wants to see how it can participate in OL.

Aaron: I'd love to do a call sometime; unfortunately I need to pack for a trip right now.

Posted by: eric hellman | October 25, 2007 04:15 PM

