Joho the Blog » [2b2k] Jon Orwant of Google Books

[2b2k] Jon Orwant of Google Books

Jon Orwant is an Engineering Manager at Google, with Google Books under him. He used to be CTO at O’Reilly, and was educated at the MIT Media Lab. He’s giving a talk to Harvard’s librarians about his perspective on how libraries might change, a topic he says puts him out on a limb. Title of his talk: “Deriving the library from first principles.” If we were to start from scratch, would libraries look like today’s? He says no.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Part I: Trends.

He says it’s not controversial that patrons are accessing more info online. Foot traffic to libraries is going down. Library budgets are being squeezed. “Public libraries are definitely feeling the pinch” exactly when people have less discretionary money and thus are spending more time at libraries.

At MIT, Nicholas Negroponte contended in the early 1990s that telephones would switch from wired to wireless, and televisions would go from wireless to wired. “It seems obvious in retrospect.” At that time, Jon was doing his work using a Connection Machine, which consisted of 64K little computers. The wet-bar-sized device he shows provided a whopping 5GB of storage. The Media Lab lost its advantage of being able to provide high-end computers once computing power became widespread. So, the Media Lab had to reinvent itself, to provide value as a physical location.

Is there an analogy to the Negroponte switch of telephone and TV, Jon asks? We used to use the library to search for books and talk about them at home. In the future, we’ll use our computer to search for books, and talk about them at our libraries.

What is the mission of libraries, he asks. Select and preserve info, or disseminate it? Might libraries redefine themselves? But this depends on the type of library.

1. University libraries. U of Michigan moved its academic press into the library system, even though the press is the money-making arm.

2. Research libraries. Harvard’s Countway Medical Library incorporates a lab into it, the Center for Bioinformatics. This puts domain expertise and search experts together. And they put in the Warren Anatomical Museum (AKA Harvard’s Freak Museum). Maybe libraries should replicate this, adopting information-driven departments. The ideal learning environment might be a great professor’s office. That 1:1 instruction isn’t generally tenable, but why is it that the higher the level of education, the fewer books are in the learning environment? I.e., kindergarten classes are filled with books, but grad student classrooms have few.

3. Public libraries. They tend to be big open rooms, which is why you have to be quiet in them. What if the architecture were a series of smaller, specialized rooms? Henry Jenkins said about newspapers, Jon says, that it’s strange that hundreds of reporters cover the Super Bowl, all writing basically the same story; newspapers should differentiate by geography. Might this notion of specialization apply to libraries, reflecting community interests at a more granular level? Too often, public libraries focus on the lowest common denominator, but suppose unusual book collections could rotate like exhibits in museums, with local research experts giving advice and talks. [Turn public libraries into public non-degree based universities?]

Part II: Software architecture

Google Books wants to scan all books. It has done 12M out of the 120M works (which have 174M manifestations: different versions and editions, etc.). About 4B pages, 40+ libraries, 400 languages (“Three in Klingon”). Google Books is in the first stage: Scanning. Second: Scaling. Third: What do we do with all this? 20% are public domain.

He talks a bit about the scanning tech, which tries to correct for the inner curve of spines, keeps marginalia while removing dirt, does OCR, etc. At O’Reilly, the job was to synthesize the elements; at Google, the job is to analyze them. They’re trying to recognize frontispieces, index pages, etc. As a sample of the problem of recognizing italics, he gives: “Copyright is way too long to strike the balance between benefits to the author and the public. The entire raison d’etre of copyright is to strike a balance between benefits to the author and the public. Thus, the optimal copyright term is c(x) = 14(n + 1).” In each of these sentences, italics indicate a different semantic point. Google is trying to algorithmically catch the author’s intent.

Physical proximity is good for low-latency apps, local caching, high-bandwidth communication, and immersive environments. So, maybe we’ll see books as applications (e.g., good for a physics text that lets you play with problems, maybe not so useful for Plato), real-time video connections to others reading the same book, snazzy visualizations, presentation of lots of data in parallel (reviews, related books, commentary, and annotations).

“We’ll be paying a lot more attention to annotations” as a culture. He shows a scan of a Chinese book that includes a fold-out piece that contains an annotation; that page is not a single rectangle. “What could we do with persistent annotations?” What could we do with annotations that have not gone through the peer review process? What if undergrads were able to annotate books in ways that their comments persisted for decades? Not everyone would choose to do this, he notes.

We can do new types of research now. If you want to know what the past tense of “sneak” is: 50 yrs ago people would have said “snuck” but in 50 years it’ll be “sneaked.” There is a trend toward the regularization of verbs (i.e., away from irregular forms) over time, which you can see by examining the corpus of books Google makes available to researchers. Or, you can look at triplets of words and ask what the distinctive trigrams are. E.g., then: oxide of lead, vexation of spirit, a striking proof. Now: lesbian and gay, the power elite, the poor countries. Steve Pinker is going to use the corpus to test the “Great man” theory. E.g., when Newton and Leibniz both invented the calculus, was the calculus in the air? Do a calculus word cloud in multiple languages and test against the word configurations of the time. Usage of the phrases “World War I” and “The Great War” crosses around 1938, but there were some people calling it “WWI” in 1932, which is a good way to discover a new book (wouldn’t you want to read the person who foresaw WWII?). This sort of research is one of the benefits of the Google Books settlement, he says. (He also says that he was both a plaintiff and defendant in the case because as an author, his book was scanned without authorization.)
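To make the trigram and crossover ideas concrete, here is a minimal Python sketch. It is not how Google's corpus tools work; the function names and the per-year counts are invented for illustration, and a real analysis would run over the published n-gram data rather than a toy dictionary.

```python
from collections import Counter


def trigrams(text):
    """Count word triplets in a text (naive whitespace tokenization)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))


def crossover_year(freq_a, freq_b):
    """Return the first year in which phrase B overtakes phrase A.

    freq_a and freq_b map year -> count. Only years present in both
    series are compared. Returns None if B never overtakes A.
    """
    years = sorted(set(freq_a) & set(freq_b))
    for prev, year in zip(years, years[1:]):
        if freq_b[prev] <= freq_a[prev] and freq_b[year] > freq_a[year]:
            return year
    return None


# Made-up counts, chosen only to show the shape of the computation.
great_war = {1930: 90, 1934: 80, 1938: 50, 1942: 20}
world_war_i = {1930: 1, 1934: 10, 1938: 55, 1942: 120}

first_year = crossover_year(great_war, world_war_i)
counts = trigrams("oxide of lead and oxide of lead")
```

With these invented numbers the crossover lands in 1938, matching the year mentioned in the talk only because the toy data was built that way. Finding the early adopters (the people writing “WWI” in 1932) would mean looking at the years before the crossover where the newer phrase's count is small but nonzero.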

The images of all the world’s books are about 100 petabytes. You could put terminals in libraries so anyone can access out-of-print books, and you could let patrons print on demand. “Does that have an impact on collections” and budgets? Once that makes economic sense, then every library will “have” every single book.

How can we design a library for serendipity? The fact that books look different is appealing, Jon says. Maybe a library should buy lots and lots of different e-readers, in different form factors. The library could display info-rich electronic spines (graphics of spines) [Jon doesn't know that this is an idea the Harvard Law Library, with whom I'm working, is working on]. We could each have our own virtual rooms and bookshelves, with books that come through various analytics, including books that people I trust are reading. We could also generalize this by having the bookshelves change if more than one person is in the room; maybe the topics get broader to find shared interests. We could have bookshelves for a community in general. Analytics of multifactor classification (subject, tone, bias, scholarliness, etc.) can increase “deep” serendipity.
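One way to picture the “topics get broader” idea is the rough Python sketch below. Everything in it is an assumption made for illustration: the talk did not describe an algorithm, and the topic hierarchy (`parent`) and the function name are hypothetical. Each person's interests are widened one level at a time until the people in the room share at least one topic.

```python
def shared_topics(interests, parent):
    """Broaden each person's topics until everyone shares at least one.

    interests: list of sets of topic names, one set per person in the room.
    parent: dict mapping a topic to its broader topic (hypothetical hierarchy).
    Returns the set of shared topics, or an empty set if none can be found.
    """
    current = [set(topics) for topics in interests]
    if not current:
        return set()
    while True:
        common = set.intersection(*current)
        if common:
            return common
        # Add the next-broader topic for everything each person already has.
        broadened = [s | {parent[t] for t in s if t in parent} for s in current]
        if broadened == current:
            return set()  # nothing left to broaden, so no overlap exists
        current = broadened


# Hypothetical hierarchy and visitors.
parent = {
    "quantum physics": "physics",
    "organic chemistry": "chemistry",
    "physics": "science",
    "chemistry": "science",
}
alone = shared_topics([{"quantum physics"}], parent)
together = shared_topics([{"quantum physics"}, {"organic chemistry"}], parent)
```

A single visitor keeps a narrow shelf (“quantum physics”), while the two visitors above only meet at “science”, which is the broadening effect described in the talk.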


Q: One of the concerns in the research and univ libraries is the ability to return to the evidence you’ve cited. Having many manifestations (= editions, etc.) lets scholars return. We need permanent ways of getting back to evidence at a particular time. E.g., Census Dept. makes corrections, which means people who ran analyses of the data get different answers afterward.
A: The glib answer: You just need better citation mechanisms. The more sophisticated answer: Anglo-Saxon scholars will hold up a palimpsest. I don’t have an answer, except for a pointer to George Mason conf where they’re trying to come up with a protocol for expressing uncertainty [I think I missed this point -- dw]. What are all the ways to point into a work? You want to think of the work as a container, with all the annotations that come up with it. The ideal container has the text itself, info extracted from it, the programs needed to do the extraction, and the annotations. This raises the issue of the persistence of digital media in general. “We need to get into the mindset of bundling it all together”: PDFs and TIFFs + the programs for reading them. [But don't the programs depend upon operating systems? - dw]
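The container Jon describes could be sketched as a small data structure. This is only an illustration under stated assumptions: the class and field names are made up, and addressing annotations by character offset is one possible way of “pointing into a work”, not something proposed in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    author: str
    start: int  # character offset into the text (assumed addressing scheme)
    end: int
    note: str


@dataclass
class WorkContainer:
    """Bundles a work with what was derived from it and said about it."""
    text: str
    extracted: dict = field(default_factory=dict)    # info pulled from the text
    extractors: dict = field(default_factory=dict)   # name -> program source used
    annotations: list = field(default_factory=list)  # persistent annotations

    def annotate(self, author, start, end, note):
        """Attach an annotation to a span of the text and return it."""
        if not (0 <= start < end <= len(self.text)):
            raise ValueError("annotation span falls outside the text")
        ann = Annotation(author, start, end, note)
        self.annotations.append(ann)
        return ann

    def quoted(self, ann):
        """Return the passage an annotation points at."""
        return self.text[ann.start:ann.end]


work = WorkContainer(text="Call me Ishmael.")
ann = work.annotate("dw", 8, 15, "the narrator names himself")
passage = work.quoted(ann)
```

Character offsets only stay valid for one fixed manifestation of the text, which is the questioner's point: a citation needs to name the edition (or snapshot) as well as the position within it.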

Q: Centralized vs. distributed repository models?
A: It gets into questions of rights. I’d love to see it as distributed to as many places and in as many formats as possible. It shouldn’t just be Google digitizing books. You can get 100 petabytes in a single room, and of course much smaller in the future. There are advantages to keeping things local. But for the in-copyright works, it’ll come down to how comfortable the holders feel that it’s “too annoying” for people to copy what they shouldn’t.
