Joho the Blog » too big to know

March 29, 2010

[2b2k] Jon Orwant of Google Books

Jon Orwant is an Engineering Manager at Google, with Google Books under him. He used to be CTO at O’Reilly and was educated at the MIT Media Lab. He’s giving a talk to Harvard’s librarians about his perspective on how libraries might change, a topic he says puts him out on a limb. The title of his talk: “Deriving the library from first principles.” If we were to design libraries from scratch, would they look like today’s? He says no.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Part I: Trends.

He says it’s not controversial that patrons are accessing more info online. Foot traffic to libraries is going down. Library budgets are being squeezed. “Public libraries are definitely feeling the pinch” exactly when people have less discretionary money and thus are spending more time at libraries.

At MIT, Nicholas Negroponte contended in the early 1990s that telephones would switch from wired to wireless, and televisions from wireless to wired. “It seems obvious in retrospect.” At that time, Jon was doing his work on a Connection Machine, which consisted of 64K little processors. The wet-bar-sized device he shows provided a whopping 5 GB of storage. The Media Lab lost its advantage of being able to provide high-end computers once computing power became widespread. So, the Media Lab had to reinvent itself, to provide value as a physical location.

Is there an analogy to the Negroponte switch of telephone and TV, Jon asks? We used to use the library to search for books and talk about them at home. In the future, we’ll use our computer to search for books, and talk about them at our libraries.

What is the mission of libraries, he asks: to select and preserve info, or to disseminate it? Might libraries redefine themselves? But this depends on the type of library.

1. University libraries. U of Michigan moved its academic press into the library system, even though the press is the money-making arm.

2. Research libraries. Harvard’s Countway Medical Library incorporates a lab, the Center for Bioinformatics. This puts domain experts and search experts together. And they put in the Warren Anatomical Museum (AKA Harvard’s Freak Museum). Maybe libraries should replicate this, adopting information-driven departments. The ideal learning environment might be a great professor’s office. That 1:1 instruction isn’t generally tenable, but why is it that the higher the level of education, the fewer books are in the learning environment? I.e., kindergarten classrooms are filled with books, but grad student classrooms have few.

3. Public libraries. They tend to be big open rooms, which is why you have to be quiet in them. What if the architecture were a series of smaller, specialized rooms? Henry Jenkins said about newspapers, Jon says, that it’s strange that hundreds of reporters cover the Superbowl, all writing basically the same story; newspapers should differentiate by geography. Might this notion of specialization apply to libraries, reflecting community interests at a more granular level? Too often, public libraries focus on the lowest common denominator, but suppose unusual book collections could rotate like exhibits in museums, with local research experts giving advice and talks. [Turn public libraries into public non-degree-based universities?]

Part II: Software architecture

Google Books wants to scan all books. It has done 12M out of the world’s roughly 120M works (which exist in 174M manifestations — different versions, editions, etc.). That’s about 4B pages, from 40+ libraries, in 400 languages (“Three in Klingon”). Google Books is in the first stage: scanning. Second: scaling. Third: what do we do with all this? 20% of the books are public domain.

He talks a bit about the scanning tech, which tries to correct for the inner curve of spines, keeps marginalia while removing dirt, does OCR, etc. At O’Reilly, the job was to synthesize the elements; at Google, the job is to analyze them. They’re trying to recognize frontispieces, index pages, etc. As a sample of the problem of recognizing italics, he gives three sentences: “Copyright is way too long to strike the balance between benefits to the author and the public. The entire raison d’être of copyright is to strike a balance between benefits to the author and the public. Thus, the optimal copyright term is c(x) = 14(n + 1).” In each, the italics carry a different semantic load (emphasis, a foreign phrase, a formula). Google is trying to algorithmically recover the author’s intent.

Physical proximity is good for low-latency apps, local caching, high-bandwidth communication, and immersive environments. So, maybe we’ll see books as applications (e.g., good for a physics text that lets you play with problems, maybe not so useful for Plato), real-time video connections to others reading the same book, snazzy visualizations, and presentation of lots of data in parallel (reviews, related books, commentary, and annotations).

“We’ll be paying a lot more attention to annotations” as a culture. He shows a scan of a Chinese book that includes a fold-out piece that contains an annotation; that page is not a single rectangle. “What could we do with persistent annotations?” What could we do with annotations that have not gone through the peer review process? What if undergrads were able to annotate books in ways that their comments persisted for decades? Not everyone would choose to do this, he notes.

We can do new types of research now. If you want to know what the past tense of “sneak” is: 50 years ago people would have said “snuck,” but in 50 years it’ll be “sneaked.” You can see a trend toward the regularization of verbs (i.e., away from irregular forms) over time by examining the corpus of books Google makes available to researchers. Or you can look at triplets of words and ask which trigrams are distinctive of an era. E.g., it was: oxide of lead, vexation of spirit, a striking proof. Now: lesbian and gay, the power elite, the poor countries. Steven Pinker is going to use the corpus to test the “great man” theory. E.g., when Newton and Leibniz both invented the calculus, was the calculus in the air? Do a calculus word cloud in multiple languages and test it against the word configurations of the time. The usage of the phrases “World War I” and “The Great War” crosses around 1938, but some people were calling it “WWI” as early as 1932, which is a good way to discover a new book (wouldn’t you want to read the person who foresaw WWII?). This sort of research is one of the benefits of the Google Books settlement, he says. (He also says that he was both a plaintiff and a defendant in the case because, as an author, his book was scanned without authorization.)
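
To make the trigram idea concrete, here’s a minimal sketch (my own toy version, emphatically not Google’s pipeline) of how one might surface the trigrams distinctive of one era relative to another. The tokenizer, the add-one smoothing, and the log-odds scoring are all my assumptions:

    from collections import Counter
    import math
    import re

    def trigrams(text):
        # Crude tokenizer; a real pipeline would handle OCR noise, hyphenation, etc.
        words = re.findall(r"[a-z']+", text.lower())
        return zip(words, words[1:], words[2:])

    def trigram_counts(texts):
        counts = Counter()
        for text in texts:
            counts.update(trigrams(text))
        return counts

    def distinctive(era_a_texts, era_b_texts, min_count=5, top=10):
        """Trigrams whose per-token frequency is highest in era A relative to era B."""
        a = trigram_counts(era_a_texts)
        b = trigram_counts(era_b_texts)
        total_a, total_b = sum(a.values()), sum(b.values())
        scores = {}
        for tri, n in a.items():
            if n < min_count:
                continue
            p_a = n / total_a
            p_b = (b.get(tri, 0) + 1) / (total_b + 1)  # add-one smoothing
            scores[tri] = math.log(p_a / p_b)
        return sorted(scores, key=scores.get, reverse=True)[:top]

Run against hypothetical century-sized slices of the corpus (books_1900s, books_2000s), distinctive(books_1900s, books_2000s) should surface the “oxide of lead” sort of trigram, and the reverse call the “lesbian and gay” sort.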

The images of all the world’s books come to about 100 petabytes. Imagine putting terminals in libraries so anyone can access out-of-print books, and letting patrons print on demand. “Does that have an impact on collections” and budgets? Once that makes economic sense, every library will “have” every single book.
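
For a sense of scale, here’s my own back-of-envelope check of that 100-petabyte figure; the per-page image size is my guess, not a number Jon gave:

    # ~120M works at ~333 pages each (4B pages across the 12M books scanned so far),
    # assuming roughly 2.5 MB per compressed page image:
    works = 120e6
    pages_per_work = 4e9 / 12e6              # ~333
    mb_per_page = 2.5                        # assumption
    petabytes = works * pages_per_work * mb_per_page / 1e9  # MB -> PB
    print(f"~{petabytes:.0f} PB")            # ~100 PB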

How can we design a library for serendipity? The fact that books look different from one another is appealing, Jon says. Maybe a library should buy lots and lots of different e-readers, in different form factors. The library could display info-rich electronic spines (graphics of spines). [Jon doesn’t know that this is an idea the Harvard Law Library, with whom I’m working, is working on.] We could each have our own virtual rooms and bookshelves, with books chosen by various analytics, including books that people I trust are reading. We could also generalize this by having the bookshelves change if more than one person is in the room; maybe the topics get broader to find shared interests. We could have bookshelves for a community in general. Analytics over multifactor classification (subject, tone, bias, scholarliness, etc.) can increase “deep” serendipity.
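
One way to operationalize that kind of deep serendipity (this is my sketch, not anything Jon described): score candidate books so they stay close to your shelf on some classification axes while deliberately stretching away on others.

    from math import dist  # Python 3.8+

    def serendipitous(shelf, candidates, match_axes, stretch_axes, k=5):
        # Each book is a dict mapping an axis (subject, tone, bias,
        # scholarliness, ...) to a score in [0, 1]. Rank candidates that sit
        # near the shelf's centroid on match_axes but far from it on stretch_axes.
        def centroid(axes):
            return [sum(book[axis] for book in shelf) / len(shelf) for axis in axes]
        near = centroid(match_axes)
        far = centroid(stretch_axes)
        def score(book):
            closeness = -dist([book[axis] for axis in match_axes], near)
            novelty = dist([book[axis] for axis in stretch_axes], far)
            return closeness + novelty
        return sorted(candidates, key=score, reverse=True)[:k]

    # e.g., picks that share my shelf's subjects but stretch its tone:
    # serendipitous(my_shelf, catalog, match_axes=["subject"],
    #               stretch_axes=["tone", "scholarliness"])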

Q&A

Q: One of the concerns in research and university libraries is the ability to return to the evidence you’ve cited. Having many manifestations (= editions, etc.) lets scholars return. We need permanent ways of getting back to evidence as it existed at a particular time. E.g., the Census Bureau makes corrections, which means people who ran analyses of the data get different answers afterward.
A: The glib answer: you just need better citation mechanisms. The more sophisticated answer: Anglo-Saxon scholars will hold up a palimpsest. I don’t have an answer, except for a pointer to a George Mason conference where they’re trying to come up with a protocol for expressing uncertainty. [I think I missed this point. -- dw] What are all the ways to point into a work? You want to think of the work as a container, with all the annotations that come with it. The ideal container has the text itself, info extracted from it, the programs needed to do the extraction, and the annotations. This raises the issue of the persistence of digital media in general. “We need to get into the mindset of bundling it all together”: PDFs and TIFFs plus the programs for reading them. [But don’t the programs depend upon operating systems? -- dw]

Q: Centralized vs. distributed repository models?
A: It gets into questions of rights. I’d love to see it distributed to as many places and in as many formats as possible. It shouldn’t just be Google digitizing books. You can fit 100 petabytes into a single room, and of course into much less space in the future. There are advantages to keeping things local. But for in-copyright works, it’ll come down to how comfortable the rights holders feel that it’s “too annoying” for people to copy what they shouldn’t.


February 28, 2010

[2b2k] Another re-org

Last week, I went through the current (dis)organization of the book with Tim Sullivan, my editor at Basic Books. I’ve known Tim for a few years (even before he became the editor of the tenth-anniversary edition of Cluetrain), which is the basic reason I went with Basic for Too Big to Know. Tim’s got a sharp eye for the structure of books, as well as being smart about, and fully engaged in, the content. Truly a pleasure to work with.

Tim is the opposite of freaked out by my thrashing. In fact, he’s actually sort of encouraging about it, because (I assume) he sees it as one way the creative process proceeds. So, I came out of that conversation a little less freaked out myself.

Here’s where I am at the moment. I have a prologue that needs some work but that Tim thinks sets up the problem well enough. It contrasts Darwin’s sort of facts with those at Hunch.com, and tries to lead the reader to see not just that there is too much to know but that our new muchness seems to be changing the nature of knowledge itself. (My concern with the prologue is that I don’t want the reader to think that the book is about algorithmic learning, as the Hunch.com example might suggest.)

I’ve now re-done Chapter 1. It begins with a section on the data-information-knowledge pyramid as an example of our traditional strategy of dealing with the knowledge overload by narrowing our field of vision. Then I talk about information overload as a fact of life. I introduce Clay Shirky’s “It’s not information overload — it’s filter failure” idea, and then say that the difference is not simply that we now have social filters and the like. Rather, our filters now don’t filter out so much as filter forward — they reduce the number of clicks it takes to get to an item, but they leave the other items accessible. This puts the fact of overload straight into our faces. I close by suggesting a half dozen ways this affects knowledge, but I’m not sure I’ll keep that little section.

I’m working on Chapter 2, for the moment called “The expertise of clouds,” which was a leading contender for the book’s title back when I was plotting the book. It looks like it may be a very long chapter on networked expertise. I’m not exactly sure how to organize it at the moment. The main question is whether I put into it all the multiple case studies and examples of networked expertise I’ve been accumulating.

I feel like I’m postponing facing the organizational problem posed by what I’m proposing as Chapter 3: the history and future of facts. (That’s the grandiose way of putting a much more mundane topic.) I’m afraid that chapter will strike the reader as unfocused and pointless. Why are we reading about the 19th century social reform movement in England? Beats me. But, thankfully I have Chapter 2 to distract me from that question.


February 24, 2010

[2b2k] Eggs good for you this week

The title of this post is one of my favorite headlines from The Onion.

So, yesterday we were told that maybe taking a baby aspirin every day is more harmful than helpful, except for those with certain heart-disease risk factors. (My doctor has me on ’em. I’m going to keep taking them.)

Today, an article in the Boston Globe reports on a study that says saturated fats don’t clog arteries the way we’ve been told for generations. (In the 1930s, when my grandfather had a heart attack, my grandmother was told to make sure he ate lots and lots of butter, to keep anything from sticking to his arteries.)

So, what will they take back tomorrow? Germ theory? Gravity? Heliocentrism? Bring back phlogiston!


February 23, 2010

[2b2k] Tuttle Club’s expertise clubhouse

Here’s a post from last July — ok, so I’m a little behind in my reading — that describes the Tuttle Club’s first consulting engagement. An open, self-selected group of people converge for an open session with the potential client. They talk, sketch, and do some improv, out of which emerges a set of topics and people for more focused discussion.

This is semi-emergent expertise. I add the “semi” because the starting conditions are quite focused, so the potential areas of collaboration and outcomes are fairly constrained. But compared to traditional Calf Sock Expertise (i.e., highly paid and trained men in blue suits who believe that focus is the only efficient way to proceed), this is wildly emergent.


January 7, 2010

Cellphones prevent Alzheimer’s

My friend Hilly Besdin sent me this link to an article in Medical News Today titled Cell Phone Waves Protected Mice Against Alzheimer’s, Reversed Memory Damage.

Hilly also makes the appropriate connection: it could go straight into the scene in Woody Allen’s Sleeper in which we learn that, in the future, smoking is “one of the healthiest things for your body.” (I also like — and will use in Too Big to Know — The Onion headline “Eggs good for you this week.”)


January 2, 2010

[2b2k] Almost complete first draft of Chapter 1

And when I say “first draft,” what I actually mean is the fifth draft of the first draft. Even that’s not right, since I go through the chapter continuously, and create a new draft (or what I should perhaps call a “version”) whenever I’m about to make a big change I think I may regret.

Anyway, I think and hope that it’s in roughly the shape it needs to be in, although I’ll re-read it tomorrow and may decide to scrap it. And when I’ve finished the last chapter, I may well see that I need to throw out this one and begin again. Life in the book-writing biz.

There are definitely things I don’t like about the current version. For example, the beginning. And the ending. Also, some stuff in the middle.

The current draft begins with the question “If we didn’t have a word for knowledge, would we feel the need to create one?” I don’t answer that in this chapter. I’m thinking I’ll come back to it at the end of the book. Instead, I quickly go through some of the obvious reasons we’d answer “yes.” But then I need to suggest that the answer might be “no,” and I don’t think I do a good enough job on that. It’s difficult, because the whole book whittles away at that answer, so it’s hard to come up with a context-free couple of paragraphs that will do the job. I want this chapter to focus on the nature of knowledge as a structure, so I contrast traditional guide books with the open-endedness of the Web, hoping to suggest that knowledge has gotten too big to be thought of as structure or even as a realm. (I can only hint at this at this point.) But, the Web example seems so old hat to me that I even have to apologize for it in the text (“Just another day on the Web…”). I’d rather open by having me in some actual place that I can write about — someplace where I can point to obvious features that are only obvious because we make non-obvious assumptions about the finitude, structure, and know-ability of knowledge. A library? I’d like to think of something more novel.

Since I last updated this blog about my “progress,” I’ve added a section on the data-information-knowledge-wisdom hierarchy, which traces back to T.S. Eliot. I glom onto some of the definitions of “knowledge” proposed by those who promulgate that hierarchy and point out that they have little to do with what we usually mean by knowledge (and what Eliot meant by it); rather, they slap the label “knowledge” on whatever seems to justify investing in information-processing equipment. I then swerve from giving my own definition — a swerve I should justify more explicitly — and instead spend some time describing the nature of traditional knowledge. The upshot of that section is that we think of knowledge as something built on firm foundations. These days, we take facts as the bricks of knowledge. But it wasn’t always so. And that, I hope, leads the reader smoothly enough into a discussion of the history of fact-based knowledge (which I’m maintaining really came into its own in the early 19th-century British social reform movement).

I also added a brief bit about what non-fact-based knowledge looked like. I’d already discussed the medieval idea of assembling knowledge based on analogies, but I wanted to give a more modern example. So, I looked at Malthus, whose big book came out in 1798. I was disappointed to find that Malthus’ book is full of learned discussions of statistics and facts, and thus not only wasn’t a suitable example but seemed to disprove my thesis. Then I realized I was looking at the 6th edition. Malthus revised and republished his book for the next thirty years or so. If you compare the 6th edition with the first, you are struck by how stat-free edition #1 is and how stat-full #6 is. The first edition is a deductive argument based on seemingly self-evident propositions. The support he gives for his conclusion is based on anthropological sketches and guesses about why various populations have been kept in check. The difference between #1 and #6 actually helps my case.

The last section now introduces the idea of “knowledge overload” (which is still distressingly vague and I may have to drop it) and foreshadows some of the changes that overload is bringing. I’m having trouble getting the foreshadowing right, though, since it requires stating themes that will take entire chapters to unpack.

So, having obsessively worked on this every day for the past few weeks with no days off from it, I’m going to let it sit for a day or two. I think I’ll start sketching Chapter 2.


December 21, 2009

[2b2k] Struggling with who cares

Yesterday I wrote a little — which will probably turn out to be too much — about the history of fact-finding missions. They’re really quite new, becoming a conspicuous part of international dispute settlement only with the creation of the Hague Convention in 1899. If you do a search on the phrase at the NY Times, you’ll see that there are only intermittent references until the 1920s, when suddenly there are lots of them.

It strikes me as odd that we didn’t always have fact-finding missions, which is why I find it interesting. But I don’t think I can convince the reader that it’s interesting, which is why I’ve probably gone on too long about them. (There were obviously previous times when we tried to ascertain facts, but the phrase and the institutionalizing of fact-finding missions or commissions is what’s relatively new.)

Today I’m thinking I really need to shore up the opening section of this first chapter in order to show why the next section (on the history of facts, including fact-finding missions) matters. I think I’ll try to do that by briefly sketching our normal “architecture” of knowledge. For this it’d be good to come up with an easy example. Working on it…
