Joho the Blog » taxonomy

April 27, 2010

[berkman] Luis von Ahn on free lunches, captcha, and tags

Luis von Ahn of Carnegie Mellon University is giving a Berkman lunchtime talk. [NOTE: I'm liveblogging. I'm making mistakes, leaving stuff out, paraphrasing, getting things wrong. This is an unreliable record.]

Luis invented captchas, the random characters you have to type in to convince a web page that you are a human and not a hostile software program. (He shows randomly generated sequences that happened to spell out “wait” and “restart.”) Captchas are useful, he says, when you’re trying to prevent people from gaming a system by writing a program to enter data robotically. They’re also useful to prevent spammers from signing up for free email accounts. To get around this, spammers have started up sweat shops where humans type captchas all day long; it costs the spammers about $0.33/account. And some porn companies ask users to type in a captcha to see photos; the captchas are drawn from email account applications. Damn clever!

He shows some variants. A Russian one asks you to solve a mathematical limit. In India one asks you to solve a circuit. Luis says these aren’t all that effective because computers can solve both problems, but they’re still better than the “what is 1 + 1?” captchas he’s found on US sites.

He says that about 200M captchas are typed every day. He was proud of that until he realized it takes about 10 seconds to type them, so his invention is wasting 500,000 hours per day. So, he wondered if there was a way to use captchas to solve some humungous problem ten seconds at a time. The result: ReCAPTCHA. For books written before 1900, the type is weak and about 30% of the text cannot be recognized by OCR. So, now many captchas ask you to type in a word that went unrecognized when OCR’ing a book. (The system knows which words are unrecognized by running multiple OCR programs; ReCAPTCHA uses those words.) To make sure that it’s not a software program typing in random words, ReCAPTCHA shows the user two words, one of which is known to be right. The user has to type in both, but doesn’t know which is which. If the user types in the known word correctly, the system knows it’s not dealing with a robot, and that the user probably got the unknown word right.
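Here is a minimal Python sketch of the two-word check as I understood it from the talk. It is my own illustration, not ReCAPTCHA's actual code: the names `check_response`, `best_guess`, and `votes` are made up, and the `min_votes` threshold is an assumption (Luis did not say how many humans have to agree before a word counts as read).

```python
from collections import Counter, defaultdict

# Hypothetical store: id of an unreadable word -> tally of what humans typed.
votes = defaultdict(Counter)


def normalize(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


def check_response(control_answer, unknown_id, typed_control, typed_unknown):
    """Return True if the user passes, and record their guess for the unknown word.

    control_answer: correct text of the word the system already knows
    unknown_id:     identifier of the word the OCR programs could not read
    """
    if normalize(typed_control) != normalize(control_answer):
        return False  # missed the known word: possibly a robot, so ignore the guess
    votes[unknown_id][normalize(typed_unknown)] += 1
    return True


def best_guess(unknown_id, min_votes=3):
    """Return the leading transcription once enough humans agree, else None.

    min_votes is an illustrative threshold, not a figure from the talk.
    """
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None
```

The point of the design is that the user cannot tell which word is the control, so passing the control is weak evidence that the other answer is an honest attempt too.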

ReCAPTCHA is a free service. Sites that use it have to feed back the entries for the unknown word. About 125,000 sites use it. They’re doing about 70M words per day, the equivalent of 2-4M books per year. If the growth continues, they’ll run out of books in 7 years, but Luis doesn’t think the growth will continue, so it might take twenty years. (There are 100M books.)

(In response to a backchannel question, Luis tells the penis captcha story.)

The ReCAPTCHA system filters out nationalities, known insult terms, and the like, to avoid unfortunate juxtapositions. It’s soon going to be released in 40 languages. Google acquired ReCAPTCHA.

Q: When will OCR be good enough to break captchas?
A: I don’t know. We’ll probably run out of books first.

Q: Business model?
A: Google Books gets help digitizing.

ReCAPTCHA “reuses wasted human processing power.” The average American spends 1.9 seconds per day typing captchas. We also spend 1.1 hours a day playing electronic games. We humans spent 9B hours playing solitaire in 2003. It took less than a day of that to build the Panama Canal. So, Luis switches topics a bit to talk about how to solve human problems by playing games.

First is tagging images with words. Image search works by looking at file names and html text, because computers can’t yet recognize objects in images very well.

Does typing two words take twice as long as typing random letters? No, it takes about the same time, he says. Luis says about 10% of the world’s population have typed in a captcha. The ESP game asks two people unknown to each other to label an image until they agree. The game taboos words that other players have already agreed on. The system passes images through until they get no new labels. They’ve gotten over 50M agreements. 5,000 players playing simultaneously could label all Google images in a month. Google has its own version; Google has an exclusive license to the patent.
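A rough sketch of one ESP game round, in Python. This is only my reading of the rules Luis described: the function name `play_round` is hypothetical, and taking the two players' guesses in alternating turns is my simplification (the real game runs in real time).

```python
def play_round(guesses_a, guesses_b, taboo=()):
    """Return the first label both players have entered, or None.

    guesses_a, guesses_b: each player's guesses, in the order typed
    taboo: labels earlier pairs already agreed on for this image
    """
    banned = {word.lower() for word in taboo}
    seen_a, seen_b = set(), set()
    # Alternate turns so that neither player's list is favoured.
    for i in range(max(len(guesses_a), len(guesses_b))):
        for guesses, mine, theirs in ((guesses_a, seen_a, seen_b),
                                      (guesses_b, seen_b, seen_a)):
            if i >= len(guesses):
                continue  # this player has run out of guesses
            word = guesses[i].lower()
            if word in banned:
                continue  # taboo words are ignored
            if word in theirs:
                return word  # agreement: this becomes a label for the image
            mine.add(word)
    return None
```

Once a label is agreed on, it would be added to the taboo list for that image, which pushes later pairs toward less obvious labels until nothing new turns up.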

Q: Demographics?
A: For my version, average age is 29 (with huge variance), evenly split between women and men.

Q: Compared to Flickr tags?
A: Only a small fraction of Flickr images have useful tags. The tags from Flickr tend to be significantly more exact, but also significantly noisier (e.g., a person tagging an image in a way that means something idiosyncratic).

Q: Bots?
A: Yes, we don’t want you to wait for a partner, so sometimes we’ll give you a bot that replays the moves a human had made with the same image.

Q: Google Images benefits from its version of your game. Who benefits from your version of the game?
A: No one.

For some images, guesses change over time. E.g., a Britney Spears photo five years ago got labels like britney and hot. About two years ago, the labels changed to crazy, rehab, and shaved head. Now they’re back to britney and hot. By watching a player for 15 mins, you can guess whether the player is male or female with 95-98% accuracy.

Why do people like the ESP game? Sometimes they feel an intimacy with their partners. They have to step outside of themselves to make the match. They can have a sense of achievement.

He ends by saying that about the same number of people (100,000) have worked on each of humanity’s big projects, e.g., pyramids, Panama Canal, putting a person on the moon. That’s in part (he says) because it is so hard to coordinate large numbers of people. Now we can get 100M people to work on something. What can we do?


November 15, 2009

OMG. I disagree with Umberto Eco!

It makes me very nervous to disagree with Umberto Eco because he is so fathomlessly smart. But I think in this case I do. Sort of.

There’s a fabulous interview with Eco in Spiegel (in English) about why he loves lists. He is characteristically pithy, provocative and wise. A crucial paragraph, from the beginning:

The list is the origin of culture. It’s part of the history of art and literature. What does culture want? To make infinity comprehensible. It also wants to create order — not always, but often. And how, as a human being, does one face infinity? How does one attempt to grasp the incomprehensible? Through lists, through catalogs, through collections in museums and through encyclopedias and dictionaries. There is an allure to enumerating how many women Don Giovanni slept with: It was 2,063, at least according to Mozart’s librettist, Lorenzo da Ponte. We also have completely practical lists — the shopping list, the will, the menu — that are also cultural achievements in their own right.

I read the first sentence and was provoked, as Eco intends. Lists are the origin of culture? Please say more! But Eco doesn’t really explain, in this interview, why lists — as opposed to other forms of collections and orderings — are so important. The urge to make order, yes, but not lists themselves.

A list is one particular way of creating order. Lists are sequential and one-dimensional: Wines listed by year, or by place, or by ranking, or by the chronology of when you first encountered them. (Lists can be hierarchical, but they’re only lists if they can be resolved back down to the one-dimensional.) Lists thus are one elemental way of ordering the world. And they have a peculiar fascination, which Eco expresses beautifully. But I think it’s wrong to say that they’re the origin of culture. I think it’d be more accurate and useful to say that culture originates with collecting: Pulling things around us because of their appeal (a word I’m purposefully leaving vague).

I’m sure I’m making too much of Eco essentially drumming up interest in his exhibit at the Louvre, but the issue matters a little bit. I think (based on little to nothing) that lists emerged as a stripping down of multi-dimensional collections. Culture first happened (I imagine) when we pulled together pieces of the world that spoke to us in ways we could not articulate. We assembled them as spaces through which we could wander, or piles through which we could collectively sort (“Oooh, I particularly like that green shiny stone!”). Lists are an abstraction, and culture began (I suppose) with an unarticulated sense that some things go together, and perhaps our first conversations were about why.

Eco goes on to say many wonderful things about why we have liked lists, including proposing that listing properties of an object can liberate us from looking for the definitional essence of things. (For more on this, read his important book, Kant and the Platypus.) In fact, Eco suggests that a mother defines a tiger to her child “Probably by using a list of characteristics: The tiger is big, a cat, yellow, striped and strong.”

I have a bunch of issues with that.

First, that type of definition really just makes explicit what’s implicit in the traditional approach to definitions as essence. In the traditional Aristotelian approach, the essence is the creature’s spot in the hierarchy of beings. So, a tiger is a species of cat, and thus would be specified by its difference from other cats but also by all of the properties of the classes above it (mammal, vertebrate, animal, etc.). The essential definition and the list definition both consist of a list of properties, but the essential definition nests them so that they don’t all have to be spelled out, and so we can see which differences “count.” Eco says, “The essential definition is primitive compared with the list,” but it seems to me that a beautifully nested, hierarchical system of essential definitions is in fact more advanced — it requires abstraction and systems thinking — than a mere list.

But, I don’t want to miss Eco’s essential (so to speak) point here, which is that defining something with a list breaks us out of the notion that there is a single, knowable essence. Absolutely. There’s no eternal essence, “just” a set of properties that are relevant depending upon our circumstances. With that I wholeheartedly agree.

My second problem with this is that — as George Lakoff says in Women, Fire and Dangerous Things, explicating and expanding the work of Eleanor Rosch — the mother (heck, maybe even the father) probably actually teaches the child what a tiger is by pointing at one, or at a picture of one. We learn through prototypes, not through essential definitions, and not by making lists. List-making is an abstraction and a secondary activity.

Third, the listing the parent does seems to me not to have the properties that make lists captivating to Eco. The parent isn’t trying to give a complete listing that brings a sense of mastery over the infinite and over death. She’s just pointing out some of the salient features. If it is a list, it’s not a list of the sort that Eco has charmed us about.

Fourth, while lists of properties are a useful corrective to thinking that things are exhausted by a definition of their essence, lists strip out so much that they don’t seem much more adequate than essential definitions. A tiger isn’t a list.

This is just a fun interview in Spiegel, so I may be taking it too seriously. So, even if lists occur within culture — including the lists in literature he points to — rather than being the origin of culture, the interview does indeed help us to see why our fascination with lists is a fascination with something bigger than lists.


November 12, 2009

Lego blocks unmiscellanized

Giles Turnbull at the Morning News reports on his research interrogating (gently) children from different families about what they call various Lego pieces. Quite interesting in its own taxonomic way, and a topic that’s amusing even just to contemplate.


October 11, 2009

Net uncovers new type of cloud

There are reports of a new type of cloud, one that is not currently in the official International Cloud Atlas. Or, possibly, it is a formation that’s been around forever, but the scattered reports are only now coalescing thanks to the Net.

According to Amazon’s review of Richard Hamblyn’s The Invention of Clouds, we only began thinking clouds could be categorized in 1802 when Luke Howard started giving public lectures. The very idea that clouds — the paradigm of uncatchable — could be divided into groups was (apparently) fascinating and thrilling. (Lamarck had also categorized clouds, but it didn’t catch on.)

A quick googly scan makes it seem that the cloud taxonomy is pretty messy. For example, the University of Illinois’ “cloud types” page lists four broad categories, and a list of miscellaneous clouds, each of which is categorized under one of the four basic types, evoking a “Huh?” reaction from at least one of us. The cloud taxonomy page at Univ. Missouri-Columbia lists eight types. Do you categorize by what they look like, how high they are, what they do (rain or not?), which celebrity profiles they resemble …? Categorizing clouds is truly a Borgesian task.

And, dammit, wouldn’t you know? Here’s a poem by Jorge Luis Borges called: “Clouds (II)” (with the line-endings probably removed):

Placid mountains meander through the air, or tragic cordilleras cast a pall, overshadowing the day. They are what we call clouds. And their shapes are often strange and rare. Shakespeare observed one once. It seemed to be a dragon. That one cloud of an afternoon still kindles in his words and blazes down, so that we go on seeing it today. What are the clouds? An architecture of chance? Perhaps they are the necessary things from which God weaves his vast imaginings, threads of a web of infinite expanse. Maybe the cloud is emptiness returning, just like the man who watches it this morning.

(Translated by Robert Mezey and Richard Barnes. “Clouds (II).” The American Poetry Review. World Poetry, Inc. 1996. HighBeam Research. 11 Oct. 2009.)

More Borges poems


September 13, 2009

From Technorati to WordPress tag namespace

The excessively sharp-eyed of you may have noticed that I have recently switched from listing tags at the end of posts to using WordPress tags at the end of posts. Here’s why. Not that you should care.

When tagging first took off, there weren’t a lot of good places to link your tags to. So, I chose to have them link to Technorati because Technorati was then the leading search engine for blogs. Plus, Technorati had taken the lead in making itself tag-worthy. Plus, Technorati was founded by a friend of mine, David Sifry, who I trusted (and still do trust) to do the Right Thing. Also, I was on the Technorati board of advisers (uncompensated), so I had some basic familiarity with the site and the people. As a result, when you click on one of my old-style tags, it does a search for tags at Technorati and shows you the results. For example, here’s a tag to try: [Tags: ].

A couple of years ago, WordPress (the blogging software I use) introduced its own tagging capability. Instead of my having to hand-create links to the tags I want to use (actually, I wrote a little javascript to do it for me), I can enter tags and WordPress will turn them into links that aggregate all of my own postings that I’ve tagged that way. At the bottom of this post, you can try out the taxonomy link.

This is a further step into narcissism, for rather than seeing what the rest of the world has tagged “e-gov” (or whatever), you now see only my posts tagged that way. But I suspect that is probably what most users expect and want when they click on a tag at the bottom of a post. If you want to search all posts by everyone that have a certain tag, Technorati and other sites will do it for you.

(By the way, many thanks to Brad Sucks for writing the scripts that extracted my old tags and auto-inserted them as WordPress tags. He says the scripts are too focused to be of general use, so don’t ask. But do buy his music.)


August 26, 2009

Encyclopedia of Life – Now by Humans!

The Encyclopedia of Life is encouraging citizen contributions to its expert-vetted pages, so far with what seem like excellent results. There’s a good article about this at Science Daily. After two years, they’ve got 150,000 species pages underway, with 1.4 million stubs awaiting drafting.



August 19, 2009

Dilbert goes miscellaneous

Amusing Dilbert today, for those who can’t resist a good taxonomy joke. (Thanks for the tip, Helena!)


August 11, 2009

The universality of names

There’s a terrific article by Carol Kaesuk Yoon in the NY Times about research that shows that humans around the world tend to cluster the natural world in highly similar ways, even using similar-ish names.


July 26, 2009

The Guardian on miscellaneous bookshelves

The Guardian has a fun article on schemes for arranging the books on your shelf, with an interesting set of comments. (It makes me want to send the entire thread a copy of Everything Is Miscellaneous.)


July 11, 2009

Reslicing publications

The OCLC has an experimental site up that provides classification information for books and pubs. You type in the book’s title and author (or ISBN number, or other such ID), and it returns info about the various editions and how they’re classified in the OCLC’s Dewey Decimal Classification System or by the Library of Congress. You can then see the other books that share its Dewey Decimal number (for example, here’s Everything Is Miscellaneous, #303.4833>>Social sciences>>Social sciences, sociology & anthropology>>Social processes), at the OCLC’s useful Dewey Browser. Alas, when you click on the Library of Congress number, you get taken to a demand by the LC that you subscribe to Classification Web, instead of to the free LC Catalog (where my Misc book is listed like this).

Lots of metadata about the metadata…Gotta love it!


