Taxonomies and Tags
From Trees to Piles of Leaves

 This is the introductory section of the new issue of Esther Dyson's Release 1.0, which I wrote. The article goes on to talk about some companies and projects doing interesting things in this area, including Yahoo, Corbis, ClearForest, Chandler, the Dewey Decimal Classification, Endeca, Siderean, NYTimes.com, del.icio.us, Flickr, Wikipedia, frassle and Technorati. (Thanks to Esther and Christina Koukkos for permission to post this and for being insightful and steady-handed editors.)
-- David Weinberger
[Lightly edited on Jan. 20, 2006]

The Three Orders

The narrative that tells of the first man and woman encountering the tree of knowledge focuses on its tempting fruit. But after we took the bite, we apparently looked up and got the idea that knowledge is shaped like the tree's branching structure: Big concepts contain smaller ones that contain smaller ones yet. Over the millennia, we have fashioned the structures of knowledge in just such tree-like ways, from the departmental organization of universities (liberal arts contains history and history contains ancient Chinese history) to the hierarchy of species. The idea that knowledge is shaped like a tree is perhaps our oldest knowledge about knowledge.

Now autumn has come to the forest of knowledge, thanks to the digital revolution. The leaves are falling and the trees are looking bare. We are discovering that the traditional knowledge hierarchies that have served us so well are unnecessarily restrictive when it comes to organizing information in the digital world. The principles of organization themselves are changing now that they are being freed from the constraints of the physical world. For example:

  • In the physical world, a fruit can hang from only one branch. In the digital world, objects can easily be classified in dozens or even hundreds of different categories.
  • In the real world, multiple people use any one tree. In the digital world, there can be a different tree for each person.
  • In the real world, the person who owns the information generally also owns and controls the tree that organizes that information. In the digital world, users can control the organization of information owned by others. (Exception to the rule: Westlaw owns the standard organization of case law even though the case law itself is in the public domain.) 

These differences are so substantial that we can think of intellectual order as entering a third age. In the first, we organized the things themselves: We put books on shelves and silverware into drawers. In the second, we physically separated the metadata from the data: We built card catalogs and drew diagrams. In the third, the data and the metadata are digital, untying organization from the strictures of the physical world. In response, we are rapidly inventing new principles and tools of organization. When it comes to innovation on the Internet, metadata is becoming the new content.

But traditional taxonomic trees aren't something we can throw away without a thought. They are an amazingly efficient way of organizing complexity because they enable us to focus on one aspect (e.g., that's an apple) while keeping a universe of context (it's a fruit, part of a plant, a type of living thing) in the background, ready for access. Tree structures are built into our institutions. They may even be built into our genes. So we are in a confusing and fertile period as we try to sort out what works and what doesn't. Without trees, how would we organize college curricula, business org charts, the local library, and the order of species? How will we organize knowledge itself?

We may be on the path to finding out.

Webogeny recapitulates ontogeny

The tree of knowledge has roots, of course. They go back to Aristotle, who figured out how knowledge could be nested without having to claim that the container (say, the concept of human-ness) is the same sort of thing as what it contains (all existing humans). Each individual item in a hierarchy inherits the properties of all the categories above it, so that if you know that Alcibiades is a human, you also know that he is a mammal and an animal. Inheritance provides a context by which the individual accretes the accumulated wisdom of the tree just by hanging on a particular branch -- an amazingly efficient way of expressing knowledge.
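The inheritance described above can be sketched in a few lines of Python. This is only an illustration: the `PARENT` table and the helper names `ancestors` and `is_a` are assumptions made for the example, not part of any classification system discussed here.

```python
# Hypothetical parent links; the category names are illustrative only.
PARENT = {
    "Alcibiades": "human",
    "human": "mammal",
    "mammal": "animal",
    "animal": None,  # root of this small tree
}

def ancestors(item, parent=PARENT):
    """Return every category above `item`, nearest first."""
    chain = []
    node = parent.get(item)
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain

def is_a(item, category, parent=PARENT):
    """True if `category` appears anywhere above `item` in the tree."""
    return category in ancestors(item, parent)

print(ancestors("Alcibiades"))       # ['human', 'mammal', 'animal']
print(is_a("Alcibiades", "mammal"))  # True
```

Placing Alcibiades on a single branch is the only fact recorded about him; everything else follows from where that branch sits.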

Roughly six hundred years later the Syrian philosopher Porphyry first drew Aristotle's system of nested concepts as a tree. That notion stuck, implicitly endorsed by Carl Linnaeus and Charles Darwin in the sciences, by Francis Bacon in philosophy, and by libraries and academic departments just about everywhere.

The next stop in this story is Postmodernism's insistence that trees of knowledge are reflections of particular cultural assumptions and, importantly, conflate knowledge and power. You can't read Michel Foucault's The Order of Things and believe that order itself has no history. And not just French philosophers have given up on the old dream of finding a single, universal, comprehensive way of organizing the world's knowledge. You can't come out of Geoffrey C. Bowker and Susan Leigh Star's study of the International Classification of Diseases, Sorting Things Out, thinking that classification systems are value-free and objectively true. Nor can you look at the US Census' 2000 decision to expand the number of possible races without seeing that taxonomies can have enormous political and budgetary consequences.

The brief history of the Web has recapitulated Western culture's ontogeny of trees. Yahoo!'s directory tree became the early center of the Web, each leaf hand-selected and placed into categories designed initially by two computer science grad students at Stanford. But text search engines (AltaVista, HotBot, Google) dethroned Yahoo! as the Monarch of Search, and Yahoo! in turn has moved its browsable tree below the fold on its home page.

When text search isn't the right solution — for example, at e-commerce sites where people may not know the names of the products they're looking for — a more dynamic way of creating and presenting trees, called faceted classification, is coming into its own. Invented in the early 1930s by Shiyali Ranganathan, an Indian librarian, it applies a pre-defined set of parameters (or facets) to its objects. For example, watches might have facets such as manufacturer, digital or analog, men's or women's, price, and electric or spring-driven. Some facets are a set of possible values (such as a pick-list of available manufacturers); others are a range of numerical values (such as price range). Users can then browse by selecting first on, say, digital or analog and then by price, or first by price and then by men's or women's. Users can drill down as they do with a normal tree, but the arrangement of the branches is dynamic and reflects the users' interests, not the store's. The store may not like it that you've routed around the $25,000 Rolex they're offering on sale for a mere $24,000, but you've found your $50, waterproof, analog watch much faster.
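A minimal sketch of faceted browsing, using the watch example, might look like the following. The catalog, the facet names, and the functions `narrow`, `narrow_range`, and `counts` are all invented for illustration; real faceted search products are considerably more elaborate.

```python
from collections import Counter

# Hypothetical catalog; every product and value here is made up.
WATCHES = [
    {"maker": "Acme", "display": "analog", "style": "men's", "price": 50, "power": "electric"},
    {"maker": "Acme", "display": "digital", "style": "women's", "price": 35, "power": "electric"},
    {"maker": "Borel", "display": "analog", "style": "women's", "price": 220, "power": "spring-driven"},
    {"maker": "Borel", "display": "digital", "style": "men's", "price": 90, "power": "electric"},
]

def narrow(items, facet, value):
    """Keep items whose pick-list facet equals `value`."""
    return [w for w in items if w[facet] == value]

def narrow_range(items, facet, low, high):
    """Keep items whose numeric facet falls within [low, high]."""
    return [w for w in items if low <= w[facet] <= high]

def counts(items, facet):
    """Count remaining items per value, as a facet sidebar might display."""
    return Counter(w[facet] for w in items)

# Two users reach the same watch by applying facets in different orders.
by_display_first = narrow_range(narrow(WATCHES, "display", "analog"), "price", 0, 100)
by_price_first = narrow(narrow_range(WATCHES, "price", 0, 100), "display", "analog")

print(by_display_first == by_price_first)  # True
```

Because each facet is just an independent filter, the order in which they are applied is left to the user rather than fixed by whoever built the catalog.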

Faceted classification still presents users with a hierarchical tree, making it easy for them to browse to what they want. But unlike traditional trees, faceted systems don't decide beforehand how the branches are arranged. For example, if an ice cream stand organized its "customer experience" around a traditional hierarchical taxonomy (a tree), it might have a customer first choose between two flavors, then among three sizes, and finally between a cup and a cone. There are 12 potential paths and exactly one path to a large cup of chocolate ice cream. In a faceted system, you could browse first by flavor, size, or container, resulting in 36 potential paths and three ways of getting to your large cup of chocolate (or 72 paths and six ways, if the remaining two facets can also be taken in either order). Faceted systems, like trees, enable users to navigate by continually focusing their interests, but users get to decide how their interests are structured. This makes faceted systems very useful where there are lots of items with easily specifiable properties and users whose ways of browsing are difficult to predict, such as a parts catalog.
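The path counts in the ice cream example can be checked by brute force. The sketch below is an illustration only: the flavor and size names are assumed, and the count for the faceted case depends on whether just the first facet is freely chosen or every ordering of the three facets is allowed, so both are shown.

```python
from itertools import permutations, product

# Facet values follow the ice cream example; the specific names are assumed.
FACETS = {
    "flavor": ["chocolate", "vanilla"],
    "size": ["small", "medium", "large"],
    "container": ["cup", "cone"],
}

def paths(orderings):
    """Enumerate every path (facet order plus choices) for the given orderings."""
    result = []
    for order in orderings:
        for choice in product(*(FACETS[f] for f in order)):
            result.append(tuple(zip(order, choice)))
    return result

# A traditional tree fixes one ordering of the questions.
fixed_tree = paths([("flavor", "size", "container")])

# Only the first facet is free; the other two keep their relative order.
first_free = paths([
    ("flavor", "size", "container"),
    ("size", "flavor", "container"),
    ("container", "flavor", "size"),
])

# Every ordering of the three facets is allowed.
any_order = paths(permutations(FACETS))

target = {("flavor", "chocolate"), ("size", "large"), ("container", "cup")}

def ways_to_target(all_paths):
    """Count the paths that end at a large cup of chocolate."""
    return sum(1 for p in all_paths if set(p) == target)

print(len(fixed_tree), ways_to_target(fixed_tree))  # 12 1
print(len(first_free), ways_to_target(first_free))  # 36 3
print(len(any_order), ways_to_target(any_order))    # 72 6
```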

The long tail of tags

Tags have become the meme of the year, at least so far, writing another chapter in the history of classification systems. Tagging is an old idea, but it seems to be taking off now because some applications provide end-users with immediate benefits. For example, at del.icio.us, users enter bookmarks (URLs) they want to remember, adding a word or two — tags — so they can sort them later. Del.icio.us users can see not only everyone else's bookmarks, but also all the bookmarks tagged with a particular word. For example, if you care about Emily Dickinson, you can see all the Web pages del.icio.us users have tagged with "Dickinson" or "Emily Dickinson," a great tool for researchers.

Traditionally, people have been loath to attach metadata to objects, because it felt like a chore without immediate benefit. At del.icio.us and other sites such as Flickr, a photo-sharing site, there is a strong social benefit to tagging: We get to contribute to, and benefit from, the tagging done by others. To lower the hurdle and encourage tagging, both sites allow us to type in any word we want, rather than forcing us to navigate some hierarchical, controlled vocabulary. Of course, that also makes it far harder to find relevant objects: There's no immediate way to tell whether a photo tagged with "apple" shows a fruit or a computer. Plus, a search for photos tagged with "apple" will miss relevant photos tagged as "GrannySmith."
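A toy tag store makes both the appeal and the sloppiness concrete. This sketch is not how del.icio.us or Flickr is implemented; the `TagIndex` class, the user names, and the photo names are hypothetical.

```python
from collections import defaultdict

class TagIndex:
    """Toy free-form tag store: any user may attach any word to any item."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, user, item, *tags):
        for tag in tags:
            # Lowercasing is the only normalization applied here.
            self._by_tag[tag.lower()].add((user, item))

    def items(self, tag):
        """Every item any user filed under exactly this tag."""
        return {item for _, item in self._by_tag.get(tag.lower(), set())}

index = TagIndex()
# Hypothetical users and photos.
index.add("ann", "orchard.jpg", "apple")
index.add("bob", "powerbook.jpg", "apple")
index.add("cy", "pie-filling.jpg", "GrannySmith")

print(sorted(index.items("apple")))        # ['orchard.jpg', 'powerbook.jpg']
print(sorted(index.items("GrannySmith")))  # ['pie-filling.jpg']
```

The "apple" pile mixes a fruit photo with a computer photo and still misses the Granny Smith picture, because nothing in the store knows how the words relate to one another.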

Tags are a break from previous ways of categorizing. Both trees and faceted systems specify the categories, or facets, ahead of time. They both present users with tree-like structures for navigation, letting us climb down branches to get to the leaf we're looking for. Tagging instead creates piles of leaves in the hope that someone will figure out ways of putting them to use — perhaps by hanging them on trees, but perhaps creating other useful ways of sorting, categorizing and arranging them.

Even in these early days of tagging, we're seeing self-organizing taxonomies emerge from the piles. For example, if you're tagging a page about an Apple computer, you may notice that far more people use the tag "Mac" than "Macintosh." So, if you want lots of people to find the page, you will tag it "Mac." By using that tag, you have also increased the popularity and momentum of the "Mac" tag. The resulting bottom-up clustering of tags has been called a folksonomy. (It's also been called a "tagsonomy," but that's harder to differentiate from "taxonomy" when spoken aloud.)
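The feedback loop behind the "Mac" example can be simulated in a few lines. The starting counts and the `suggest` function are assumptions chosen for illustration, and the model is deliberately crude: each new user simply picks whichever candidate tag is already more popular.

```python
from collections import Counter

def suggest(tag_counts, candidates):
    """Pick the candidate tag that other users have applied most often."""
    return max(candidates, key=lambda t: tag_counts[t])

# Hypothetical starting counts for two competing tags.
tag_counts = Counter({"Mac": 120, "Macintosh": 15})

for _ in range(10):  # ten new users tag a page about an Apple computer
    chosen = suggest(tag_counts, ["Mac", "Macintosh"])
    tag_counts[chosen] += 1  # using the tag adds to its momentum

print(tag_counts["Mac"], tag_counts["Macintosh"])  # 130 15
```

No one designed the preference for "Mac"; it emerges from many individual choices, each of which makes the next one more likely.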

Folksonomies stand in sharp contrast to both trees and faceted systems. First, folksonomies tend to be clusters of tags, not hierarchies: There's a pile of "apple" tags and another pile of "GrannySmith" tags, but the folksonomy may not recognize that the latter is a subset of the former. Hierarchies can sometimes be derived from folksonomies, but they don't have to be. Second, trees and faceted systems are designed ahead of time, usually by information professionals. Folksonomies grow organically. Third, trees and faceted systems are usually owned and controlled by the people who own the information being organized, whereas folksonomies are (so far) unowned and not centrally controlled. Fourth, trees and faceted systems drive out ambiguity. For example, take a page that in a tagging system carries the ambiguous tag "apple." In a tree or faceted system, the branch it hangs from would tell you whether the page is about computers or fruit — inheritance at work. Tagging systems are inherently ambiguous. Trees are neat; piles of leaves are messy.

Because of these differences, the three approaches are useful in different circumstances:

  • Because they are unambiguous, trees work well where information can be sharply delineated and is centrally controlled. Users are accustomed to browsing trees, so little or no end-user training is required. But trees are expensive to build and maintain and require the user to understand the subject area well: How do you find the recipe for bread soup if you don't know to look in the "Tuscan Cooking" category?
  • Faceted systems work splendidly where an application is being used by such a wide range of users that no one tree is going to match everyone’s way of thinking. They are also easier to maintain than trees because adding a new item requires only filling in the information about the facets, rather than having to make a decision about exactly which category it should go into.
  • Tagging systems are possible only if people are motivated to do more of the work themselves, for individual and/or social reasons. They are necessarily sloppy systems, so if it's crucial to find each and every object that has to do with, say, apples, tagging won't work. But for an inexpensive, easy way of using the wisdom of the crowd to make resources visible and sortable, there's nothing like tags.

The craft of creating and maintaining trees and faceted systems is well advanced and well understood. Businesses have been built around them. But we don't yet know the outcome of the current infatuation with tags. The potential is real: If tag-mania continues, it will provide a layer of new metadata, generated by humans for other humans, that will spur innovation and businesses (and problems) that we cannot yet anticipate.