Joho the Blog » taxonomy

December 14, 2013

Are tags over-rated?

Jeff Atwood [twitter:codinghorror], a founder of Stack Overflow and Discourse.org — two of my favorite sites — is on a tear about tags. Here are his two tweets that started the discussion:

I am deeply ambivalent about tags as a panacea based on my experience with them at Stack Overflow/Exchange. Example: pic.twitter.com/AA3Y1NNCV9

Here’s a detweetified version of the four-part tweet I posted in reply:

Jeff’s right that tags are not a panacea, but who said they were? They’re a tool (frequently most useful when combined with an old-fashioned taxonomy), and if a tool’s not doing the job, then drop it. Or, better, fix it. Because tags are an abstract idea that exists only in particular implementations.

After all, one could with some plausibility claim that online discussions are the most overrated concept in the social media world. But still they have value. That indicates an opportunity to build a better discussion service. … which is exactly what Jeff did by building Discourse.org.

Finally, I do think it’s important — even while trying to put tags into a less over-heated perspective [do perspectives overheat??] — to remember that when first introduced in the early 2000s, tags represented an important break with an old and long tradition that used the authority to classify as a form of power. Even if tagging isn’t always useful and isn’t as widely applicable as some of us thought it would be, tagging has done the important work of telling us that we as individuals and as a loose collective now have a share of that power in our hands. That’s no small thing.


June 19, 2013

[lodlam] Topics

I’m at LODLAM (linked open data for libraries, archives, and museums) in Montreal. It’s an unconference with 100 people from 16 countries. Here are the topics being suggested at the opening session. (There will be more added to the agenda board.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

(Because this is an unconference, I probably will not be doing much more liveblogging.)

  • Taxonomy alignment

  • How to build a case for LOD

  • How to build a pattern library (a clear articulation for a problem, the context where the problem appears, and a pattern for its solution) for cultural linked open data

  • How to take PDF to the next level, integrating triples to make it open data? How to make it into a “portable data format”

  • How can we efficiently convert our data to LOD? USC has Karma and would like to convene a workshop about tools.

  • How to convert simple data to LOD? How to engage users in making that data better?

  • A cultural heritage standard.

  • User interfaces. What do we do after we create all of this data? [applause]

  • Progress since the prior LODLAM (in San Francisco)? BIBFRAME? Schema.org?

  • Preserving linked data

  • The NSA has built the ultimate linked data tool chain. What can we learn?

  • Internal use cases for linked data.

  • How to make use of dirty metadata

  • A draft ontology for MODS metadata (MODSRDF)

  • Collaborating on a harvesting/enrichment tool

  • Getty Vocabulary is being released as LOD [applause], but they need help building a community, making sure they have the right ontologies, early adopters, etc.

  • The data exhaust from DSpace and linking it to world problems — find the disconnects between the people who have problems and the people with info helpful for those problems

  • Identities and authorities — linked data as an app-independent way of doing identity control and management

  • RDF cataloging interface

  • Curation and social relationships

  • Linked Open Data ecosystems

  • A new understanding of search — the ways LODers search aren’t familiar to most people

  • BIBFRAME

  • Open Annotation tools enabling end users to enrich the graph

  • Our collections are different for a reason. That manifests itself in the data structure. We should talk about this.

  • In the business writ large, maybe we need the confidence to be invisible. What does that mean?

  • Feedback loops once data has been exposed

  • Wikidata — the database that supports Wikipedia

  • Forming an international group to discuss archival data, particularly in LOD


June 15, 2013

[2b2k][eim] My Stuttgart syllabus

I’ve just finished leading two days of workshops at the University of Stuttgart as part of my fellowship at the Internationales Zentrum für Kultur- und Technikforschung. (No, I taught in English.) This was for me a wonderful experience. First of all, the students were engaged, smart, fun, and spoke from diverse standpoints. Second, it reminded me how to teach. I had so much trouble trying to structure the sessions, feeling totally unsure how one does so. But the eight 1.5-hour sessions reminded me why I loved teaching.

For my own memory, here are the sessions (and if any of you were there and took notes, I’d love to see them):

Friday

#1 Cyberutopianism, technodeterminism, and Internet exceptionalism defined, with JP Barlow’s Declaration of the Independence of Cyberspace as an example. Class introductions.

#2 Information Age to Age of Connected. Why Ted Nelson’s Xanadu did not succeed the way the Web did. Rough technical architecture of the Net and (perhaps) its embedded political values. Hyperlinks.

#3 Digital order. Everything is miscellaneous? From information retrieval to search engines. From schema-based databases to tagging.

#4 Networked knowledge. What knowledge looks like once it’s been freed of paper. Four challenges to networked knowledge (with many more added by the students).

On Saturday we talked about topics that the students decided were interesting:

#1 Mobile net. Is Facebook making us more or less social? Why do we fill up every interstice by using Facebook on mobiles? What does this say about us and the notion of the self?

#2 Downloading. Do you download music illegally? What is your justification? How might artists respond? Why is the term “intellectual property” so loaded?

#3 Education. What makes a great in-person course? What makes for a miserable one? Oddly, many of the characteristics of miserable classes are also characteristics of MOOCs. What might we do about that? How much of this is caused by the fact that MOOCs are construed as courses in the traditional sense?

#4 Internet culture. Is there such a thing? If there are many, is any particular one to be privileged? How does the Net look to a culture that is dedicated to warding off what it sees as corrupting influences? End with LolCatBible and the astounding TheJohnnyCashProject.

Thank you, students. This experience meant a great deal to me.


April 25, 2013

[eim][misc] Too big to categorize

Amanda Filipacchi has a great post at the New York Times about the problem with classifying American female novelists as American female novelists. That’s been going on at Wikipedia, with the result that the category American novelist was becoming filled predominantly with male novelists.

Part of this is undoubtedly due to the dumb sexism that thinks that “normal” novelists are men, and thus women novelists need to be called out. And even if the category male novelist starts being used, it still assumes that gender is a primary way of dividing up novelists, once you’ve segregated them by nation. Amanda makes both points.

From my point of view, the problem is inherent in hierarchical taxonomies. They require making decisions not only about the useful ways of slicing up the world, but also about which slices come first. These cuts reflect cultural and political values and have cultural and political consequences. They also get in the way of people who are searching with a different way of organizing the topic in mind. In a case like this, it’d be far better to attach tags to Wikipedia articles so that people can search using whatever parameters they need. That way we get better searchability, and Wikipedia hasn’t put itself in the impossible position of coming up with a taxonomy that is neutral to all points of view.

Wikipedia’s categories have been broken for a long time. We know this in the Library Innovation Lab because a couple of years ago we tried to find every article in Wikipedia that is about a book. In theory, you can just click on the “Book” category. In practice, the membership is not comprehensive. The categories are inconsistent and incomplete. It’s just a mess.
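To make the messiness concrete, here is a minimal Python sketch (not the Lab’s actual code) of the obvious approach: walk the “Books” category tree through Wikipedia’s public MediaWiki API, using its standard categorymembers query, and collect article titles. The category name, depth limit, and helper names are illustrative choices, and even a shallow walk like this is enough to show how inconsistent and incomplete the membership is.

    # Minimal sketch (not the Lab's actual code): approximate "every article
    # about a book" by walking Wikipedia's category tree via the MediaWiki API.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def category_members(category, cmtype):
        """Yield members of a category ("page" or "subcat"), following continuation."""
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": category,
            "cmtype": cmtype,
            "cmlimit": "500",
            "format": "json",
        }
        while True:
            data = requests.get(API, params=params).json()
            for member in data["query"]["categorymembers"]:
                yield member["title"]
            if "continue" not in data:
                break
            params.update(data["continue"])  # pick up where the last batch ended

    def collect_articles(root="Category:Books", max_depth=2):
        """Gather article titles by recursing through subcategories to max_depth."""
        seen, articles, frontier = set(), set(), [(root, 0)]
        while frontier:
            cat, depth = frontier.pop()
            if cat in seen or depth > max_depth:
                continue
            seen.add(cat)
            articles.update(category_members(cat, "page"))
            frontier.extend((sub, depth + 1) for sub in category_members(cat, "subcat"))
        return articles

    if __name__ == "__main__":
        found = collect_articles()
        print(len(found), "articles reachable within two levels of Category:Books")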

It may be that a massive crowd cannot develop a coherent taxonomy because of the differences in how people think about things. Maybe the crowd isn’t massive enough. Or maybe the process just needs far more guidance and regulation. But even if the crowd can bring order to the taxonomy, I don’t believe it can bring neutrality, because taxonomies are inherently political.

There are problems with letting people tag Wikipedia articles. Spam, for example. And without constraints, people can lard up an object with tags that are meaningful only to them, offensive, or wrong. But there are also social mechanisms for dealing with that. And we’ve been trained by the Web to lower our expectations about the precision and recall afforded by tags, whereas our expectations are high for taxonomies.

Go tags.


November 22, 2011

Physical libraries in a digital world

I’m at the final meeting of a Harvard course on the future of libraries, led by John Palfrey and Jeffrey Schnapp. They have three guests in to talk about physical library space.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

David Lamberth lays out an idea as a provocation. He begins by pointing out that until the beginning of the 20th century, a library was not a place but only a collection of books. He gives a quick history of Harvard Library. After the library burned down in 1764, the libraries lived in fear of fire until electric lights came in. The replacement library (Gore Hall) was built out of stone because brick structures need wood on the inside. But stone structures are dank, and many books had to be re-bound every 30 years. Once Gore Hall filled up, more libraries were built; some 25-30 of Harvard’s libraries derive from the search for fireproof buildings, which helps explain the large distribution of libraries across campus. They also developed more than 40 different classification systems. At the beginning of the 20th C, Harvard’s collection was just over one million. Now it adds up to around 18M. [David's presentation was not choppy, the way this paraphrase is.]

In the 1980s, there was continuing debate about what to do about the need for space. The big issue was open or closed stacks. The faculty wanted the books on site so they could be browsed. But stack space is expensive and you tend to outgrow it faster than you think. So, it was decided not to build any more stack space. There already was an offsite repository (New England Book Depository), but it was decided to build a high density storage facility to remove the non-active parts of the collection to a cheaper, off-site space: The Harvard Depository (HD).

Now more than 40% of the physical collections are at HD. The Faculty of Arts and Sciences started out hostile to the idea, but “soon became converted.” The notion faculty had of browsing the shelves was based on a fantasy: Harvard had never had all the books on a subject on a shelf in a single facility. E.g., search on “Shakespeare” in the Harvard library system: 18,000 hits. Widener Library is where you’d expect to find Shakespeare books. But 8,000 of the volumes aren’t in Widener. Of Widener’s 10K Shakespeare volumes, 4,500 are in HD. So, 25% of what you meant to browse is there. “Shelf browsing is a waste of time” if you’re trying to do thorough research. It’s a little better in the smaller libraries, but the future is not in shelf browsing. Open and closed stacks isn’t the question any more. “It’s just not possible any longer to do shelf browsing, unless we develop tools for browsing in a non-physical fashion.” E.g., catalog browsers, and ShelfLife (with StackView).

There’s nobody in the stacks any more. “It’s like the zombies have come and cleared people out.” People have new alternatives, and new habits. “But we have real challenges making sure they do as thorough research as possible, and that we leverage our collection.” About 12M of the 18M items are barcoded.

A task force saw that within 40 years, over 70% of the physical collection will be off site. HD was not designed to hold the part of the collection most people want to use. So, what can we do that will give us pedagogical and intellectual benefit, and realize the incredible resource that our collection is?

Let me present one idea, says David. The Library Task Force said emphatically that Harvard’s collection should be seen as one collection. It makes sense intellectually and financially. But that idea is in contention with the 56 physical libraries at Harvard. Also, most of our collection doesn’t circulate. Only some of it is digitally browsable, and some of that won’t change for a long long long time. E.g., our Arabic journals in Widener aren’t indexed, don’t publish cumulative indexes, and are very hard to index. Thus scholars need to be able to pull them off the shelves. Likewise for big collections of manuscripts that haven’t even been sorted yet.

One idea would be to say: Let’s treat physical libraries as one place as well. Think of them as contiguous, even though they’re not. What if bar-coded books stayed in the library you returned them to? Not shelved by a taxonomy. Random access via the digital, and it tells you where the work is. And build perfect shelves for the works that need to be physically organized. Let’s build perfect Shakespeare shelves. Put them in one building. The other less-used works will be findable, but not browsable. This would require investing in better findability systems, but it would let us get past the arbitrariness of classification systems. Already David will usually go to Amazon to decide if he wants a book rather than take the 5 mins to walk to the library. By focusing on perfect shelves for what is most important to be browsable, resources would be freed up. This might make more space in the physical libraries, so “we could think about what the people in those buildings want to be doing,” so people would come in because there’s more going on. (David notes that this model will not go over well with many of his colleagues.)

53% of library space at Harvard is stack space. The other 47% is split between patron space and staff space. About 20-25% is staff space. Comparatively, Harvard has less patron space than is typical. The HD is holding half the collection in 20% of the space. It’s 4x as expensive to store a work in a stack on campus as off.

David responds to a question: The perfect shelves should be dynamic, not permanent. That will better serve the evolution of research. There are independent variables: Classification and shelf location. We certainly need classification, but it may not need to map to shelf locations. Widener has bibliographic lists and shelf lists. Barcodes give us more freedom; we don’t have to constantly return works to fixed locations.

Mike Barker: Students already build their own perfect shelves with carrels.

Q: What’s the case for ownership and retention if we’re only addressing temporal faculty needs?

A lot of the collecting in the first half of the 20th C was driven by faculty requests. Not now. The question of retention and purchase splits on the basis of how uncommon the piece of info is. If it’s being sold by Amazon, I don’t think it really matters if we retain it, because of the number of copies and the archival steps already in place. The more rare the work, the more we should think about purchase and retention. But under a third of the stack space on campus has ideal environmental conditions. We shouldn’t put works we buy into those circumstances unless they’re being used.

Q: At the Law Library, we’re trying to spread it out so that not everyone is buying the same stuff. E.g., we buy Peruvian materials because other libraries aren’t. And many law books are not available digitally, so we buy them … but we only buy one copy.

Yes, you’re making an assessment. In the Divinity library, Mike looked at the duplication rate. It was 53%. That is, 53% of our works are duplicated in other Harvard libraries.

Mike: How much do we spend on classification? To create call numbers? We annually spend about $1.5-2M on it, plus another million shelving it. So, $3M-3.5M total. (Mike warns that this is a “very squishy” number.) We circulate about 700,000 items a year. The total operating budget of the Library is about $152M. (He derived this number by asking catalogers how long it takes to classify an item that lacks a call number, divided into salary.)

David: Scanning in tables of contents, indexes, etc., lets people find things without having to anticipate what they’re going to be interested in.

Q: Where does serendipity fall in this? What about when you don’t know what you’re looking for?

David: I agree completely. My dissertation depended on a book that no one had checked out since 1910. I found it on the stacks. But it’s not on the shelves now. Suppose I could ask a research librarian to bring me two shelves worth of stuff because I’m beginning to explore some area.

Q: What you’re suggesting won’t work so well for students. How would not having stacks affect students?

David: I’m being provocative but concrete. The status quo is not delivering what we think it does, and it hasn’t for the past three decades.

Q: [jeff goldenson] Public librarians tell us that the recently returned trucks are the most interesting place to go. We don’t really have the ability to see what’s moving in the Harvard system. Yes, there are privacy concerns, but just showing what books have been returned would be great.

Q: [palfrey] How much does the rise of the digital affect this idea? Also, you’ve said that the storage cost of a digital object may be more than that of physical objects. How does that affect this idea?

David: Copyright law is the big If. It’s not going away. But what kind of access do you have to digital objects that you own? That’s a huge variable. I’ve premised much of what I’ve said on the working notion that we will continue to build physical collections. We don’t know how much it will cost to keep a physical object for a long time. And computer scientists all say that digital objects are not durable. My working notion here is that the parts that are really crucial are the metadata pieces, which are more easily re-buildable if you have the physical objects. We’re not going to buy physical objects for all the digital items, so the selection principle goes back to how grey or black the items are. It depends on whether we get past the engineering question about digital durability — which depends a lot on electromagnetism as a storage medium, which may be a flash in the pan. We’re moving incrementally.

Q: [me] If we can identify the high value works that go on perfect shelves, why not just skip the physical shelves and increase the amount of metadata so that people can browse them looking for the sort of info they get from going to the physical shelf?

A: David: Money. We can’t spend too much on the present at the expense of the next century or two. There’s a threshold where you’d say that it’s worth digitizing them to the degree you’d need to replace physical inspection entirely. It’s a considered judgment, which we make, for example, when we decide to digitize exhibitions. You’d want to look at the opportunity costs.

David suggests that maybe the Divinity library (he’s in the Phil Dept.) should remove some stacks to make space for in-stack work and discussion areas. (He stresses that he’s just thinking out loud.)

Matthew Sheehy, who runs HD, says they’re thinking about how to keep books 500 years. They spend $300K/year on electricity to create the right environment. They’ve invested in redundancy. But, the walls of the HD will only last 100 years. [Nov. 25: I may have gotten the following wrong:] He thinks it costs about $1/year to store a book, not the usual figure of $0.45.

Jeffrey Schnapp: We’re building a library test kitchen. We’re interested in building physical shelves that have digital lives as well.

[Nov. 25: Changed Philosophy school to Divinity, in order to make it correct. Switched the remark about the cost of physical vs. digital in the interest of truth.]


October 4, 2011

ShelfLife and LibraryCloud: What we did all summer

We’re really really really pleased that the Digital Public Library of America has chosen two of our projects to be considered (at an Oct. 21 open plenary meeting) for implementation as part of the DPLA’s beta sprint. The Harvard Library Innovation Lab (Annie Cain, Paul Deschner, Jeff Goldenson, Matt Phillips, and Andy Silva), which I co-direct (along with Kim Dulin), worked insanely hard all summer to turn our prototypes for Harvard into services suitable for a national public library. I have to say I’m very proud of what our team accomplished, and below is a link that will let you try out what we came up with.

Upon the announcement of the beta sprint in May, we partnered up with folks at thirteen other institutions…an amazing group of people. Our small team at Harvard, with generous internal support, built ShelfLife and LibraryCloud on top of the integrated catalogs of five libraries, public and university, with a combined count of almost 15 million items, plus circulation data. We also pulled in some choice items from the Web, including metadata about every TED talk, open courseware, and Wikipedia pages about books. (Finding all or even most of the Wikipedia pages about books required real ingenuity on the part of our team, and was a fun project that we’re in the process of writing up.)

The metadata about those items goes into LibraryCloud, which collects and openly publishes that metadata via APIs and as linked open data. We’re proposing LibraryCloud to DPLA as a metadata server for the data DPLA collects, so that people can write library analytics programs, integrate library item information into other sites and apps, build recommendation and navigation systems, etc. We see this as an important way for what libraries know to become fully a part of the Web ecosystem.
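As a rough illustration of what publishing metadata via APIs enables, here is a hedged sketch of the kind of client we have in mind. The endpoint URL, parameter names, and JSON fields below are hypothetical placeholders rather than the real LibraryCloud interface; the pattern is the point: query a metadata API, get JSON back, and run your own analytics over it.

    # Hypothetical sketch of a LibraryCloud-style client. The URL, parameters,
    # and JSON fields are illustrative assumptions, not the actual API.
    import requests
    from collections import Counter

    BASE = "https://librarycloud.example.org/api/items"  # placeholder endpoint

    def search_items(query, limit=100):
        """Fetch item metadata records matching a keyword query."""
        resp = requests.get(BASE, params={"q": query, "limit": limit})
        resp.raise_for_status()
        return resp.json()["items"]  # assumed response envelope

    def subject_histogram(items):
        """A toy analytics step: count subject headings across the results."""
        counts = Counter()
        for item in items:
            for subject in item.get("subjects", []):
                counts[subject] += 1
        return counts.most_common(10)

    if __name__ == "__main__":
        items = search_items("Shakespeare")
        for subject, n in subject_histogram(items):
            print(f"{n:4d}  {subject}")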

ShelfLife is one of those possible recommendation and navigation systems. It is based on a few basic hypotheses:

- The DPLA should be not only a service but a place, where people can not only read/view items but also engage with other users.

- Library items do not exist on their own, but are always part of various webs. It’s helpful to be able to switch webs and contexts with minimal disruption.

- The behavior of the users of a collection of items can be a good guide to those items; we think of this as “community relevance,” and calculate it as “shelfRank.” (A toy sketch of the idea follows this list.)

- The system should be easy to use but enable users to drill down or pop back up easily.

- Libraries are social systems. Library items are social objects. A library navigation system should be social as well.
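Since shelfRank does a lot of work in the third hypothesis above, here is a deliberately toy sketch of what a community-relevance score computed from circulation-style signals could look like. The signals, weights, and field names are invented for illustration; they are not the actual ShelfLife calculation. The log scaling is simply one reasonable way to keep a handful of blockbuster titles from drowning out everything else.

    # Toy illustration of a community-relevance ("shelfRank"-style) score.
    # Weights and record fields are invented for this sketch; they are not
    # the actual ShelfLife formula.
    from dataclasses import dataclass
    from math import log1p

    @dataclass
    class UsageRecord:
        checkouts: int        # total circulation count
        holdings: int         # number of member libraries holding the item
        course_reserves: int  # times placed on a course reserve list

    WEIGHTS = {"checkouts": 1.0, "holdings": 2.0, "course_reserves": 4.0}

    def shelf_rank(record: UsageRecord) -> float:
        """Combine usage signals on a log scale so heavy hitters don't swamp the rest."""
        return (
            WEIGHTS["checkouts"] * log1p(record.checkouts)
            + WEIGHTS["holdings"] * log1p(record.holdings)
            + WEIGHTS["course_reserves"] * log1p(record.course_reserves)
        )

    if __name__ == "__main__":
        hamlet = UsageRecord(checkouts=1200, holdings=5, course_reserves=40)
        obscure = UsageRecord(checkouts=3, holdings=1, course_reserves=0)
        print(f"Hamlet: {shelf_rank(hamlet):.2f}  obscure title: {shelf_rank(obscure):.2f}")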

Apparently the DPLA agreed enough to select ShelfLife and LibraryCloud along with five other projects out of 38 submitted proposals. The other five projects — along with another three in a “lightning round” (where the stakes are doubled and anything can happen??) — are very strong contenders and in some cases quite amazing. It seems clear to our team that there are synergies among them that we hope and assume the DPLA also recognizes. In any case, we’re honored to be in this group, and look forward to collaborating no matter what the outcome.

You can try the prototype of ShelfLife and LibraryCloud here. Keep in mind please that this is live code running on top of a database of 15M items in real time, and that it is a prototype (and in certain noted areas merely a demo or sketch). I urge you to take the tour first; there’s a lot in these two projects that you’ll miss if you don’t.


August 23, 2011

The unframed Net

It’s clear that we don’t know how to explain the Internet. Is it a medium? Is it a culture, a subworld, or a parallel world? Is it a communication system? We bounce around, and we disagree.

Nevertheless, I am not as worried about our lacking the right framing for the Net as are some of my friends and colleagues.

For one thing, the same refusal to be pinned down characterizes everything. What something _is_ depends on what we’re trying to do with it, even within a culturally/linguistically homogeneous group. You can try this exercise with anything from terrorism to television to candy bars. (To pin myself down about why I think we can’t pin things down: I am sort of a phenomenological pragmatist. I also think that everything is miscellaneous, but that’s just me.)

So, we assimilate the Internet to existing concepts. There is nothing slovenly or cowardly about this. It’s how we understand things.

So, why does the Net seem special to us? Why does it seem to bust our frames ‘n’ paradigms? After all, we could assimilate the Net into older paradigms, because it is a series of tubes, and it is a communications medium, and it is a way of delivering content. Not only could we assimilate it, there are tremendous pressures to do so.

But for pragmatic (and Pragmatic) reasons, some of us (me included) don’t want to let that happen. It would foreclose cultural and political consequences we yearn for — the “we” that has flocked to the Net and that loves it for what it is and could be. The Net busts frames because it serves our purposes to have it do so.

This is why I find myself continuing to push Internet Exceptionalism, even though it does at times make me look foolish. Internet Exceptionalism is not an irrational exuberance. It is a political position. More exactly, it is a political yearning.

That’s why I’m not much bothered by the fact that we don’t have a new frame for the Net: frames are always inadequate, and the frame-busting nature of the Net serves our purposes.

In that sense, the way to frame the Internet is to keep insisting that the Net does not fit well into the old frame. Those of us who love the Net need to keep hammering on the fact that the old frames are inadequate, that the Net is exceptional, not yet assimilated to understanding, still to be invented, open to possibility, liberating of human and social potential, a framework for hope.

Eventually we’ll have the new frame for the Internet. It will be, I will boldly predict, the Internet :) In fact, open networks already are the new frame, and are sweeping aside old ways of thinking. Everything is a network.

The Internet will transition quickly from un-frameable to becoming the new frame. Until then, we should (imo) embrace the un-frameability of the Net as its framing.


June 24, 2011

Tagging the National Archives

The National Archives is going all tag-arrific on us:

The Online Public Access prototype (OPA) just got an exciting new feature — tagging! As you search the catalog, we now invite you to tag any archival description, as well as person and organization name records, with the keywords or labels that are meaningful to you. Our hope is that crowdsourcing tags will enhance the content of our online catalog and help you find the information you seek more quickly.

Nice! (Hat tip to Infodocket for the tip)


July 15, 2010

RadioBerkman interviews Tim Hwang

I am a Tim Hwang fanboy. Tim is one of the founders of ROFLcon and The Awesome Foundation. So, I was very happy to get to interview him for Radio Berkman. We talk about classifying Internet enthusiasts, and about whether there are schools of thought emerging among people who think about and research the Net.

Tim’s pretty damn insightful and delightful. In fact, Tim Hwang is awesome.


October 11, 2009

Net uncovers new type of cloud

There are reports of a new type of cloud, one that is not currently in the official International Cloud Atlas. Or, possibly, it is a formation that’s been around forever, but the scattered reports are only now coalescing thanks to the Net.

According to Amazon’s review of Richard Hamblyn’s The Invention of Clouds, we only began thinking clouds could be categorized in 1802 when Luke Howard started giving public lectures. The very idea that clouds — the paradigm of uncatchable — could be divided into groups was (apparently) fascinating and thrilling. (Lamarck had also categorized clouds, but it didn’t catch on.)

A quick googly scan makes it seem that the cloud taxonomy is pretty messy. For example, the University of Illinois’ “cloud types” page lists four broad categories, and a list of miscellaneous clouds, each of which is categorized under one of the four basic types, evoking a “Huh?” reaction from at least one of us. The cloud taxonomy page at Univ. Missouri-Columbia lists eight types. Do you categorize by what they look like, how high they are, what they do (rain or not?), which celebrity profiles they resemble …? Categorizing clouds is truly a Borgesian task.

And, dammit, wouldn’t you know? Here’s a poem by Jorge Luis Borges called: “Clouds (II)” (with the line-endings probably removed):

Placid mountains meander through the air, or tragic cordilleras cast a pall, overshadowing the day. They are what we call clouds. And their shapes are often strange and rare. Shakespeare observed one once. It seemed to be a dragon. That one cloud of an afternoon still kindles in his words and blazes down, so that we go on seeing it today. What are the clouds? An architecture of chance? Perhaps they are the necessary things from which God weaves his vast imaginings, threads of a web of infinite expanse. Maybe the cloud is emptiness returning, just like the man who watches it this morning.

(translated by Richard Barnes; Robert Mezey and Richard Barnes, “Clouds (II),” The American Poetry Review, World Poetry, Inc., 1996; via HighBeam Research, 11 Oct. 2009)

More Borges poems


