Joho the Blog: metadata Archives (Page 2 of 9)

July 3, 2012

[2b2k] The inevitable messiness of digital metadata

This is cross-posted at the Harvard Digital Scholarship blog.

Neil Jeffries, research and development manager at the Bodleian Libraries, has posted an excellent op-ed at the Wikipedia Signpost about how best to represent scholarly knowledge in an imperfect world.

He sets out two basic assumptions: (1) Data has meaning only within context; (2) We are not going to agree on a single metadata standard. In fact, we could connect those two points: Contexts of meaning are so dependent on the discipline and the user's project and standpoint that it is unlikely that a single metadata standard could suffice. In any case, the proliferation of standards is simply a fact of life at this point.

Given those constraints, he asks, what's the best way to increase the interoperability of the knowledge and data that are accumulating online at a pace that provokes extremes of anxiety and joy in equal measure? He sees a useful consensus emerging on three points: (a) There are some common and basic types of data across almost all aggregations. (b) There is increasing agreement that these data types have some simple, common properties that suffice to identify them and to give us humans an idea about whether we want to delve deeper. (c) Aggregations themselves are useful for organizing data, even when they are loose webs rather than tight hierarchies.

Neil then proposes RDF and linked data as appropriate ways to capture the very important relationships among ideas, pointing to the Semantic MediaWiki as a model. But, he says, we need to capture additional metadata that qualifies the data, including who made the assertion, links to differences of scholarly opinion, omissions from the collection, and the quality of the evidence. "Rather than always aiming for objective statements of truth we need to realise that a large amount of knowledge is derived via inference from a limited and imperfect evidence base, especially in the humanities," he says. "Thus we should aim to accurately represent the state of knowledge about a topic, including omissions, uncertainty and differences of opinion."
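A minimal sketch of the kind of qualified statement Neil is calling for: each subject/predicate/object triple carries metadata about who asserted it and what the evidence is like. The class and the "ex:" names below are invented for illustration; they are not any actual RDF vocabulary (real work would use RDF reification or named graphs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualifiedAssertion:
    # A triple plus the qualifying metadata Neil argues for:
    # who made the assertion, and what the evidence is like.
    subject: str
    predicate: str
    obj: str
    asserted_by: str
    evidence: str  # e.g. "documented", "inferred", "disputed"

# Two scholars disagree about the same triple; both claims survive
# side by side instead of being collapsed into a single "truth".
claims = [
    QualifiedAssertion("ex:Folio1623", "ex:printedBy", "ex:Jaggard",
                       asserted_by="ex:ScholarA", evidence="documented"),
    QualifiedAssertion("ex:Folio1623", "ex:printedBy", "ex:UnknownPrinter",
                       asserted_by="ex:ScholarB", evidence="inferred"),
]

# A difference of scholarly opinion shows up as one subject/predicate
# pair with more than one asserted object.
contested = {(c.subject, c.predicate) for c in claims}
```

The point of the design is that disagreement is represented rather than erased: a query can surface all competing claims along with their provenance.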

Neil's proposals have the strengths of acknowledging the imperfection of any attempt to represent knowledge, and of recognizing that the value of representing knowledge lies mainly in its being linked to its sources, its context, its controversies, and to other disciplines. It seems to me that such a system would not only have tremendous pragmatic advantages; for all its messiness and lack of coherence, it is in fact a more accurate representation of knowledge than a system that is fully neatened up and nailed down. That is, messiness is not only the price we pay for scaling knowledge aggressively and collaboratively; it is a property of networked knowledge itself.



June 6, 2012


I learned yesterday from Robin Wendler (who worked mightily on the project) that Harvard’s library catalog dataset of 12.3M records has been bulk downloaded a thousand times, excluding the Web spiderings. That seems like an awful lot to me, and makes me happy.

The library catalog dataset comprises bibliographic records of almost all of Harvard Library’s gigantic collection. It’s available under a CC0 public domain license for bulk download, and can be accessed through an API via the DPLA’s prototype platform. More info here.


April 24, 2012

[2b2k][everythingismisc] “Big data for books”: Harvard puts metadata for 12M library items into the public domain

(Here’s a version of the text of a submission I just made to Boing Boing through their “Submitterator”)

Harvard University has today put into the public domain (CC0) full bibliographic information about virtually all the 12M works in its 73 libraries. This is (I believe) the largest and most comprehensive such contribution. The metadata, in the standard MARC21 format, is available for bulk download from Harvard. The University also provided the data to the Digital Public Library of America’s prototype platform for programmatic access via an API. The aim is to make rich data about this cultural heritage openly available to the Web ecosystem so that developers can innovate, and so that other sites can draw upon it.
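For a sense of what “the standard MARC21 format” means at the byte level: a MARC21 file is a sequence of ISO 2709 records, each with a 24-character leader, a directory of 12-character entries, and delimiter-separated fields. The sketch below parses one tiny hand-built record using only Python’s standard library; it is a toy that ignores character encodings and malformed input, and real processing of the Harvard dump would use a dedicated library such as pymarc:

```python
FT, RT, SF = "\x1e", "\x1d", "\x1f"  # field/record terminators, subfield delimiter

def parse_marc(record: str) -> dict:
    """Split one MARC21 (ISO 2709) record into {tag: [raw field data]}.
    Minimal sketch: assumes a single well-formed record."""
    leader = record[:24]
    base = int(leader[12:17])          # base address of the data section
    directory = record[24:base - 1]    # 12-char entries, minus the terminator
    fields = {}
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3]
        length = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        # slice out the field's data, dropping its trailing field terminator
        fields.setdefault(tag, []).append(
            record[base + start:base + start + length - 1])
    return fields

# A tiny hand-built record: control field 001 plus a title field 245.
leader = "00066nam  22" + "00049" + "   4500"          # length 66, base address 49
directory = "001" + "0006" + "00000" + "245" + "0010" + "00006" + FT
data = "12345" + FT + "00" + SF + "aTitle" + FT + RT
fields = parse_marc(leader + directory + data)
# fields["001"] -> ["12345"]; fields["245"] -> ["00\x1faTitle"]
```

The directory is what makes bulk processing cheap: a consumer can seek straight to the fields it cares about without interpreting the whole record.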

This is part of Harvard’s new Open Metadata policy which is VERY COOL.

Speaking for myself (see disclosure), I think this is a big deal. Library metadata has been jammed up by licenses and fear. Not only does this make accessible a very high percentage of the most consulted library items, but I also hope it will help open the floodgates.

(Disclosures: 1. I work in the Harvard Library and have been a very minor player in this process. The credit goes to the Harvard Library’s leaders and the Office of Scholarly Communication, who made this happen. Also: Robin Wendler. (next day:) Also, John Palfrey who initiated this entire thing. 2. I am the interim head of the DPLA prototype platform development team. So, yeah, I’m conflicted out the wazoo on this. But my wazoo and all the rest of me is very very happy today.)

Finally, note that Harvard asks that you respect community norms, including attributing the source of the metadata as appropriate. This holds as well for the data that comes from OCLC, which is a valuable part of this collection.


February 13, 2012

[2b2k] BibSoup is in beta

Congratulations to the Open Knowledge Foundation on the launch of BibSoup, a site where anyone can upload and share a bibliography. It’s a great idea, and an awesome addition to the developing knowledge ecosystem.


November 2, 2011

The hotel with no metadata

I’m staying at a “boutique” hotel in NYC that is so trendy that it has not only dressed its beautiful young staff in black but has also removed as much metadata as it can. There’s no sign outside. There are no pointers to the elevators on the room floors. The hotel floors in the elevator are poorly designated, so that two in our party ended up on a service floor, wandering in search of a way back into the public space of the hotel. The common areas are so underlit that I had to find a precious lamp to stand next to so that the person I was waiting for could find me. The room keycards are white and unmarked, giving no indication of which end goes into the lock.

Skipping metadata has always been a sign of mastery or in-ness. It’s like playing a fretless guitar. But hotels are for strangers and first-timers. I need me my metadata!

BTW, I think the hotel’s name is the Hudson, but it’s really not easy to tell.


October 4, 2011

ShelfLife and LibraryCloud: What we did all summer

We’re really really really pleased that the Digital Public Library of America has chosen two of our projects to be considered (at an Oct. 21 open plenary meeting) for implementation as part of the DPLA’s beta sprint. The Harvard Library Innovation Lab (Annie Cain, Paul Deschner, Jeff Goldenson, Matt Phillips, and Andy Silva), which I co-direct along with Kim Dulin, worked insanely hard all summer to turn our prototypes for Harvard into services suitable for a national public library. I have to say I’m very proud of what our team accomplished, and below is a link that will let you try out what we came up with.

Upon the announcement of the beta sprint in May, we partnered up with folks at thirteen other institutions…an amazing group of people. Our small team at Harvard, with generous internal support, built ShelfLife and LibraryCloud on top of the integrated catalogs of five libraries, public and university, with a combined count of almost 15 million items, plus circulation data. We also pulled in some choice items from the Web, including metadata about every TED talk, open courseware, and Wikipedia pages about books. (Finding all or even most of the Wikipedia pages about books required real ingenuity on the part of our team, and was a fun project that we’re in the process of writing up.)

The metadata about those items goes into LibraryCloud, which collects and openly publishes that metadata via APIs and as linked open data. We’re proposing LibraryCloud to DPLA as a metadata server for the data DPLA collects, so that people can write library analytics programs, integrate library item information into other sites and apps, build recommendation and navigation systems, etc. We see this as an important way for what libraries know to become fully part of the Web ecosystem.

ShelfLife is one of those possible recommendation and navigation systems. It is based on a few basic hypotheses:

– The DPLA should be not only a service but a place: somewhere people can not just read and view items but also engage with other users.

– Library items do not exist on their own, but are always part of various webs. It’s helpful to be able to switch webs and contexts with minimal disruption.

– The behavior of the users of a collection of items can be a good guide to those items; we think of this as “community relevance,” and calculate it as “shelfRank.”

– The system should be easy to use but enable users to drill down or pop back up easily.

– Libraries are social systems. Library items are social objects. A library navigation system should be social as well.
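The post doesn’t spell out how shelfRank is actually computed, so here is a purely hypothetical sketch of “community relevance” as a weighted sum of usage signals; the weights and signal names are invented for illustration, not taken from ShelfLife:

```python
# Hypothetical weights -- the real shelfRank formula isn't given in the post.
WEIGHTS = {"checkouts": 2.0, "holdings": 1.0, "course_reserves": 3.0}

def shelf_rank(usage: dict) -> float:
    """Toy 'community relevance' score: a weighted sum of how the
    community has actually used an item. Purely illustrative."""
    return sum(WEIGHTS[k] * usage.get(k, 0) for k in WEIGHTS)

items = {
    "Moby-Dick": {"checkouts": 120, "holdings": 5, "course_reserves": 8},
    "Obscure Monograph": {"checkouts": 3, "holdings": 1},
}
# Rank items by community relevance, most relevant first.
ranked = sorted(items, key=lambda t: shelf_rank(items[t]), reverse=True)
# Moby-Dick: 2*120 + 1*5 + 3*8 = 269; the monograph: 2*3 + 1*1 = 7
```

The interesting design question, which any real formula has to answer, is which signals count as “use” and how to keep heavily held items from drowning out genuinely consulted ones.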

Apparently the DPLA agreed enough to select ShelfLife and LibraryCloud along with five other projects out of 38 submitted proposals. The other five projects — along with another three in a “lightning round” (where the stakes are doubled and anything can happen??) — are very strong contenders and in some cases quite amazing. It seems clear to our team that there are synergies among them that we hope and assume the DPLA also recognizes. In any case, we’re honored to be in this group, and look forward to collaborating no matter what the outcome.

You can try the prototype of ShelfLife and LibraryCloud here. Keep in mind please that this is live code running on top of a database of 15M items in real time, and that it is a prototype (and in certain noted areas merely a demo or sketch). I urge you to take the tour first; there’s a lot in these two projects that you’ll miss if you don’t.


September 27, 2011

Libraries of the future

We’ve just posted the latest Library Innovation Lab podcast, this one with Karen Coyle, a leading expert in Linked Open Data. Will we have perpetual but interoperable disagreements about how to classify and categorize works, and how to decide what is the “same” work?

And, if you care about libraries and are in the Cambridge (MA) area on Oct. 4, there’s a kickoff event at Sanders Theater at Harvard for a year of conversations about the future of libraries. Sounds great, although unfortunately I will be out of town :(


June 14, 2011

Linked Open Data take-aways

I just wrote up an informal trip report in the form of “take-aways” from the LOD-LAM conference I attended a couple of weeks ago. Here is a lightly edited version.


Because it was an unconference, it was too participatory to enable us to take systematic notes. I did, however, interview a number of attendees, and have posted the videos on the Library Innovation Lab blog site. I actually have a few more yet to post. In addition, during the course of one of the sessions (on “Explaining LOD-LAM”), a few of us began constructing a FAQ.

Here’s some of what I took away from the conference.

– There is considerable momentum around linked open data, starting with the sciences where there is particular research value in compiling huge data sets. Many libraries are joining in.

– LOD for libraries will enable a very fluid aggregation of information from multiple types of sources around any particular object. E.g., a page about a Hogarth illustration (or about Hogarth, or about 18th century London, etc.) could quite easily aggregate information from any data set that knows something about that illustration or about topics linked to that illustration. This information could be used to build a page or to do research.

– Making data and metadata available as LOD enables maximal re-use by others.

– Doing so requires expertise, but should be much less difficult than supporting many other standards.

– For the foreseeable future, this will be something libraries do in addition to supporting more traditional data standards; it will be an additional expense and effort.

– Although there is continuing debate about exactly which license to use when publishing library data sets, it seems that putting any form of license on the data other than a public domain waiver is usually likely to be (a) futile and (b) so difficult to deal with that it will inhibit re-use of the data, depriving it of value. (See the 4-star license proposal that came out of this conference.)

– The key point of resistance against LOD among libraries, archives and museums is the justified fear that once the data is released into the world, the curating institutions can no longer ensure that the metadata about an object is correct; the users of LOD might pick up a false attribution, inaccurate description, etc. This is a genuine risk, since LOD permits irresponsible use of data. The risk can be mitigated but not removed.
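The “fluid aggregation” point above (the Hogarth example) comes down to merging whatever each dataset asserts about a shared identifier. A toy Python sketch, with invented URIs and datasets:

```python
# Two hypothetical datasets that both know something about the same
# Hogarth illustration, keyed by a shared URI (the URIs are invented).
museum = {
    "ex:hogarth/gin-lane": {"title": "Gin Lane", "year": 1751},
}
gazetteer = {
    "ex:hogarth/gin-lane": {"setting": "St Giles, London"},
}

def aggregate(uri: str, *datasets: dict) -> dict:
    """Merge whatever each dataset knows about one URI: the core move
    that shared identifiers make cheap in linked open data."""
    merged = {}
    for ds in datasets:
        merged.update(ds.get(uri, {}))
    return merged

page = aggregate("ex:hogarth/gin-lane", museum, gazetteer)
# {'title': 'Gin Lane', 'year': 1751, 'setting': 'St Giles, London'}
```

This simple merge also illustrates the risk in the last bullet: nothing in the mechanism checks that what a dataset asserts about the URI is correct, so errors travel just as easily as facts.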


June 8, 2011

MacKenzie Smith on open licenses for metadata

MacKenzie Smith of MIT and Creative Commons talks about the new 4-star rating system for open licenses for metadata from cultural institutions:

The draft is up on the LOD-LAM site.

Here are some comments on the system from open access guru Peter Suber.


June 6, 2011

Peter Suber on the 4-star openness rating

One of the outcomes of the LOD-LAM conference was a draft of an idea for a 4-star classification of the openness of metadata from cultural institutions. The classification is nicely counter-intuitive, which is to say that it’s useful.

I asked Peter Suber, the Open Access guru, what he thought of it. He replied in an email:

First, I support the open knowledge definition and I support a star system to make it easy to refer to different degrees of openness.

* I’m not sure where this particular proposal comes from. But I recommend working with the Open Knowledge Foundation, which developed the open knowledge definition. The more key players who accept the resulting star system, the more widely it will be used.

* This draft overlooks some complexity in the 3-star entry and the 2-star entry. Currently it suggests that attribution through linking is always more open than attribution by other means (say, by naming without linking). But this is untrue. Sometimes one is more difficult than the other. In a given case, the easier one is more open by lowering the barrier to distribution.

If you or your software had both names and links for every datasource you wanted to attribute, then attribution by linking and attribution by naming would be about equal in difficulty and openness. But if you had names without links, then obtaining the links would be an extra burden that would delay or impede distribution.

The disparity in openness grows as the number of datasources increases. On this point, see the Protocol for Implementing Open Access Data (by John Wilbanks for Science Commons, December 2007).

Relevant excerpt: “[T]here is a problem of cascading attribution if attribution is required as part of a license approach. In a world of database integration and federation, attribution can easily cascade into a burden for scientists….Would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets?” In the original context, Wilbanks uses this (cogently) as an argument for the public domain, or for shedding an attribution requirement. But in the present context, it complicates the ranking system. If you *did* have to attribute a result to 40,000 data sources, and if you had names but not links for many of those sources, then attribution by naming would be *much* easier than attribution by linking.

Solution? I wouldn’t use stars to distinguish methods of attribution. Make CC-BY (or the equivalent) the first entry after the public domain, and let it cover any and all methods of attribution. But then include an annotation explaining that some methods of attribution increase the difficulty of distribution, and that increasing the difficulty will decrease openness. Unfortunately, however, we can’t generalize about which methods of attribution raise and lower this barrier, because it depends on what metadata the attributing scholar may already possess or have ready to hand.

* The overall implication is that anything less open than CC-BY-SA deserves zero stars. On the one hand, I don’t mind that, since I’d like to discourage anything less open than CC-BY-SA. On the other, while CC-BY-NC and CC-BY-ND are less open than CC-BY-SA, they’re more open than all-rights-reserved. If we wanted to recognize that in the star system, we’d need at least one more star to recognize more species.

I responded with a question: “WRT your naming vs. linking comments: I assumed the idea was that it’s attribution-by-link vs. attribution-by-some-arbitrary-requirement. So, if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data: your aggregating software has to understand my rule, and you have to follow potentially 40,000 different rules if you’re aggregating from 40,000 different databases.”

Peter responded:

You’re right that “if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data.” However, if I already have the phrases or marks, but not the URLs, then requiring me to attribute by linking would be the same sort of barrier. My point is that the easier path depends on which kinds of metadata we already have, or which kinds are easier for us to get. It’s not the case that one path is always easier than another.

But it might be the case that one path (attribution by linking) is *usually* easier than another. That raises a nice question: should that shifting, statistical difference be recognized with an extra star? I wouldn’t mind, provided we acknowledged the exceptions in an annotation.

