June 14, 2011

Linked Open Data take-aways

I just wrote up an informal trip report in the form of “take-aways” from the LOD-LAM conference I attended a couple of weeks ago. Here is a lightly edited version.


Because it was an unconference, it was too participatory to enable us to take systematic notes. I did, however, interview a number of attendees, and have posted the videos on the Library Innovation Lab blog site. I actually have a few more yet to post. In addition, during the course of one of the sessions (on “Explaining LOD-LAM”), a few of us began constructing a FAQ.

Here’s some of what I took away from the conference.

– There is considerable momentum around linked open data, starting with the sciences where there is particular research value in compiling huge data sets. Many libraries are joining in.

– LOD for libraries will enable a very fluid aggregation of information from multiple types of sources around any particular object. E.g., a page about a Hogarth illustration (or about Hogarth, or about 18th century London, etc.) could quite easily aggregate information from any data set that knows something about that illustration or about topics linked to that illustration. This information could be used to build a page or to do research.

– Making data and metadata available as LOD enables maximal re-use by others.

– Doing so requires expertise, but should be less massively difficult than supporting many other standards.

– For the foreseeable future, this will be something libraries do in addition to supporting more traditional data standards; it will be an additional expense and effort.

– Although there is continuing debate about exactly which license to use when publishing library data sets, it seems that putting any license on the data other than a public domain waiver is usually likely to be (a) futile and (b) so difficult to deal with that it will inhibit re-use of the data, depriving it of value. (See the 4-star license proposal that came out of this conference.)

– The key point of resistance against LOD among libraries, archives and museums is the justified fear that once the data is released into the world, the curating institutions can no longer ensure that the metadata about an object is correct; the users of LOD might pick up a false attribution, inaccurate description, etc. This is a genuine risk, since LOD permits irresponsible use of data. The risk can be mitigated but not removed.
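
To make the aggregation point above concrete, here is a minimal sketch of pulling together what several data sets say about one object. Everything in it is an assumption for illustration: the three data sets, the `ex:` identifiers, and the `aggregate` and `linked_subjects` helpers are hypothetical, and real LOD would use full URIs and an RDF library rather than plain tuples.

```python
from collections import defaultdict

# Hypothetical data sets, each a list of (subject, predicate, object) statements.
# The "ex:" identifiers and property names are invented for this sketch.
MUSEUM = [
    ("ex:gin-lane", "ex:creator", "ex:hogarth"),
    ("ex:gin-lane", "ex:title", "Gin Lane"),
]
LIBRARY = [
    ("ex:hogarth", "ex:name", "William Hogarth"),
    ("ex:gin-lane", "ex:depicts", "ex:london"),
]
GAZETTEER = [
    ("ex:london", "ex:label", "London"),
]


def aggregate(subject, *datasets):
    """Collect every statement about `subject`, tagged with the index of its source."""
    facts = defaultdict(list)
    for source, dataset in enumerate(datasets):
        for s, p, o in dataset:
            if s == subject:
                facts[p].append((o, source))
    return dict(facts)


def linked_subjects(subject, *datasets):
    """Return the identifiers `subject` points to, which can be looked up in turn."""
    return sorted(
        {o for ds in datasets for s, p, o in ds if s == subject and o.startswith("ex:")}
    )


page = aggregate("ex:gin-lane", MUSEUM, LIBRARY, GAZETTEER)
related = linked_subjects("ex:gin-lane", MUSEUM, LIBRARY, GAZETTEER)
```

Because every data set names the illustration with the same identifier, the page about it can draw on all of them, and `related` gives the next hops (Hogarth, London) to aggregate around. Keeping the source index with each fact is one small way to mitigate the attribution risk noted in the last bullet.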


June 8, 2011

MacKenzie Smith on open licenses for metadata

MacKenzie Smith of MIT and Creative Commons talks about the new 4-star rating system for open licenses for metadata from cultural institutions:

The draft is up on the LOD-LAM site.

Here are some comments on the system from open access guru Peter Suber.


June 6, 2011

Peter Suber on the 4-star openness rating

One of the outcomes of the LOD-LAM conference was a draft of an idea for a 4-star classification of openness of metadata from cultural institutions. The classification is nicely counter-intuitive, which is to say that it’s useful.

I asked Peter Suber, the Open Access guru, what he thought of it. He replied in an email:

First, I support the open knowledge definition and I support a star system to make it easy to refer to different degrees of openness.

* I’m not sure where this particular proposal comes from. But I recommend working with the Open Knowledge Foundation, which developed the open knowledge definition. The more key players who accept the resulting star system, the more widely it will be used.

* This draft overlooks some complexity in the 3-star entry and the 2-star entry. Currently it suggests that attribution through linking is always more open than attribution by other means (say, by naming without linking). But this is untrue. Sometimes one is more difficult than the other. In a given case, the easier one is more open by lowering the barrier to distribution.

If you or your software had both names and links for every datasource you wanted to attribute, then attribution by linking and attribution by naming would be about equal in difficulty and openness. But if you had names without links, then obtaining the links would be an extra burden that would delay or impede distribution.

The disparity in openness grows as the number of datasources increases. On this point, see the Protocol for Implementing Open Access Data (by John Wilbanks for Science Commons, December 2007).

Relevant excerpt: “[T]here is a problem of cascading attribution if attribution is required as part of a license approach. In a world of database integration and federation, attribution can easily cascade into a burden for scientists….Would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets?” In the original context, Wilbanks uses this (cogently) as an argument for the public domain, or for shedding an attribution requirement. But in the present context, it complicates the ranking system. If you *did* have to attribute a result to 40,000 data sources, and if you had names but not links for many of those sources, then attribution by naming would be *much* easier than attribution by linking.

Solution? I wouldn’t use stars to distinguish methods of attribution. Make CC-BY (or the equivalent) the first entry after the public domain, and let it cover any and all methods of attribution. But then include an annotation explaining that some methods of attribution increase the difficulty of distribution, and that increasing the difficulty will decrease openness. Unfortunately, however, we can’t generalize about which methods of attribution raise and lower this barrier, because it depends on what metadata the attributing scholar may already possess or have ready to hand.

* The overall implication is that anything less open than CC-BY-SA deserves zero stars. On the one hand, I don’t mind that, since I’d like to discourage anything less open than CC-BY-SA. On the other, while CC-BY-NC and CC-BY-ND are less open than CC-BY-SA, they’re more open than all-rights-reserved. If we wanted to recognize that in the star system, we’d need at least one more star to recognize more species.

I responded with a question: “WRT your naming vs. linking comments: I assumed the idea was that it’s attribution-by-link vs. attribution-by-some-arbitrary-requirement. So, if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data: Your aggregating sw has to understand my rule, and you have to follow potentially 40,000 different rules if you’re aggregating from 40,000 different databases.”

Peter responded:

You’re right that “if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data.” However, if I already have the phrases or marks, but not the URLs, then requiring me to attribute by linking would be the same sort of barrier. My point is that the easier path depends on which kinds of metadata we already have, or which kinds are easier for us to get. It’s not the case that one path is always easier than another.

But it might be the case that one path (attribution by linking) is *usually* easier than another. That raises a nice question: should that shifting, statistical difference be recognized with an extra star? I wouldn’t mind, provided we acknowledged the exceptions in an annotation.

June 5, 2011

How to digitize a million books

Brewster Kahle gives a tour of one of the Internet Archive’s book scanning facilities. This one is part of the Archive’s San Francisco headquarters:

Recorded during a tour of the facilities, as part of the LOD-LAM conference.

June 2, 2011

[lodlam] The rise of Linked Open Data

At the Linked Open Data in Libraries, Archives and Museums conf [LODLAM], Jonathan Rees casually offered what I thought was a useful distinction. (Also note that I am certainly getting this a little wrong, and could possibly be getting it entirely wrong.)

Background: RDF is the basic format of data in the Semantic Web and LOD; it consists of statements of the form “A is in some relation to B.”

My paraphrase: Before LOD, we were trying to build knowledge representations of the various realms of the world. Therefore, it was important that the RDF triples expressed were true statements about the world. In LOD, triples are taken as a way of expressing data; take your internal data, make it accessible as RDF, and let it go into the wild…or, more exactly, into the commons. You’re not trying to represent the world; you’re just trying to represent your data so that it can be reused. It’s a subtle but big difference.
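
Here is a rough sketch of what “take your internal data, make it accessible as RDF” can look like in the simplest case. It is only an illustration: the catalog record, its field names, the `ex:` prefix, and the `record_to_triples` helper are all assumptions of mine, and a real system would mint proper URIs and serialize with an RDF library.

```python
# A hypothetical internal catalog record; the field names are invented for this sketch.
record = {
    "id": "book-42",
    "title": "Moby-Dick",
    "author": "Herman Melville",
    "year": 1851,
}


def record_to_triples(record, prefix="ex:"):
    """Turn one record into (subject, predicate, object) statements.

    The record's id becomes the subject; every other field becomes one
    statement of the form "A is in some relation to B".
    """
    subject = prefix + record["id"]
    return [
        (subject, prefix + field, value)
        for field, value in record.items()
        if field != "id"
    ]


triples = record_to_triples(record)
for s, p, o in triples:
    print(s, p, o)
```

Note that nothing here tries to model the world. The triples simply restate what the record already says, in a shape that others can pick up and link to.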

I also like John Wilbanks’ provocative tweet-length explanation of LOD: “Linked open data is duct tape that some people mistake for infrastructure. Duct tape is awesome.”

Finally, it’s pretty awesome to be at a techie conference where about half the participants are women.