Joho the Blog » metadata

June 8, 2011

MacKenzie Smith on open licenses for metadata

MacKenzie Smith of MIT and Creative Commons talks about the new 4-star rating system for open licenses for metadata from cultural institutions:

[embedded video]

The draft is up on the LOD-LAM site.

Here are some comments on the system from open access guru Peter Suber.


June 6, 2011

Peter Suber on the 4-star openness rating

One of the outcomes of the LOD-LAM conference was a draft of an idea for a 4-star classification of openness of metadata from cultural institutions. The classification is nicely counter-intuitive, which is to say that it’s useful.

I asked Peter Suber, the Open Access guru, what he thought of it. He replied in an email:

First, I support the open knowledge definition and I support a star system to make it easy to refer to different degrees of openness.

* I’m not sure where this particular proposal comes from. But I recommend working with the Open Knowledge Foundation, which developed the open knowledge definition. The more key players who accept the resulting star system, the more widely it will be used.

* This draft overlooks some complexity in the 3-star entry and the 2-star entry. Currently it suggests that attribution through linking is always more open than attribution by other means (say, by naming without linking). But this is untrue. Sometimes one is more difficult than the other. In a given case, the easier one is the more open, because it lowers the barrier to distribution.

If you or your software had both names and links for every datasource you wanted to attribute, then attribution by linking and attribution by naming would be about equal in difficulty and openness. But if you had names without links, then obtaining the links would be an extra burden that would delay or impede distribution.

The disparity in openness grows as the number of datasources increases. On this point, see the Protocol for Implementing Open Access Data (by John Wilbanks for Science Commons, December 2007).

Relevant excerpt: “[T]here is a problem of cascading attribution if attribution is required as part of a license approach. In a world of database integration and federation, attribution can easily cascade into a burden for scientists….Would a scientist need to attribute 40,000 data depositors in the event of a query across 40,000 data sets?” In the original context, Wilbanks uses this (cogently) as an argument for the public domain, or for shedding an attribution requirement. But in the present context, it complicates the ranking system. If you *did* have to attribute a result to 40,000 data sources, and if you had names but not links for many of those sources, then attribution by naming would be *much* easier than attribution by linking.

Solution? I wouldn’t use stars to distinguish methods of attribution. Make CC-BY (or the equivalent) the first entry after the public domain, and let it cover any and all methods of attribution. But then include an annotation explaining that some methods of attribution increase the difficulty of distribution, and that increasing the difficulty decreases openness. Unfortunately, however, we can’t generalize about which methods of attribution raise and lower this barrier, because it depends on what metadata the attributing scholar may already possess or have ready to hand.

* The overall implication is that anything less open than CC-BY-SA deserves zero stars. On the one hand, I don’t mind that, since I’d like to discourage anything less open than CC-BY-SA. On the other, while CC-BY-NC and CC-BY-ND are less open than CC-BY-SA, they’re more open than all-rights-reserved. If we wanted to recognize that in the star system, we’d need at least one more star to recognize more species.

I responded with a question: “WRT your naming vs. linking comments: I assumed the idea was that it’s attribution-by-link vs. attribution-by-some-arbitrary-requirement. So, if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data: Your aggregating sw has to understand my rule, and you have to follow potentially 40,000 different rules if you’re aggregating from 40,000 different databases.”

Peter responded:

You’re right that “if I require you to attribute by sticking in a particular phrase or mark, I’m making it harder for you to just scoop up and republish my data.” However, if I already have the phrases or marks, but not the URLs, then requiring me to attribute by linking would be the same sort of barrier. My point is that the easier path depends on which kinds of metadata we already have, or which kinds are easier for us to get. It’s not the case that one path is always easier than another.

But it might be the case that one path (attribution by linking) is *usually* easier than another. That raises a nice question: should that shifting, statistical difference be recognized with an extra star? I wouldn’t mind, provided we acknowledged the exceptions in an annotation.
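To make the cascading-attribution arithmetic concrete, here’s a minimal sketch of my own (the dataset names and URLs are invented, and this is nobody’s real tooling) of how the cost of complying depends on which metadata you already hold:

```python
# Illustrative sketch only; names and URLs below are invented.

def attribution_line(source):
    """Prefer attribution by linking if we already hold a URL;
    fall back to attribution by naming."""
    if "url" in source:
        return f'{source.get("name", source["url"])} <{source["url"]}>'
    if "name" in source:
        return source["name"]
    # Neither on hand: someone must go fetch the missing metadata,
    # which is exactly the burden that delays redistribution.
    raise LookupError("attribution metadata must be looked up")

sources = [
    {"name": "Dataset A", "url": "http://example.org/a"},
    {"name": "Dataset B"},  # name on hand, but no link
]
print("; ".join(attribution_line(s) for s in sources))

# If a license demands linking and you hold only names (or vice versa),
# each of Wilbanks's 40,000 datasets becomes one more lookup.
```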


June 2, 2011

OCLC to release 1 million book records

At the LODLAM conference, Roy Tennant said that OCLC will be releasing the bibliographic info about the top million most popular books. It will be released in a linked data format, under an Open Database license. This is a very useful move, although we still need the details of that license. We can hope that it does not require attribution and does not come with any further restrictions. But Roy was talking in the course of a timed two-minute talk, so he didn’t have a lot of time for details.

This is at least a good step and maybe more than that.


Schema.org

Bing, Google, and Yahoo have announced schema.org, where you can find markup to embed in your HTML that will help those search engines figure out whether you’re talking about a movie, a person, a recipe, etc. The markup seems quite simple. More important, by using it you make your page more likely to be returned when someone is looking for what your page talks about.
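Here, for instance, is roughly what schema.org’s microdata markup looks like for a movie. The itemscope/itemtype/itemprop attributes are standard HTML microdata, and Movie, Person, and their properties are real schema.org types; I’ve wrapped the snippet in a bit of Python purely for display:

```python
# An illustrative schema.org microdata snippet, held in a Python
# string for display. The markup itself is ordinary HTML.
snippet = """\
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span>
  </div>
  <span itemprop="genre">Science fiction</span>
</div>
"""
print(snippet)
```

A crawler that understands the vocabulary can then treat the page as being about a movie, rather than as a bag of words.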

Having the Big Three search engines dictating the metadata form is likely to be a successful move. SEO is a powerful motivator.


May 27, 2011

A Declaration of Metadata Openness

Discovery, the metadata ecology for UK education and research, invites stakeholders to join us in adopting a set of principles to enhance the impact of our knowledge resources for the furtherance of scholarship and innovation…

What follows is a set of principles that are hard to disagree with.


What foxes eat, with a twist ending

Having seen a fox crossing Comm Ave in Boston yesterday…


[embedded Google Map]

…I googled to find out what they eat. I went to a helpful article in Time magazine, and discovered that foxes are 44% rabbit.

But the last sentence in the article gave me a WTF moment.

After I realized what was going on, it seemed to me that the article provokes several topics for discussion: The Demeaning Power of Condescension. The Ease with which an Entire Culture can be Trivialized. And, Please, People, Make Your Metadata More Obvious!


May 17, 2011

[dpla] Europeana

About fifteen of us are meeting with Europeana at their headquarters in The Hague.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Harry Verwayen (business development director) gives us some background. Europeana started in 2005, in the wake of Google’s digitization of books. In 2008, the program itself began. It is a catalog that collects metadata, a small image, and a pointer. By 2011, they had 18,745,000 objects from thousands of partner institutions. It has been about getting knowledge into one place (an ambition as old as Giovanni Pico della Mirandola). They believe all metadata should be widely and freely available for all use, and all public domain material should be freely available for all.

What are the value propositions for its constituencies? For end users, it’s a trusted source. For providers, it’s visibility; there is tension here because the providers want to measure visibility by hits on the portal, but Europeana wants to make the material available anywhere through linked open data. For policy makers, it’s inclusion. For the market, it’s growth. Four functions:

1. Aggregate: Council of content providers and aggregators. They want to always get more and better content. And they want to improve data quality.

2. Facilitate: Share knowledge. Strengthen advocacy of openness. Foster R&D.

3. Distribute: Making it available. Develop partnerships.

4. Engage: Virtual exhibits, social media; e.g., collect user-generated content about WWI.

From all knowledge in one place, to all knowledge everywhere.

Q: If you were starting out now, would you go down the same path?
A: It’s important to have a clear focus. E.g., the politicians who provide the funding like to have a single portal page, but don’t focus on that. You need to have one, but 80% of our visitors come in from Google. The chances that users will go to DPLA via the portal are small. You need the portal, but it shouldn’t be the focus of your efforts.

Q: What is your differentiator?
A: Secure material from institutions, and openness.

Q: What are your use cases?
A: It’s the only place you can search across libraries and museums. We have been aggregating content. Things are now available without having to search thousands of sites.

Q: Next stage?
A: We’re flipping from supply to demand side. Make it available openly to see what other people can do with it. Right now the API is open to partners, but we plan on opening it up.

Q: How many users?
A: About 5M portal and API visitors last year.

Q: Your team?
A: The main office has a budget of 5M euros and 40 people [I think].

Q: What’s your brand?
A: You come here if you want to do some research on Modigliani and want to find materials across museums and libraries. It’s digitized cultural heritage. But that’s widely defined. We have archives of press photography, a large collection of advertising posters, etc. But we’re not about providing access to popular works, e.g., recent novels.

Q: Any partners see a pickup in traffic since joining Europeana?
A: Yes. Not earth-shaking but noticeable.

Q: What’s the biggest criticism?
A: Some partners feel that we’re pushing them into openness.

Q: What level of services? Just a catalog, or also, e.g., create-your-own viewers?
A: First, be a good catalog. Over the next five years, we’ll develop more. We do provide a search engine that you can use on your Web site.

Jan Molendijk talks about the tech/ops side. He says people see Europeana in many different ways: web portal, search engine, metadata repository, network organization, and “great fun.” The participating organizations love to work with Europeana.

The tech challenges: There are four domains (libraries, archives, museums, audiovisual), each with its own metadata standards. 26 languages. Distributed development. The metadata arrives in its original languages, and there’s too much of it to crowd-source translation. Also, there’s a difference between metadata search and full-text search, of course. We represent metadata as traditional docs and index them. The metadata fields allow greater precision. But full-text search engines expect docs to have thousands of words, while these metadata docs have dozens of words; the fewer words, the less well the search engines work; e.g., a short doc has fewer matches and scores lower on relevancy. Also, with a small team, much of the work gets farmed out.
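[To illustrate Jan’s point, here’s a toy sketch of my own, not Europeana’s code: a naive term-overlap scorer shows how a record with only a dozen words caps out at a low score no matter how relevant it is.]

```python
# Toy illustration (not Europeana's code): naive term-overlap
# relevance scoring over a short metadata record.
import re

def overlap_score(query, doc):
    """Count how many query terms appear in the document."""
    terms = re.findall(r"\w+", query.lower())
    words = set(re.findall(r"\w+", doc.lower()))
    return sum(t in words for t in terms)

# A metadata record has only dozens of words at best...
metadata_doc = "Modigliani, Amedeo. Portrait of a woman. Oil on canvas, 1917."
query = "modigliani portrait oil painting paris"

# ...so even a perfectly relevant record matches few terms (3 of 5 here),
# while a thousands-of-words full-text doc has many more chances to match.
print(overlap_score(query, metadata_doc))  # -> 3
```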

15% of the material is in French, 14% in German, 11% in English. The most-viewed objects account for less than 0.1% of all views; most objects get viewed once a year or less, so the distribution curve starts low and flattens slowly. A highly viewed object is viewed perhaps 1,500 times in a month, and that’s usually tied to a promotion.

Q: What type of group structures do you have? You could translate at that level and the rest would inherit.
A: We are not going to translate at the item level.

Q: Collection models?
A: Originally there was not even nesting. Now we use EDM [the Europeana Data Model], which can arbitrarily connect pieces, as extensions, but we’re not doing that yet.

Europeana is designed to be scalable and robust. All layers can be executed on separate machines, and on multiple machines. They have four portal servers, two Solr servers, and two image servers. Solr is good at indexing and pointing to an object, but not good at fetching the object itself.

They don’t host it.

They use stateless protocols and very long URLs.

Data providers give them the original metadata plus a mapping file. They map to EDM. They have a staff of three that handles the data ingestion. The processes have to be lightweight and automated, but 40-50% of development time still goes to metadata input: ingestion, enrichment, harvesting.

They publish through their portal, linked open data, OAI-PMH, APIs, widgets, and apps.

Annette Friberg talks about aggregation projects. Europeana is pan-European and across domains. Europeana would like to work with all the content providers, but there are only 40 people on staff, so they instead work with a relatively small number of aggregators. Those represent thousands and thousands of content providers. They have a Council of Content Providers and Aggregators.

Q: What should we avoid?
A: The largest challenge is the role of the content providers.

Q: Does clicking on a listing always take you out to the owner’s site?
A: Yes, almost always. And that’s a problem for providing a consistent user experience.

Valentina talks about the ingestion inflow [link]. If you want to provide content, you can go to a form that asks some basic questions about copyright, topic, and a link to the object. It’s reviewed by the staff; they reject content that is not from a trustworthy source. Then you get a technical questionnaire: the quantity and type of materials, the format of the metadata, the frequency of updates, etc. They harvest metadata in the ESE format (Europeana Semantic Elements), using OAI-PMH. They enrich it with some data, do some quality checking, and upload it. They also cache the thumbnail. At the moment they are not doing incremental harvesting, so an update requires reimporting the entire collection, but they’re working on that.
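[For the curious, here is a minimal sketch of what OAI-PMH harvesting like this looks like. My code, not Europeana’s: the endpoint URL is a placeholder, and using “ese” as the metadataPrefix is my assumption.]

```python
# Minimal OAI-PMH harvesting sketch (not Europeana's code).
# ENDPOINT is a placeholder; "ese" as metadataPrefix is an assumption.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "http://example.org/oai"

def harvest(endpoint, metadata_prefix="ese"):
    """Yield every <record>, following resumptionTokens page by page."""
    url = f"{endpoint}?verb=ListRecords&metadataPrefix={metadata_prefix}"
    while url:
        with urllib.request.urlopen(url) as resp:
            tree = ET.parse(resp)
        for record in tree.iter(f"{OAI}record"):
            yield record
        # The protocol returns a resumptionToken when more pages remain.
        token = tree.find(f".//{OAI}resumptionToken")
        url = (f"{endpoint}?verb=ListRecords&resumptionToken={token.text}"
               if token is not None and token.text else None)

for rec in harvest(ENDPOINT):
    identifier = rec.find(f"{OAI}header/{OAI}identifier")
    print(identifier.text if identifier is not None else "(no identifier)")
```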

They have started requiring donors to fill in a few fields of basic metadata, including the work’s title and a link to an image to be thumbnailed. But it’s still very minimal, in order to keep the hurdle low.

Q: [me] In the US, it would be flooded with bogus institutions eager to have their work displayed: porn, racist and extremist groups, etc.
A: We check to see if it’s legit. Is it a member of professional orgs? What do their peers say? We make a decision.


April 25, 2011

The touch of metadata

Here’s a surprisingly touching video from Jon Voss, touting the power of metadata:

[embedded video]

I say “surprisingly touching” because it is about metadata, after all. Linked Open Data, to be exact. Or maybe it’s just me.


March 2, 2011

Questions from and for the Digital Public Library of America workshop

I got to attend the Digital Public Library of America‘s first workshop yesterday. It was an amazing experience that left me with the best kind of headache: Too much to think about! Too many possibilities for goodness!

Mainly because the Chatham House Rule was in effect, I tweeted instead of live-blogged; it’s hard to do a transcript-style live-blog when you’re not allowed to attribute words to people. (The tweet stream was quite lively.) Fortunately, John Palfrey, the head of the steering committee, did some high-value live-blogging, which you can find here: 1 2 3 4.

The DPLA is more of an intention than a plan. The DPLA is important because the intention is for something fundamentally liberating, the people involved have been thinking about and working on related projects for years, and the institutions carry a great deal of weight. So, if something is going to happen that requires widespread institutional support, this is the group with the best chance. The year of workshops that began yesterday aims at helping to figure out how the intention could become something real.

So, what is the intention? Something like: To bring the benefits of public libraries to every American. And there is, of course, no consensus even about a statement that broad. For example, the session opened with a discussion of public versus research libraries (with the “versus” thrown into immediate question). And Terry Fisher, at the very end of the day, suggested that the DPLA ought to stand for a principle: Knowledge should be free and universally accessible. Throughout the course of the day, many other visions and pragmatic possibilities were raised by the sixty attendees. [Note: I've just violated the Chatham House Rule by naming Terry, but I'm trusting he won't mind. Also, I very likely got his principle wrong. It's what I do.]

I came out of it invigorated and depressed at the same time. Invigorated: An amazing set of people, very significant national institutions ready to pitch in, an alignment on the value of access to the works of knowledge and culture. Depressed: The !@#$%-ing copyright laws are so draconian and, well, stupid, that it is hard to see how to take advantage of the new ways of connecting to ideas and to one another. As one well-known Internet archivist said, we know how to make works of the 19th and 21st centuries accessible, but the 20th century is pretty much lost: Anything created after 1923 will be in copyright about as long as there’s a Sun to read by, and the gigantic mass of works that are out of print but whose authors are dead or otherwise unreachable is locked away as firmly as an employee restroom at a Disney theme park.

So, here are some of the issues we discussed yesterday that came home with me. Fortunately, most are not intractable, but all are difficult to resolve, and some difficult to implement:

Should the DPLA aggregate content or be a directory? Much of the discussion yesterday focused on the DPLA as an aggregation of e-works. Maybe. But maybe it should be more of a directory. That’s the approach taken by the European online library, Europeana. But being a directory is not as glamorous or useful. And it doesn’t use the combined heft of the participating institutions to drive more favorable licensing terms or legislative changes since it itself is not doing any licensing.

Who is the user? How generic? Does the DPLA have to provide excellent tools for scholars and researchers, too? (See the next question.)

Site or ecology? At one extreme, the DPLA could be nothing but a site where you find e-content. At the other extreme, it wouldn’t even have a site but would be an API-based development platform so that others can build sites that are tuned to specific uses and users. I think the room agrees that it has to do both, although people care differently about the functions. It will have to provide a convenient way for users to find ebooks, but I hope that it will have an incredibly robust and detailed API so that someone who wants to build a community-based browse-and-talk environment for scholars of the Late 19th Century French Crueller can. And if I personally had to decide between the DPLA being a site or metadata + protocols + APIs, I’d go with the righthand disjunct in a flash.
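Purely as a thought experiment (no such API exists as I write this, and every URL, parameter, and field name below is invented), here’s the flavor of that righthand disjunct: a niche site built on nothing but API calls:

```python
# Hypothetical sketch only: the endpoint, parameters, and fields
# below are all invented for illustration.
import json
import urllib.request
from urllib.parse import urlencode

BASE = "http://api.dpla.example/v1/items"  # invented endpoint

def search(query, **filters):
    """Query a hypothetical item-search API and return parsed JSON."""
    params = urlencode({"q": query, **filters})
    with urllib.request.urlopen(f"{BASE}?{params}") as resp:
        return json.load(resp)

# A community site for (say) scholars of French pastry could be built
# on calls like this, with the DPLA never designing that site itself.
results = search("cruller", subject="pastry", after=1880)
for item in results.get("docs", []):
    print(item.get("title"), item.get("sourceURL"))
```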

Should the DPLA aim at legislative changes? My sense of the room is that while everyone would like to see copyright heavily amended, DPLA needs to have a strategy for launching while working within existing law.

Should the DPLA only provide access to materials users can access for free? That meets much of what we expect from public libraries (although many local libraries do charge a little for DVDs), but it fails Terry Fisher’s principle. (I don’t mean to imply that everyone there agreed with Terry, btw.)

What should the DPLA do to launch quickly and well? The sense of the room was that it’s important that DPLA not get stuck in committee for years, but should launch something quickly. Unfortunately, the easiest stuff to launch with is public domain works, many of which are already widely available. There were some suggestions for other sources of public domain works, such as government documents. But then the DPLA would look like a specialty library, instead of the first place people turn to when they want an e-book or other such content.

How to pay for it? There was little talk of business models yesterday, but it was a short day for a big topic. There were occasional suggestions, such as just outright buying e-books (rather than licensing them), in part to meet the library’s traditional role of preserving works as well as providing access to them.

How important is expert curation? There seemed to be a genuine divide — pretty much undiscussed, possibly because it’s a divisive topic — about the value of curation. A few people suggested quite firmly that expert curation is a core value provided by libraries: you go to the library because you know you can trust what is in it. I personally don’t see that scaling, think there are other ways of meeting the same need, and worry that the promise is itself illusory. This could turn out to be a killer issue. Who determines what gets into the DPLA (if the concept of there being an inside to the DPLA even turns out to make sense)?

Is the environment stable enough to build a DPLA? Much of the conversation during the workshop assumed that book and journal publishers are going to continue as the mediating centers of the knowledge industry. But, as with music publishers, much of the value of publishers has left the building and now lives on the Net. So, the DPLA may be structuring itself around a model that is just waiting to be disrupted. Which brings me to the final question I left wondering about:

How disruptive should the DPLA be? No one’s suggesting that the DPLA be a rootin’ tootin’ bay of pirates, ripping works out of the hands of copyright holders and setting them free, all while singing ribald sea shanties. But how disruptive can it be? On the one hand, the DPLA could be a portal to e-works that are safely out of copyright or licensed. That would be useful. But, if the DPLA were to take Terry’s principle as its mission — knowledge ought to be free and universally accessible — the DPLA would worry less about whether it’s doing online what libraries do offline, and would instead start from scratch asking: Given the astounding set of people and institutions assembled around this opportunity, what can we do together to make knowledge as free and universally accessible as possible? Maybe a library is not the best transformative model.

Of course, given the greed-based, anti-knowledge, culture-killing copyright laws, the fact may be that the DPLA simply cannot be very disruptive. Which brings me right back to my depression. And yet, exhilaration.

Go figure.

The DPLA wiki is here.


Lists of lists of lists

Here’s Wikipedia’s List of lists of lists. (via Jimmy Wales)

Can’t we go just one more level deep? Ask yourself: What would Christopher Nolan do?

