Joho the Blog » metadata

March 6, 2014

Dan Cohen on the DPLA’s cloud proposal to the FCC

I’ve posted a podcast interview with Dan Cohen, the executive director of the Digital Public Library of America, about their proposal to the FCC.

The FCC is looking for ways to modernize the E-Rate program that has brought the Internet to libraries and schools. The DPLA is proposing DPLA Local, which would enable libraries to create online digital collections using the DPLA’s platform.

I’m excited about this for two reasons beyond the service it would provide.

First, it could be a first step toward providing cloud-based library services, instead of the proprietary, closed, expensive systems libraries typically use to manage their data. (Evergreen, I’m not talking about you, you open source scamp!)

Second, as libraries build their collections using DPLA Local, their metadata is likely to assume normalized forms, which means that we should get cross-collection discovery and semantic riches.

Here’s the proposal itself. And here’s where you can comment to the FCC about it.


December 24, 2013

Schema.org…now for datasets!

I had a chance to talk with Dan Brickley today, a semanticizer of the Web whom I greatly admire. He’s often referred to as a co-creator of FOAF, but these days he’s at Google working on Schema.org. He pointed me to the work Schema has been doing with online datasets, which I hadn’t been aware of. Very interesting.

Schema.org, as you probably know, provides a set of terms you can hide inside the HTML of your page that annotate what the visible contents are about. The major search engines — Google, Bing, Yahoo, Yandex — notice this markup and use it to provide more precise search results, and also to display results in ways that present the information more usefully. For example, if a recipe on a page is marked up with Schema.org terms, the search engine can identify the list of ingredients and let you search on them (“Please find all recipes that use butter but not garlic”) and display them in a more readable way. And of course it’s not just the search engines that can do this; any app that is looking at the HTML of a page can also read the Schema markup. There are Schema.org schemas for an ever-expanding list of types of information…and now datasets.

If you go to Schema.org/Dataset and scroll to the bottom where it says “Properties from Dataset,” you’ll see the terms you can insert into a page that talk specifically about the dataset referenced. It’s quite simple at this point, which is an advantage of Schema.org overall. But you can see some of the power of even this minimal set of terms over at Google’s experimental Schema Labs page where there are two examples.

The first example (click on the “view” button) does a specialized Google search looking for pages that have been marked up with Schema’s Dataset terms. In the search box, try “parking,” or perhaps “military.” Clicking on a return takes you to the original page that provides access to the dataset.

The second demo lets you search for databases related to education via the work done by LRMI (Learning Resource Metadata Initiative); the LRMI work has been accepted (except for the term useRightsUrl) as part of Schema.org. Click on the “view” button and you’ll be taken to a page with a search box, and a menu that lets you search the entire Web or a curated list. Choose “entire Web” and type in a search term such as “calculus.”
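
For a sense of what the markup looks like in a page, here is an invented sketch of my own, not taken from either demo; the property names are meant to follow Schema.org/Dataset and DataDownload, but check the schema itself for the current list:

<div itemscope itemtype="http://schema.org/Dataset">
  <!-- ordinary visible text, annotated so machines know it describes a dataset -->
  <span itemprop="name">City parking-meter transactions, 2012</span>
  <span itemprop="description">Hourly meter transactions, broken out by neighborhood.</span>
  <div itemprop="distribution" itemscope itemtype="http://schema.org/DataDownload">
    <a itemprop="contentUrl" href="http://example.org/parking-2012.csv">Download the CSV</a>
  </div>
</div>

To a reader the page just shows a name, a blurb, and a download link; the attributes are what let a search engine recognize it as a dataset.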

This is such a nice extension of Schema.org. Schema was designed initially to let computers parse information on human-readable pages (“Aha! ‘Butter’ on this page is being used as a recipe ingredient and on that page as a movie title”), but now it can be used to enable computers to pull together human-readable lists of available datasets.

I continue to be a fan of Schema because of its simplicity and pragmatism, and, because the major search engines look for Schema markup, people have a compelling reason to add markup to their pages. Obviously Schema is far from the only metadata scheme we need, nor does it pretend to be. But for fans of loose, messy, imperfect projects that actually get stuff done, Schema is a real step forward that keeps taking more steps forward.


December 22, 2013

The Bogotá Manhattan recipe + markup

Here’s a recipe for a Manhattan cocktail that I like. The idea of adding Kahlua came from a bartender in Philadelphia. I call it a Bogotá Manhattan because of the coffee.

You can’t tell by looking at this post that it’s marked up with Schema.org codes, unless you View Source. These codes let the search engines (and any other computer program that cares to look) recognize the meaning of the various elements. For example, the line “a splash of Kahlua” actually reads:

<span itemprop="ingredients">a splash of Kahlua</span>

"itemprop=ingredients" says that the visible content is an ingredient. This does not help you as a reader at all, but it means that a search engine can confidently include this recipe when someone searches for recipes that contain Kahlua. Markup makes the Web smarter, and Schema.org is a lightweight, practical way of adding markup, with the huge incentive that the major search engines recognize Schema.

So, here goes:

Bogotá Manhattan

A variation on the classic Manhattan — a bit less bitter, and a bit more complex.

Prep Time: 3 minutes
Yield: 1 drink

Ingredients:

  • 1 shot bourbon

  • 1 shot sweet Vermouth

  • A few shakes of Angostura bitters

  • A splash of Kahlua

  • A smaller splash of grenadine or maraschino cherry juice

  • 1 maraschino cherry and/or small slice of orange as garnish. Delicious garnish.

Instructions:

Shake together with ice. Strain and serve in a martini glass, or (my preference) violate all norms by serving in a small glass with ice.

Here’s the Schema.org markup for recipes.
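
To give a feel for it, here is a rough sketch of how the recipe above might be marked up using a few of those Recipe properties. This is my reconstruction for illustration, not necessarily the exact markup hidden in this post:

<div itemscope itemtype="http://schema.org/Recipe">
  <span itemprop="name">Bogotá Manhattan</span>
  <span itemprop="description">A variation on the classic Manhattan: a bit less bitter, and a bit more complex.</span>
  <!-- PT3M is an ISO 8601 duration: 3 minutes -->
  <meta itemprop="prepTime" content="PT3M">
  <span>Prep Time: 3 minutes</span>
  <span itemprop="recipeYield">1 drink</span>
  <ul>
    <li itemprop="ingredients">1 shot bourbon</li>
    <li itemprop="ingredients">1 shot sweet Vermouth</li>
    <li itemprop="ingredients">a splash of Kahlua</li>
  </ul>
  <span itemprop="recipeInstructions">Shake together with ice. Strain and serve in a martini glass.</span>
</div>

The visible text stays exactly what you see above; only the attributes change what a search engine can do with it.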


August 4, 2013

Paradata

Hanan Cohen points me to a blog post by an MLIS student at Haifa U. named Shir, in which she discourses on the term “paradata.” Shir cites Mark Sample, who in 2011 posted a talk he had given at an academic conference. Mark notes the term’s original meaning:

In the social sciences, paradata refers to data about the data collection process itself—say the date or time of a survey, or other information about how a survey was conducted.

Mark intends to give it another meaning, without claiming to have worked it out fully:

…paradata is metadata at a threshold, or paraphrasing Genette, data that exists in a zone between metadata and not metadata. At the same time, in many cases it’s data that’s so flawed, so imperfect that it actually tells us more than compliant, well-structured metadata does.

His example is We Feel Fine, a collection of tens of thousands (or more … I can’t open the site because Amtrak blocks access to what it intuits might be intensive multimedia) of sentences that begin “I feel” from many, many blogs. We Feel Fine then displays the stats in interesting visualizations. Mark writes:

…clicking the Age visualizations tells us that 1,223 (of the most recent 1,500) feelings have no age information attached to them. Similarly, the Location visualization draws attention to the large number of blog posts that lack any metadata regarding their location.

Unlike many other massive datamining projects, say, Google’s Ngram Viewer, We Feel Fine turns its missing metadata into a new source of information. In a kind of playful return of the repressed, the missing metadata is colorfully highlighted—it becomes paradata. The null set finds representation in We Feel Fine.

So, that’s one sense of paradata. But later Mark makes it clear (I think) that We Feel Fine presents paradata in a broader sense: it is sloppy in its data collection. It strips out HTML formatting, which can contain information about the intensity or quality of the statements of feeling the project records. It’s lazy in deciding which images from a target site it captures as relevant to the statement of feeling. Yet, Mark finds great value in We Feel Fine.

His first example, where the null set is itself metadata, seems unquestionably useful. It applies to any unbounded data set. For example, that no one chose answer A on a multiple choice test is not paradata, just as the fact that no one has checked out a particular item from a library is not paradata. But that no one used the word “maybe” in an essay test is paradata, as would be the fact that no one has checked out books in Aramaic and Klingon in one bundle. Getting a zero in a metadata category is not paradata; getting a null in a category that had not been anticipated is paradata. Paradata should therefore include which metadata categories are missing from a schema. E.g., that Dublin Core does not have a field devoted to reincarnation says something about the fact that it was not developed by Tibetans.

But I don’t think that’s at the heart of what Mark means by paradata. Rather, the appearance of the null set is just one benefit of considering paradata. Indeed, I think I’d call this “implicit metadata” or “derived metadata,” not “paradata.”

The fuller sense of paradata Mark suggests — “data that exists in a zone between metadata and not metadata” — is both useful and, as he cheerfully acknowledges, “a big mess.” It immediately raises questions about the differences between paradata and pseudodata: if We Feel Fine were being sloppy without intending to be, and if it were presenting its “findings” as rigorously refined data at, say, the biennial meeting of the Society for Textual Analysis, I don’t think Mark would be happy to call it paradata.

Mark concludes his talk by pointing at four positive characteristics of the We Feel Fine site: it’s inviting, paradata, open, and juicy. (“Juicy” means that there’s lots going on and lots to engage you.) It seems to me that the site’s only an example of paradata because of the other three. If it were a jargon-filled, pompous site making claims to academic rigor, the paradata would be pseudodata.

This isn’t an objection or a criticism. In fact, it’s the opposite. Mark’s post, which is based on a talk that he gave at the Society for Textual Analysis, is a plea for research that is inviting, open, juicy, and willing to acknowledge that its ideas are unfinished. Mark’s post is, of course, paradata.


June 22, 2013

What I learned at LODLAM

On Wednesday and Thursday I went to the second LODLAM (linked open data for libraries, archives, and museums) unconference, in Montreal. I’d attended the first one in San Francisco two years ago, and this one was almost as exciting — “almost” because the first one had more of a new car smell to it. This is a sign of progress and is by no means a complaint. It’s a great conference.

But, because it was an unconference with up to eight simultaneous sessions, there was no possibility of any single human being getting a full overview. Instead, here are some overall impressions based upon my particular path through the event.

  • Serious progress is being made. E.g., Cornell announced it will be switching to a full LOD library implementation in the Fall. There are lots of great projects and initiatives already underway.

  • Some very competent tools have been developed for converting to LOD and for managing LOD implementations. The development of tools is obviously crucial.

  • There isn’t obvious agreement about the standard ways of doing most things. There’s innovation, re-invention, and lots of lively discussion.

  • Some of the most interesting and controversial discussions were about whether libraries are being too library-centric and not web-centric enough. I find this hugely complex and don’t pretend to understand all the issues. (Also, I find myself — perhaps unreasonably — flashing back to the Standards Wars in the late 1980s.) Anyway, the argument crystallized to some degree around BIBFRAME, the Library of Congress’ initiative to replace and surpass MARC. The criticism raised in a couple of sessions was that Bibframe (I find the all caps to be too shouty) represents how libraries think about data, and not how the Web thinks, so that if Bibframe gets the bib data right for libraries, Web apps may have trouble making sense of it. For example, Bibframe is creating its own vocabulary for talking about properties that other Web standards already have names for. The argument is that if you want Bibframe to make bib data widely available, it should use those other vocabularies (or, more precisely, namespaces). (There’s a toy sketch of what mixing vocabularies looks like just after this list.) Kevin Ford, who leads the Bibframe initiative, responds that you can always map other vocabs onto Bibframe’s, and while Richard Wallis of OCLC is enthusiastic about the very webby Schema.org vocabulary for bib data, he believes that Bibframe definitely has a place in the ecosystem. Corey Harper and Debra Riley-Huff, on the other hand, gave strong voice to the cultural differences. (If you want to delve into the mapping question, explore the argument about whether Bibframe’s annotation framework maps to Open Annotation.)

  • I should add that although there were some strong disagreements about this at LODLAM, the participants seem to be genuinely respectful.

  • LOD remains really really hard. It is not a natural way of thinking about things. Of course, neither are old-fashioned database schemas, but schemas map better to a familiar forms-based view of the world: you fill in a form and you get a record. Linked data doesn’t even think in terms of records. Even with the new generation of tools, linked data is hard.

  • LOD is the future for library, archive, and museum data.
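
Here is the toy sketch promised above: the same bibliographic facts named in two vocabularies at once. It uses RDFa with placeholder namespace URIs, not the actual Bibframe vocabulary, so treat it as an illustration of the namespace argument rather than real Bibframe markup:

<div prefix="lib: http://example.org/library-vocab/ schema: http://schema.org/">
  <div typeof="schema:Book" resource="#work1">
    <!-- a generic Web app that understands schema.org can read schema:name and schema:author;
         the lib: terms mean nothing to it unless someone publishes a mapping -->
    <span property="lib:workTitle schema:name">Moby-Dick</span>
    <span property="lib:responsibleAgent schema:author">Herman Melville</span>
  </div>
</div>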


Here’s a list of brief video interviews I did at LODLAM:

June 20, 2013

[lodlam] Richard Wallis on Schema.org

Richard Wallis [twitter: rjw] of OCLC explains the appeal of Schema.org for libraries, and its place in the ecosystem.


February 17, 2013

DPLA does metadata right

The Digital Public Library of America’s policy on metadata was discussed during the recent board of directors call, and the DPLA is, in my opinion, getting it exactly and admirably right. (See Infodocket for links.) The metadata that the DPLA aggregates will be openly available and in the public domain. But just so there won’t be any doubt or confusion, the policy begins by saying that it does not believe that most metadata is subject to copyright in the first place. Then, to make sure, it adds:

To the extent that the DPLA’s own contributions to selecting and arranging such metadata may be protected by copyright, the DPLA dedicates such contributions to the public domain pursuant to a CC0 license.

And then, clearly and plainly:

Given the purposes of the policy and the copyright status of the metadata, and pursuant to the DPLA’s terms of service, the DPLA’s users are free to harvest, collect, modify, and/or otherwise use any metadata contained in the DPLA.

Nice!


December 18, 2012

[misc] I bet your ontology never thought of this one!

Paul Deschner and I had a fascinating conversation yesterday with Jeffrey Wallman, head of the Tibetan Buddhist Resource Center, about perhaps getting his group’s metadata to interoperate with the library metadata we’ve been gathering. The TBRC has a fantastic collection of Tibetan books. So we were talking about the schemas we use — a schema being the set of slots you create for the data you capture. For example, if you’re gathering information about books, you’d have a schema that has slots for title, author, date, publisher, etc. Depending on your needs, you might also include slots for whether there are color illustrations, whether the original cover is still on it, and whether anyone has underlined any passages. It turns out that the Tibetan concept of a book is quite a bit different than the West’s, which raises interesting questions about how to capture and express that data in ways that can be usefully mashed up.


But it was when we moved on to talking about our author schemas that Jeffrey listed one type of metadata that I would never, ever have thought to include in a schema: reincarnation. It is important for Tibetans to know that Author A is a reincarnation of Author B. And I can see why that would be a crucial bit of information.
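
Purely as an illustration (the vocabulary below is invented; it is not the TBRC’s schema or ours), an author record with that slot might look something like this in RDFa:

<div prefix="ex: http://example.org/terms/" typeof="ex:Author" resource="#author-a">
  <span property="ex:name">Author A</span>
  <!-- the slot a Western-built author schema would never think to include -->
  <a property="ex:reincarnationOf" href="#author-b">Author B</a>
</div>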


So, let this be a lesson: attempts to anticipate all metadata needs are destined to be surprised, sometimes delightfully.


July 3, 2012

[2b2k] The inevitable messiness of digital metadata

This is cross-posted at the Harvard Digital Scholarship blog.

Neil Jeffries, research and development manager at the Bodleian Libraries, has posted an excellent op-ed at Wikipedia Signpost about how to best represent scholarly knowledge in an imperfect world.

He sets out two basic assumptions: (1) Data has meaning only within context; (2) We are not going to agree on a single metadata standard. In fact, we could connect those two points: Contexts of meaning are so dependent on the discipline and the user's project and standpoint that it is unlikely that a single metadata standard could suffice. In any case, the proliferation of standards is simply a fact of life at this point.

Given those constraints, he asks, what's the best way to increase the interoperability of the knowledge and data that are accumulating online at a pace that provokes extremes of anxiety and joy in equal measure? He sees a useful consensus emerging on three points: (a) There are some common and basic types of data across almost all aggregations. (b) There is increasing agreement that these data types have some simple, common properties that suffice to identify them and to give us humans an idea about whether we want to delve deeper. (c) Aggregations themselves are useful for organizing data, even when they are loose webs rather than tight hierarchies.

Neil then proposes RDF and linked data as appropriate ways to capture the very important relationships among ideas, pointing to the Semantic MediaWiki as a model. But, he says, we need to capture additional metadata that qualifies the data, including who made the assertion, links to differences of scholarly opinion, omissions from the collection, and the quality of the evidence. "Rather than always aiming for objective statements of truth we need to realise that a large amount of knowledge is derived via inference from a limited and imperfect evidence base, especially in the humanities," he says. "Thus we should aim to accurately represent  the state of knowledge about a topic, including omissions, uncertainty and differences of opinion."
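
As a very rough sketch of the kind of qualification Neil is after (the vocabulary is invented for illustration; it is not his proposal or Semantic MediaWiki's markup), a single qualified assertion might look like:

<div prefix="ex: http://example.org/terms/" typeof="ex:Assertion" resource="#claim1">
  <span property="ex:statement">The manuscript was produced in the late 12th century.</span>
  <span property="ex:assertedBy">A. Scholar</span>
  <!-- the qualifiers travel with the assertion rather than being lost in a footnote -->
  <span property="ex:basis">inferred from marginalia; the dating is contested</span>
</div>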

Neil's proposals have the strengths of acknowledging the imperfection of any attempt to represent knowledge, and of recognizing that the value of representing knowledge lies mainly in its getting linked to its sources, its context, its controversies, and to other disciplines. It seems to me that such a system would not only have tremendous pragmatic advantages; for all its messiness and lack of coherence, it would also be a more accurate representation of knowledge than a system that is fully neatened up and nailed down. That is, messiness is not only the price we pay for scaling knowledge aggressively and collaboratively, it is a property of networked knowledge itself.

 


June 6, 2012

1,000 downloads

I learned yesterday from Robin Wendler (who worked mightily on the project) that Harvard’s library catalog dataset of 12.3M records has been bulk downloaded a thousand times, excluding the Web spiderings. That seems like an awful lot to me, and makes me happy.

The library catalog dataset comprises bibliographic records of almost all of Harvard Library’s gigantic collection. It’s available under a CC0 public domain license for bulk download, and can be accessed through an API via the DPLA’s prototype platform. More info here.

