Joho the Blog » metadata

March 6, 2014

Dan Cohen on the DPLA’s cloud proposal to the FCC

I’ve posted a podcast interview with Dan Cohen, the executive director of the Digital Public Library of America, about their proposal to the FCC.

The FCC is looking for ways to modernize the E-Rate program that has brought the Internet to libraries and schools. The DPLA is proposing DPLA Local, which would enable libraries to create online digital collections using the DPLA’s platform.

I’m excited about this for two reasons beyond the service it would provide.

First, it could be a first step toward providing cloud-based library services, instead of the proprietary, closed, expensive systems libraries typically use to manage their data. (Evergreen, I’m not talking about you, you open source scamp!)

Second, as libraries build their collections using DPLA Local, their metadata is likely to assume normalized forms, which means that we should get cross-collection discovery and semantic riches.

Here’s the proposal itself. And here’s where you can comment to the FCC about it.


December 24, 2013

Schema.org…now for datasets!

I had a chance to talk with Dan Brickley today, a semanticizer of the Web whom I greatly admire. He’s often referred to as a co-creator of FOAF, but these days he’s at Google working on Schema.org. He pointed me to the work Schema has been doing with online datasets, which I hadn’t been aware of. Very interesting.

Schema.org, as you probably know, provides a set of terms you can hide inside the HTML of your page that annotate what the visible contents are about. The major search engines — Google, Bing, Yahoo, Yandex — notice this markup and use it to provide more precise search results, and also to display results in ways that present the information more usefully. For example, if a recipe on a page is marked up with Schema.org terms, the search engine can identify the list of ingredients and let you search on them (“Please find all recipes that use butter but not garlic”) and display them in a more readable way. And of course it’s not just the search engines that can do this; any app that is looking at the HTML of a page can also read the Schema markup. There are Schema.org schemas for an ever-expanding list of types of information…and now datasets.

If you go to Schema.org/Dataset and scroll to the bottom where it says “Properties from Dataset,” you’ll see the terms you can insert into a page that talk specifically about the dataset referenced. It’s quite simple at this point, which is an advantage of Schema.org overall. But you can see some of the power of even this minimal set of terms over at Google’s experimental Schema Labs page where there are two examples.

The first example (click on the “view” button) does a specialized Google search looking for pages that have been marked up with Schema’s Dataset terms. In the search box, try “parking,” or perhaps “military.” Clicking on a return takes you to the original page that provides access to the dataset.

The second demo lets you search for databases related to education via the work done by LRMI (Learning Resource Metadata Initiative); the LRMI work has been accepted (except for the term useRightsUrl) as part of Schema.org. Click on the “view” button and you’ll be taken to a page with a search box, and a menu that lets you search the entire Web or a curated list. Choose “entire Web” and type in a search term such as “calculus.”
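
For a sense of what the markup looks like in a page, here is an invented sketch of my own, not taken from either demo; the property names are meant to follow Schema.org/Dataset and DataDownload, but check the schema itself for the current list:

<div itemscope itemtype="http://schema.org/Dataset">
  <!-- ordinary visible text, annotated so machines know it describes a dataset -->
  <span itemprop="name">City parking-meter transactions, 2012</span>
  <span itemprop="description">Hourly meter transactions, broken out by neighborhood.</span>
  <div itemprop="distribution" itemscope itemtype="http://schema.org/DataDownload">
    <a itemprop="contentUrl" href="http://example.org/parking-2012.csv">Download the CSV</a>
  </div>
</div>

To a reader the page just shows a name, a blurb, and a download link; the attributes are what let a search engine recognize it as a dataset.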

This is such a nice extension of Schema.org. Schema was designed initially to let computers parse information on human-readable pages (“Aha! ‘Butter’ on this page is being used as a recipe ingredient and on that page as a movie title”), but now it can be used to enable computers to pull together human-readable lists of available datasets.

I continue to be a fan of Schema because of its simplicity and pragmatism, and, because the major search engines look for Schema markup, people have a compelling reason to add markup to their pages. Obviously Schema is far from the only metadata scheme we need, nor does it pretend to be. But for fans of loose, messy, imperfect projects that actually get stuff done, Schema is a real step forward that keeps taking more steps forward.


December 22, 2013

The Bogotá Manhattan recipe + markup

Here’s a recipe for a Manhattan cocktail that I like. The idea of adding Kahlua came from a bartender in Philadelphia. I call it a Bogotá Manhattan because of the coffee.

You can’t tell by looking at this post that it’s marked up with Schema.org codes, unless you View Source. These codes let the search engines (and any other computer program that cares to look) recognize the meaning of the various elements. For example, the line “a splash of Kahlua” actually reads:

<span itemprop="ingredients">a splash of Kahlua</span>

"itemprop=ingredients" says that the visible content is an ingredient. This does not help you as a reader at all, but it means that a search engine can confidently include this recipe when someone searches for recipes that contain Kahlua. Markup makes the Web smarter, and Schema.org is a lightweight, practical way of adding markup, with the huge incentive that the major search engines recognize Schema.

So, here goes:

Bogotá Manhattan

A variation on the classic Manhattan — a bit less bitter, and a bit more complex.

Prep Time: 3 minutes
Yield: 1 drink

Ingredients:

  • 1 shot bourbon

  • 1 shot sweet Vermouth

  • A few shakes of Angostura bitters

  • A splash of Kahlua

  • A smaller splash of grenadine or maraschino cherry juice

  • 1 maraschino cherry and/or small slice of orange as garnish. Delicious garnish.

Instructions:

Shake together with ice. Strain and serve in a martini glass, or (my preference) violate all norms by serving in a small glass with ice.

Here’s the Schema.org markup for recipes.
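
To give a feel for it, here is a rough sketch of how the recipe above might be marked up using a few of those Recipe properties. This is my reconstruction for illustration, not necessarily the exact markup hidden in this post:

<div itemscope itemtype="http://schema.org/Recipe">
  <span itemprop="name">Bogotá Manhattan</span>
  <span itemprop="description">A variation on the classic Manhattan: a bit less bitter, and a bit more complex.</span>
  <!-- PT3M is an ISO 8601 duration: 3 minutes -->
  <meta itemprop="prepTime" content="PT3M">
  <span>Prep Time: 3 minutes</span>
  <span itemprop="recipeYield">1 drink</span>
  <ul>
    <li itemprop="ingredients">1 shot bourbon</li>
    <li itemprop="ingredients">1 shot sweet Vermouth</li>
    <li itemprop="ingredients">a splash of Kahlua</li>
  </ul>
  <span itemprop="recipeInstructions">Shake together with ice. Strain and serve in a martini glass.</span>
</div>

The visible text stays exactly what you see above; only the attributes change what a search engine can do with it.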


August 4, 2013

Paradata

Hanan Cohen points me to a blog post by an MLIS student at Haifa U. named Shir, in which she discourses on the term “paradata.” Shir cites Mark Sample, who in 2011 posted a talk he had given at an academic conference. Mark notes the term’s original meaning:

In the social sciences, paradata refers to data about the data collection process itself—say the date or time of a survey, or other information about how a survey was conducted.

Mark intends to give it another meaning, without claiming to have worked it out fully:

…paradata is metadata at a threshold, or paraphrasing Genette, data that exists in a zone between metadata and not metadata. At the same time, in many cases it’s data that’s so flawed, so imperfect that it actually tells us more than compliant, well-structured metadata does.

His example is We Feel Fine, a collection of tens of thousands (or more … I can’t open the site because Amtrak blocks access to what it intuits might be intensive multimedia) of sentences that begin “I feel” from many, many blogs. We Feel Fine then displays the stats in interesting visualizations. Mark writes:

…clicking the Age visualizations tells us that 1,223 (of the most recent 1,500) feelings have no age information attached to them. Similarly, the Location visualization draws attention to the large number of blog posts that lack any metadata regarding their location.

Unlike many other massive datamining projects, say, Google’s Ngram Viewer, We Feel Fine turns its missing metadata into a new source of information. In a kind of playful return of the repressed, the missing metadata is colorfully highlighted—it becomes paradata. The null set finds representation in We Feel Fine.

So, that’s one sense of paradata. But later Mark makes it clear (I think) that We Feel Fine presents paradata in a broader sense: it is sloppy in its data collection. It strips out HTML formatting, which can contain information about the intensity or quality of the statements of feeling the project records. It’s lazy in deciding which images from a target site it captures as relevant to the statement of feeling. Yet, Mark finds great value in We Feel Fine.

His first example, where the null set is itself metadata, seems unquestionably useful. It applies to any unbounded data set. For example, that no one chose answer A on a multiple choice test is not paradata, just as the fact that no one has checked out a particular item from a library is not paradata. But that no one used the word “maybe” in an essay test is paradata, as would be the fact that no one has checked out books in Aramaic and Klingon in one bundle. Getting a zero in a metadata category is not paradata; getting a null in a category that had not been anticipated is paradata. Paradata should therefore include which metadata categories are missing from a schema. E.g., that Dublin Core does not have a field devoted to reincarnation says something about the fact that it was not developed by Tibetans.

But I don’t think that’s at the heart of what Mark means by paradata. Rather, the appearance of the null set is just one benefit of considering paradata. Indeed, I think I’d call this “implicit metadata” or “derived metadata,” not “paradata.”

The fuller sense of paradata Mark suggests — “data that exists in a zone between metadata and not metadata” — is both useful and, as he cheerfully acknowledges, “a big mess.” It immediately raises questions about the differences between paradata and pseudodata: if We Feel Fine were being sloppy without intending to be, and if it were presenting its “findings” as rigorously refined data at, say, the biennial meeting of the Society for Textual Analysis, I don’t think Mark would be happy to call it paradata.

Mark concludes his talk by pointing at four positive characteristics of the We Feel Fine site: it’s inviting, paradata, open, and juicy. (“Juicy” means that there’s lots going on and lots to engage you.) It seems to me that the site’s only an example of paradata because of the other three. If it were a jargon-filled, pompous site making claims to academic rigor, the paradata would be pseudodata.

This isn’t an objection or a criticism. In fact, it’s the opposite. Mark’s post, which is based on a talk that he gave at the Society for Textual Analysis, is a plea for research that is inviting, open, juicy, and willing to acknowledge that its ideas are unfinished. Mark’s post is, of course, paradata.


June 22, 2013

What I learned at LODLAM

On Wednesday and Thursday I went to the second LODLAM (linked open data for libraries, archives, and museums) unconference, in Montreal. I’d attended the first one in San Francisco two years ago, and this one was almost as exciting — “almost” because the first one had more of a new car smell to it. This is a sign of progress and is by no means a complaint. It’s a great conference.

But, because it was an unconference with up to eight simultaneous sessions, there was no possibility of any single human being getting a full overview. Instead, here are some overall impressions based upon my particular path through the event.

  • Serious progress is being made. E.g., Cornell announced it will be switching to a full LOD library implementation in the Fall. There are lots of great projects and initiatives already underway.

  • Some very competent tools have been developed for converting to LOD and for managing LOD implementations. The development of tools is obviously crucial.

  • There isn’t obvious agreement about the standard ways of doing most things. There’s innovation, re-invention, and lots of lively discussion.

  • Some of the most interesting and controversial discussions were about whether libraries are being too library-centric and not web-centric enough. I find this hugely complex and don’t pretend to understand all the issues. (Also, I find myself — perhaps unreasonably — flashing back to the Standards Wars in the late 1980s.) Anyway, the argument crystallized to some degree around BIBFRAME, the Library of Congress’ initiative to replace and surpass MARC. The criticism raised in a couple of sessions was that Bibframe (I find the all caps to be too shouty) represents how libraries think about data, and not how the Web thinks, so that if Bibframe gets the bib data right for libraries, Web apps may have trouble making sense of it. For example, Bibframe is creating its own vocabulary for talking about properties that other Web standards already have names for. The argument is that if you want Bibframe to make bib data widely available, it should use those other vocabularies (or, more precisely, namespaces). (There’s a toy sketch of what mixing vocabularies looks like just after this list.) Kevin Ford, who leads the Bibframe initiative, responds that you can always map other vocabs onto Bibframe’s, and while Richard Wallis of OCLC is enthusiastic about the very webby Schema.org vocabulary for bib data, he believes that Bibframe definitely has a place in the ecosystem. Corey Harper and Debra Riley-Huff, on the other hand, gave strong voice to the cultural differences. (If you want to delve into the mapping question, explore the argument about whether Bibframe’s annotation framework maps to Open Annotation.)

  • I should add that although there were some strong disagreements about this at LODLAM, the participants seem to be genuinely respectful.

  • LOD remains really really hard. It is not a natural way of thinking about things. Of course, neither are old-fashioned database schemas, but schemas map better to a familiar forms-based view of the world: you fill in a form and you get a record. Linked data doesn’t even think in terms of records. Even with the new generation of tools, linked data is hard.

  • LOD is the future for library, archive, and museum data.
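
Here is the toy sketch promised above: the same bibliographic facts named in two vocabularies at once. It uses RDFa with placeholder namespace URIs, not the actual Bibframe vocabulary, so treat it as an illustration of the namespace argument rather than real Bibframe markup:

<div prefix="lib: http://example.org/library-vocab/ schema: http://schema.org/">
  <div typeof="schema:Book" resource="#work1">
    <!-- a generic Web app that understands schema.org can read schema:name and schema:author;
         the lib: terms mean nothing to it unless someone publishes a mapping -->
    <span property="lib:workTitle schema:name">Moby-Dick</span>
    <span property="lib:responsibleAgent schema:author">Herman Melville</span>
  </div>
</div>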


Here’s a list of brief video interviews I did at LODLAM:

June 20, 2013

[lodlam] Richard Wallis on Schema.org

Richard Wallis [twitter: rjw] of OCLC explains the appeal of Schema.org for libraries, and its place in the ecosystem.


February 17, 2013

DPLA does metadata right

The Digital Public Library of America’s policy on metadata was discussed during the recent board of directors call, and the DPLA is, in my opinion, getting it exactly and admirably right. (See Infodocket for links.) The metadata that the DPLA aggregates will be openly available and in the public domain. But just so there won’t be any doubt or confusion, the policy begins by saying that it does not believe that most metadata is subject to copyright in the first place. Then, to make sure, it adds:

To the extent that the DPLA’s own contributions to selecting and arranging such metadata may be protected by copyright, the DPLA dedicates such contributions to the public domain pursuant to a CC0 license.

And then, clearly and plainly:

Given the purposes of the policy and the copyright status of the metadata, and pursuant to the DPLA’s terms of service, the DPLA’s users are free to harvest, collect, modify, and/or otherwise use any metadata contained in the DPLA.

Nice!


December 18, 2012

[misc] I bet your ontology never thought of this one!

Paul Deschner and I had a fascinating conversation yesterday with Jeffrey Wallman, head of the Tibetan Buddhist Resource Center, about perhaps getting his group’s metadata to interoperate with the library metadata we’ve been gathering. The TBRC has a fantastic collection of Tibetan books. So we were talking about the schemas we use — a schema being the set of slots you create for the data you capture. For example, if you’re gathering information about books, you’d have a schema that has slots for title, author, date, publisher, etc. Depending on your needs, you might also include slots for whether there are color illustrations, whether the original cover is still on it, and whether anyone has underlined any passages. It turns out that the Tibetan concept of a book is quite a bit different than the West’s, which raises interesting questions about how to capture and express that data in ways that can be usefully mashed up.


But it was when we moved on to talking about our author schemas that Jeffrey listed one type of metadata that I would never, ever have thought to include in a schema: reincarnation. It is important for Tibetans to know that Author A is a reincarnation of Author B. And I can see why that would be a crucial bit of information.
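
Purely as an illustration (the vocabulary below is invented; it is not the TBRC’s schema or ours), an author record with that slot might look something like this in RDFa:

<div prefix="ex: http://example.org/terms/" typeof="ex:Author" resource="#author-a">
  <span property="ex:name">Author A</span>
  <!-- the slot a Western-built author schema would never think to include -->
  <a property="ex:reincarnationOf" href="#author-b">Author B</a>
</div>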


So, let this be a lesson: attempts to anticipate all metadata needs are destined to be surprised, sometimes delightfully.


July 3, 2012

[2b2k] The inevitable messiness of digital metadata

This is cross-posted at the Harvard Digital Scholarship blog.

Neil Jeffries, research and development manager at the Bodleian Libraries, has posted an excellent op-ed at Wikipedia Signpost about how to best represent scholarly knowledge in an imperfect world.

He sets out two basic assumptions: (1) Data has meaning only within context; (2) We are not going to agree on a single metadata standard. In fact, we could connect those two points: Contexts of meaning are so dependent on the discipline and the user's project and standpoint that it is unlikely that a single metadata standard could suffice. In any case, the proliferation of standards is simply a fact of life at this point.

Given those constraints, he asks, what's the best way to increase the interoperability of the knowledge and data that are accumulating online at a pace that provokes extremes of anxiety and joy in equal measure? He sees a useful consensus emerging on three points: (a) There are some common and basic types of data across almost all aggregations. (b) There is increasing agreement that these data types have some simple, common properties that suffice to identify them and to give us humans an idea about whether we want to delve deeper. (c) Aggregations themselves are useful for organizing data, even when they are loose webs rather than tight hierarchies.

Neil then proposes RDF and linked data as appropriate ways to capture the very important relationships among ideas, pointing to the Semantic MediaWiki as a model. But, he says, we need to capture additional metadata that qualifies the data, including who made the assertion, links to differences of scholarly opinion, omissions from the collection, and the quality of the evidence. "Rather than always aiming for objective statements of truth we need to realise that a large amount of knowledge is derived via inference from a limited and imperfect evidence base, especially in the humanities," he says. "Thus we should aim to accurately represent  the state of knowledge about a topic, including omissions, uncertainty and differences of opinion."
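
As a very rough sketch of the kind of qualification Neil is after (the vocabulary is invented for illustration; it is not his proposal or Semantic MediaWiki's markup), a single qualified assertion might look like:

<div prefix="ex: http://example.org/terms/" typeof="ex:Assertion" resource="#claim1">
  <span property="ex:statement">The manuscript was produced in the late 12th century.</span>
  <span property="ex:assertedBy">A. Scholar</span>
  <!-- the qualifiers travel with the assertion rather than being lost in a footnote -->
  <span property="ex:basis">inferred from marginalia; the dating is contested</span>
</div>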

Neil's proposals have the strengths of acknowledging the imperfection of any attempt to represent knowledge, and of recognizing that the value of representing knowledge lies mainly in its getting linked to its sources, its context, its controversies, and to other disciplines. It seems to me that such a system would not only have tremendous pragmatic advantages; for all its messiness and lack of coherence, it would also be a more accurate representation of knowledge than a system that is fully neatened up and nailed down. That is, messiness is not only the price we pay for scaling knowledge aggressively and collaboratively, it is a property of networked knowledge itself.

 


June 6, 2012

1,000 downloads

I learned yesterday from Robin Wendler (who worked mightily on the project) that Harvard’s library catalog dataset of 12.3M records has been bulk downloaded a thousand times, excluding the Web spiderings. That seems like an awful lot to me, and makes me happy.

The library catalog dataset comprises bibliographic records of almost all of Harvard Library’s gigantic collection. It’s available under a CC0 public domain license for bulk download, and can be accessed through an API via the DPLA’s prototype platform. More info here.

