Marshall Breeding gave a talk today to the Harvard Library system as part of its Discoverability Day. Marshall is an expert in discovery systems, i.e., technology that enables library users to find what they need and what they didn’t know they needed, across every medium and metadata boundary.
It’s a stupendously difficult problem, not least because the various providers of the metadata about non-catalog items — journal articles, etc. — don’t cooperate. On top of that, there’s a demand for “single searchbox solutions,” so that you can not only search everything the Googley way, but the results that come back will magically sort themselves in the order of what’s most useful to you. To bring us closer to that result, Marshall said that systems are beginning to use personal profiles and usage data. The personal profile lets the search engine know that you’re an astronomer, so that when you search for “mercury” you’re probably not looking for information about the chemical, the outboard motor company, or Queen. The usage data will let the engine sort based on what your community has voted on with its checkouts, recommendations, etc.
Marshall was careful to stipulate that using profiles or usage data will require user consent. I’m very interested in this because the Library Innovation Lab where I work has created an online library browser — StackLife — that sorts results based on a variety of measures of Harvard community usage. StackLife computes a “stackscore” based on a simple calculation of the number of checkouts by faculty, grad students or undergrads, how many copies are in Harvard’s 73 libraries, and potentially other metrics such as how often it’s put on reserve or called back early. The stackscores are based on 10-year aggregates without any personal identifiers, and with no knowledge of which books were checked out together. And our Awesome Box project, now in more than 40 libraries, provides a returns box into which users can deposit books that they thought were “awesome,” generating particularly delicious user-based (but completely anonymized) data.
Marshall is right: usage data is insanely useful for a community, and I’d love for us to be able to get our hands on more of it. But, I got into a Twitter discussion about the danger of re-identification with Mark Ockerbloom [twitter:jmarkockerbloom] and John Wilbanks [twitter:wilbanks], two people I greatly respect, and I agree that a simple opt-in isn’t enough, because people may not fully recognize the possibility that their info may be made public. So, I had an idea.
Suppose you are not allowed to do a “soft” opt-in, by which I mean an opt-in that requires you to read some terms and ticking a box that permits the sharing of information about what you check out from the library. Instead, you would be clearly told that you are opting-in to publishing your check-outs. Not to letting your checkouts be made public if someone figures out how to get them, or even to making your checkouts public to anyone who asks for them. No, you’d be agreeing to having a public page with your name on it that lists your checkouts. This is a service a lot of people want anyway, but the point would be to make it completely clear to you that ticking the checkbox means that, yes, your checkouts are so visible that they get their own page. And if you want to agree to the “soft” opt-in, but don’t want that public page posted, you can’t.
Presumably the library checkout system would allow you to exempt particular checkouts, but by default they all get posted. That would, I think, drive home what the legal language expressed in the “soft” version really entails.