A mailing list I’m on is discussing GenderAvenger.com. Here’s the text from the home page:
Be A Gender Avenger Don’t Accept It. Change It.
Panel of all men? Conference with no women speakers? Book of essays with no women authors? Do something, something simple: Point it out. Opportunities — sadly — abound. How could that be in 2013? They can be found among iconic institutions and in seemingly small bore infractions.
Seeing can be believing. Everywhere possible when women are unrepresented or underrepresented, a gender avenger will take note, take action or ask someone else to take action. No excuses. This effort requires speaking out even when it is uncomfortable. Try it. The outcome could make you smile or groan. Either way you will have a story to tell that could influence others.
The site does a poor job of explaining exactly what it wants by way of input and what the outcome will be, but the email you receive if you decide to sign up anyway cites a HuffPo article about the idea, encourages you to publicize male-dominated conferences, etc., and asks for your participation in a discussion about how to make the idea work.
a·venge [uh-venj] verb (used with object), a·venged, a·veng·ing. 1. to take vengeance or exact satisfaction for: to avenge a grave insult. 2. to take vengeance on behalf of: He avenged his brother.
This person knows that we know (and Gina Glanz, the site’s creator, knows) what the word “avenger” means. He’s not correcting a misuse, the way he might if she’d used “revenge” as a verb. So why is he telling us what he knows we all already know?
Very likely he’s saying that the way people take a word is how the word is defined in a dictionary. But since this mailing list has been together for well over a decade, and since no one on it has ever recommended violent action (it’s moderated by a pacifist), and since the language of the site itself talks about “speaking out even when it’s uncomfortable,” to think that the site or its supporters mean “vengeance” in its dictionary sense requires dropping a whole lot of context in favor of a slavish devotion to Mr. Webster. It would be perfectly reasonable to push back on the word because it carries bad connotations or because it doesn’t quite fit the intended meaning, but neither of those conversations is advanced by citing the dictionary definition of a common word. Rather, the argument is over territory beyond the sovereignty of a dictionary.
In short (or as the kids say, TL;DR), if you’re citing a definition of a word that everyone understands, you’re probably missing the point.
Hanan Cohen points me to a blog post by an MLIS student at Haifa U., named Shir, in which she discourses on the term “paradata.” Shir cites Mark Sample, who in 2011 posted a talk he had given at an academic conference. Mark notes the term’s original meaning:
In the social sciences, paradata refers to data about the data collection process itself—say the date or time of a survey, or other information about how a survey was conducted.
Mark intends to give it another meaning, without claiming to have worked it out fully:
…paradata is metadata at a threshold, or paraphrasing Genette, data that exists in a zone between metadata and not metadata. At the same time, in many cases it’s data that’s so flawed, so imperfect that it actually tells us more than compliant, well-structured metadata does.
His example is We Feel Fine, a collection of tens of thousands (or more … I can’t open the site because Amtrak blocks access to what it intuits might be intensive multimedia) of sentences that begin “I feel” from many, many blogs. We Feel Fine then displays the stats in interesting visualizations. Mark writes:
…clicking the Age visualizations tells us that 1,223 (of the most recent 1,500) feelings have no age information attached to them. Similarly, the Location visualization draws attention to the large number of blog posts that lack any metadata regarding their location.
Unlike many other massive datamining projects, say, Google’s Ngram Viewer, We Feel Fine turns its missing metadata into a new source of information. In a kind of playful return of the repressed, the missing metadata is colorfully highlighted—it becomes paradata. The null set finds representation in We Feel Fine.
So, that’s one sense of paradata. But later Mark makes it clear (I think) that We Feel Fine presents paradata in a broader sense: it is sloppy in its data collection. It strips out HTML formatting, which can contain information about the intensity or quality of the statements of feeling the project records. It’s lazy in deciding which images from a target site it captures as relevant to the statement of feeling. Yet, Mark finds great value in We Feel Fine.
His first example, where the null set is itself metadata, seems unquestionably useful. It applies to any unbounded data set. For example, that no one chose answer A on a multiple choice test is not paradata, just as the fact that no one has checked out a particular item from a library is not paradata. But that no one used the word “maybe” in an essay test is paradata, as would be the fact that no one has checked out books in Aramaic and Klingon in one bundle. Getting a zero in a metadata category is not paradata; getting a null in a category that had not been anticipated is paradata. Paradata should therefore include which metadata categories are missing from a schema. E.g., that Dublin Core does not have a field devoted to reincarnation says something about the fact that it was not developed by Tibetans.
But I don’t think that’s at the heart of what Mark means by paradata. Rather, the appearance of the null set is just one benefit of considering paradata. Indeed, I think I’d call this “implicit metadata” or “derived metadata,” not “paradata.”
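The null-set idea can be sketched in a few lines of Python: count how many records lack a value for each field, and separately notice fields that turn up in the data but were never in the schema. This is only an illustration under made-up assumptions. The records, the field names, and both function names are hypothetical, and none of this is how We Feel Fine actually works.

```python
from collections import Counter

# Hypothetical records; "age" and "location" are optional metadata fields.
posts = [
    {"text": "I feel fine", "age": 23, "location": "Boston"},
    {"text": "I feel tired", "location": "Haifa"},
    {"text": "I feel lost", "mood_color": "blue"},
]

def missing_counts(records, schema):
    """For each schema field, count the records that have no value for it."""
    return {f: sum(1 for r in records if r.get(f) is None) for f in schema}

def unanticipated_fields(records, schema):
    """Count fields present in the data that the schema never anticipated."""
    return Counter(k for r in records for k in r if k not in schema)

schema = ["text", "age", "location"]
print(missing_counts(posts, schema))
print(unanticipated_fields(posts, schema))
```

The first function reports gaps within the categories we thought to ask about; the second surfaces categories we did not think to ask about at all.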
The fuller sense of paradata Mark suggests (“data that exists in a zone between metadata and not metadata”) is both useful and, as he cheerfully acknowledges, “a big mess.” It immediately raises questions about the differences between paradata and pseudodata: if We Feel Fine were being sloppy without intending to be, and if it were presenting its “findings” as rigorously refined data at, say, the biennial meeting of the Society for Textual Analysis, I don’t think Mark would be happy to call it paradata.
Mark concludes his talk by pointing at four positive characteristics of the We Feel Fine site: it’s inviting, paradata, open, and juicy. (“Juicy” means that there’s lots going on and lots to engage you.) It seems to me that the site’s only an example of paradata because of the other three. If it were a jargon-filled, pompous site making claims to academic rigor, the paradata would be pseudodata.
This isn’t an objection or a criticism. In fact, it’s the opposite. Mark’s post, which is based on a talk that he gave at the Society for Textual Analysis, is a plea for research that is inviting, open, juicy, and willing to acknowledge that its ideas are unfinished. Mark’s post is, of course, paradata.
On Wednesday and Thursday I went to the second LODLAM (linked open data for libraries, archives, and museums) unconference, in Montreal. I’d attended the first one in San Francisco two years ago, and this one was almost as exciting — “almost” because the first one had more of a new car smell to it. This is a sign of progress and by no means is a complaint. It’s a great conference.
But, because it was an unconference with up to eight simultaneous sessions, there was no possibility of any single human being getting a full overview. Instead, here are some overall impressions based upon my particular path through the event.
Serious progress is being made. E.g., Cornell announced it will be switching to a full LOD library implementation in the Fall. There are lots of great projects and initiatives already underway.
Some very competent tools have been developed for converting to LOD and for managing LOD implementations. The development of tools is obviously crucial.
There isn’t obvious agreement about the standard ways of doing most things. There’s innovation, re-invention, and lots of lively discussion.
Some of the most interesting and controversial discussions were about whether libraries are being too library-centric and not web-centric enough. I find this hugely complex and don’t pretend to understand all the issues. (Also, I find myself — perhaps unreasonably — flashing back to the Standards Wars in the late 1980s.) Anyway, the argument crystallized to some degree around BIBFRAME, the Library of Congress’ initiative to replace and surpass MARC. The criticism raised in a couple of sessions was that Bibframe (I find the all caps to be too shouty) represents how libraries think about data, and not how the Web thinks, so that if Bibframe gets the bib data right for libraries, Web apps may have trouble making sense of it. For example, Bibframe is creating its own vocabulary for talking about properties that other Web standards already have names for. The argument is that if you want Bibframe to make bib data widely available, it should use those other vocabularies (or, more precisely, namespaces). Kevin Ford, who leads the Bibframe initiative, responds that you can always map other vocabs onto Bibframe’s, and while Richard Wallis of OCLC is enthusiastic about the very webby Schema.org vocabulary for bib data, he believes that Bibframe definitely has a place in the ecosystem. Corey Harper and Debra Riley-Huff, on the other hand, gave strong voice to the cultural differences. (If you want to delve into the mapping question, explore the argument about whether Bibframe’s annotation framework maps to Open Annotation.)
I should add that although there were some strong disagreements about this at LODLAM, the participants seemed to be genuinely respectful.
LOD remains really really hard. It is not a natural way of thinking about things. Of course, neither are old-fashioned database schemas, but schemas map better to a familiar forms-based view of the world: you fill in a form and you get a record. Linked data doesn’t even think in terms of records. Even with the new generation of tools, linked data is hard.
LOD is the future for library, archive, and museum data.
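To make the record-versus-linked-data contrast above a little more concrete, here is a rough Python sketch that flattens a form-style record into subject-predicate-object statements, which is the basic shape linked data takes. Everything named here is an assumption for illustration: the example.org URIs, the field names, and the `record_to_triples` helper are hypothetical, and real implementations would use an RDF toolkit and agreed-upon vocabularies.

```python
# A form-style record: one bundle of fields describing one thing.
record = {
    "title": "Moby-Dick",
    "creator": "http://example.org/person/melville",  # hypothetical URI
    "subject": ["Whaling", "Sea stories"],
}

def record_to_triples(subject_uri, fields):
    """Turn each field of a record into one or more (s, p, o) statements."""
    triples = []
    for predicate, value in fields.items():
        # A multi-valued field becomes several independent statements.
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((subject_uri, predicate, v))
    return triples

book = "http://example.org/book/moby-dick"  # hypothetical URI
for triple in record_to_triples(book, record):
    print(triple)
```

Once the record is broken into statements, nothing holds it together except the shared subject, and anyone else can add statements about the same URI. That is roughly why linked data "doesn't think in terms of records."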
Here’s a list of brief video interviews I did at LODLAM:
Kevin Ford from the Library of Congress is talking about BIBFRAME, which he describes as a replacement for MARC and a rethinking of the entire ecosystem.
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.
(If a response isn’t labeled “Kevin,” then it wasn’t Kevin. Also, this is much compressed, incomplete, and choppy. Also, I haven’t re-read it.)
Q: From the Bibframe mailing list it seems like there isn’t agreement about what Bibframe is trying to achieve.
Kevin: Sometimes people see it narrowly.
Q: It’s not clear how Bibframe gets to where it replaces MARC.
Kevin: We’re not holding back some plan or roadmap that we’ve mapped out perfectly with milestones and target dates. We’re taking it as it comes.
Q: There’s a perception on the part of vendors and customers of vendors that this is a new data specification that vendors will have to support, and that that’s its main function, and possibly that’s pushing the knowledge representation in a direction that’s favorable to the vendors — a direction that’s too simple.
Q: Is there an agreement about the end point?
Kevin: There’s agreement that it needs to do what MARC does but better. We’re doing data representation, not predicting the systems built on top of it.
Q: What are the functional requirements that Bibframe’s trying to meet with this new model? What are your metrics? And who are you trying to satisfy?
Kevin: It’s not vendor focused. We hope systems will be built that expose the data as linked data.
Q: Bibframe lets you associate a record with a particular work, which is a huge advance.
Q: Bibframe used to talk about roundtripping from MARC to Bibframe to MARC. But Bibframe is now adding info, so I don’t see how roundtripping is possible.
Kevin: Not losslessly.
Q: Bibframe is intended for libraries, but from what I’ve seen it doesn’t seem that Bibframe is intended for use outside of libraries. There doesn’t seem to be any thought about how other ontologies might be overlaid. And that was a problem with MARC: it was too library-centric. Why not investigate mapping it into other vocabularies?
Kevin: Nothing stops you from including other namespaces. As for mapping to other vocabularies, we’re working on a 40 year time scale and can’t know that other vocabularies will be around.
Q: We need some community-building to make that happen. We need to be careful not to build an ontological silo.
Q: The naming of this data set is unfortunate: Why “bib,” which has a connotation of books, when really it should be about any kind of information-bearing object? Why not call it “InfoFrame”? Who uses “bibliographic” other than libraries? Why limit yourself?
Kevin: I cannot begin to tell you how much time was spent on what this thing should be called. It went through a couple of different names. It’s not an ideal name, but I hope that the “bib” association falls by the wayside.
Q: The library ecosystem includes articles, licenses, and many other things that weren’t part of MARC. Is Bibframe aiming at representing all of that?
Kevin: Yes, it’s in scope. Certainly data about journal articles.
Kevin: Yes, Bibframe lets you define your own fields, as in MARC.
Q: We’re going from cataloging to catalinking: from records about resources to links related to topics, etc.
A: We need services that will link resources to other resources. Bibframe doesn’t do that, but it’s more amenable to it than MARC.
Kevin: [Sorry, but I missed the beginning of this.] When it comes to subject headings, we expect you to resolve that URI. If people are doing that every single time, then it’s a candidate for being included. That lookup could be a query into your local system. I’ve assumed you’ll have to have a local copy of it.
Q: Versioning? Why did you ignore the work of the British Library?
Kevin: We didn’t ignore it at all. We need to attend to what’s achievable by the smallest institutions as well as the largest.
Q: For a small institution, is it practical to move away from MARC?
Kevin: Not for some. Some still use card catalogs. I expect some of the first systems will be an outward layer around legacy systems.
Q: We need a larger discussion about provenance and about trust on the semantic web. Libraries should be better participants in that discussion; it’s a deeply important space for us.
Q: This conversation makes me cynical about our profession’s involvement. We need to be talking with users. We need community involvement. We’re worried about the longevity of FOAF? It’ll outlast Bibframe because people actually use it. Let’s keep turning inward until we’re completely irrelevant.
Q: Yeah, the idea that there has to be one namespace seems so counter to the principles of linked data.
Q: Do we have anyone outside of the library community here?
A: I’m mainly a web developer. There’s a really big gulf. The Web will win when it comes to how libraries operate. Whether Bibframe will be a part of it remains to be seen. In the web community, everything seems exciting, but I feel so much angst in the library community.
I’ve just finished leading two days of workshops at University of Stuttgart as part of my fellowship at the Internationales Zentrum für Kultur- und Technikforschung. (No, I taught in English.) This was for me a wonderful experience. First of all, the students were engaged, smart, and fun, and they talked from diverse standpoints. Second, it reminded me how to teach. I had so much trouble trying to structure sessions, feeling totally unsure how one does so. But the eight 1.5 hour sessions reminded me why I loved teaching.
For my own memory, here are the sessions (and if any of you were there and took notes, I’d love to see them):
#2 Information Age to Age of Connected. Why Ted Nelson’s Xanadu did not succeed the way the Web did. Rough technical architecture of the Net and (perhaps) its embedded political values. Hyperlinks.
#3 Digital order. Everything is miscellaneous? From information retrieval to search engines. Schema-based databases to tagging.
#4 Networked knowledge. What knowledge looks like once it’s been freed of paper. Four challenges to networked knowledge (with many more added by the students.)
On Saturday we talked about topics that the students decided were interesting:
#1 Mobile net. Is Facebook making us more or less social? Why do we fill up every interstice by using Facebook on mobiles? What does this say about us and the notion of the self?
#2 Downloading. Do you download music illegally? What is your justification? How might artists respond? Why is the term “intellectual property” so loaded?
#3 Education. What makes a great in-person course? What makes for a miserable one? Oddly, many of the characteristics of miserable classes are also characteristics of MOOCs. What might we do about that? How much of this is caused by the fact that MOOCs are construed as courses in the traditional sense?
#4 Internet culture. Is there such a thing? If there are many, is any particular one to be privileged? How does the Net look to a culture that is dedicated to warding off what it sees as corrupting influences? End with LolCatBible and the astounding TheJohnnyCashProject.
Thank you, students. This experience meant a great deal to me.
NOTE on May 23: OCLC has posted corrected numbers. I’ve corrected them in the post below; the changes are mainly fractional. So you can ignore the note immediately below.
NOTE a couple of hours later: OCLC has discovered a problem with the analysis. So please ignore the following post until further notice. Apologies from the management.
Ever since the 1960s, publishers have used ISBN numbers as identifiers of editions of books. Since the world needs unique ways to refer to unique books, you would think that ISBN would be a splendid solution. Sometimes and in some instances it is. But there are problems, highlighted in the latest analysis run by OCLC on its database of almost 300 million records.
Number of ISBNs
Percentage of the records
So, 78% of OCLC’s humungous collection of book records have no ISBN, and only 1.6% have the single ISBN that God intended.
As Roy Tennant [twitter: royTennant] of OCLC points out (and thanks to Roy for providing these numbers), many works in this collection of records pre-date the 1960s. Even so, the books with multiple ISBNs reflect the weakness of ISBNs as unique identifiers. ISBNs are essentially SKUs to identify a product. The assigning of ISBNs is left up to publishers, and they assign a new one whenever they need to track a book as an inventory item. This does not always match how the public thinks about books. When you want to refer to, say, Moby-Dick, you probably aren’t distinguishing between one with illustrations, a large-print edition, and one with an introduction by the Deadliest Catch guys. But publishers need to make those distinctions, and that’s who ISBN is intended to serve.
This reflects the more general problem that books are complex objects, and we don’t have settled ways of sorting out all the varieties allowed within the concept of the “same book.” Same book? I doubt it!
Still, these numbers from OCLC exhibit more confusion within the ISBN number space than I’d expected.
MINUTES LATER: Folks on a mailing list are wondering if the very high percentage of records with two ISBNs is due to the introduction of 13-digit ISBNs to supplement the initial 10-digit ones.
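One way to test that hunch would be to check whether a record's two ISBNs are just the 10-digit and 13-digit forms of the same number. The conversion is standard: prefix "978" to the first nine digits of the ISBN-10 and recompute the check digit with alternating weights of 1 and 3. Here is a minimal Python sketch; the function names are mine, it assumes the inputs are already well-formed ISBNs, and it has nothing to do with how OCLC actually ran its analysis.

```python
def normalize(isbn):
    """Strip hyphens and spaces; uppercase a trailing 'x' check digit."""
    return isbn.replace("-", "").replace(" ", "").upper()

def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to its ISBN-13 form (no validation of the input)."""
    core = "978" + normalize(isbn10)[:9]  # drop the old check digit
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... over 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

def is_10_13_pair(a, b):
    """True if one ISBN is simply the 13-digit form of the other."""
    a, b = normalize(a), normalize(b)
    if len(a) == 13 and len(b) == 10:
        a, b = b, a
    return len(a) == 10 and len(b) == 13 and isbn10_to_isbn13(a) == b

print(isbn10_to_isbn13("0-306-40615-2"))  # 9780306406157
print(is_10_13_pair("9780306406157", "0306406152"))
```

Running a check like `is_10_13_pair` over the two-ISBN records would show how many of them are one identifier written two ways rather than two genuinely different editions.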
Amanda Filipacchi has a great post at the New York Times about the problem with classifying American female novelists as American female novelists. That’s been going on at Wikipedia, with the result that the category American novelist was becoming filled predominantly with male novelists.
Part of this is undoubtedly due to the dumb sexism that thinks that “normal” novelists are men, and thus women novelists need to be called out. And even if the category male novelist starts being used, it still assumes that gender is a primary way of dividing up novelists, once you’ve segregated them by nation. Amanda makes both points.
From my point of view, the problem is inherent in hierarchical taxonomies. They require making decisions not only about the useful ways of slicing up the world, but also about which slices come first. These cuts reflect cultural and political values and have cultural and political consequences. They also get in the way of people who are searching with a different way of organizing the topic in mind. In a case like this, it’d be far better to attach tags to Wikipedia articles so that people can search using whatever parameters they need. That way we get better searchability, and Wikipedia hasn’t put itself in the impossible position of coming up with a taxonomy that is neutral to all points of view.
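Here is a small Python sketch of the difference, using a toy set of articles. With flat tags, no facet has to come first: a search is just "which articles carry all of these tags?" The data structure and the `find` function are hypothetical illustrations, not a description of anything Wikipedia offers.

```python
# Each article carries an unordered set of tags; no tag outranks another.
articles = {
    "Toni Morrison": {"american", "novelist", "female"},
    "Herman Melville": {"american", "novelist", "male"},
    "Virginia Woolf": {"british", "novelist", "female"},
}

def find(tagged, *wanted):
    """Return the names of articles that carry every requested tag."""
    need = set(wanted)
    return sorted(name for name, tags in tagged.items() if need <= tags)

print(find(articles, "american", "novelist"))
print(find(articles, "female", "novelist"))
```

The searcher decides which slices matter for this query. In a hierarchy, by contrast, someone has to decide in advance whether nationality sits above gender or the other way around, and everyone lives with that choice.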
Wikipedia’s categories have been broken for a long time. We know this in the Library Innovation Lab because a couple of years ago we tried to find every article in Wikipedia that is about a book. In theory, you can just click on the “Book” category. In practice, the membership is not comprehensive. The categories are inconsistent and incomplete. It’s just a mess.
It may be that a massive crowd cannot develop a coherent taxonomy because of the differences in how people think about things. Maybe the crowd isn’t massive enough. Or maybe the process just needs far more guidance and regulation. But even if the crowd can bring order to the taxonomy, I don’t believe it can bring neutrality, because taxonomies are inherently political.
There are problems with letting people tag Wikipedia articles. Spam, for example. And without constraints, people can lard up an object with tags that are meaningful only to them, offensive, or wrong. But there are also social mechanisms for dealing with that. And we’ve been trained by the Web to lower our expectations about the precision and recall afforded by tags, whereas our expectations are high for taxonomies.
I’m very proud to announce that the Harvard Library Innovation Lab (which I co-direct) has launched what we think is a useful and appealing way to browse books at scale. This is timed to coincide with the launch today of the Digital Public Library of America. (Congrats, DPLA!!!)
StackLife (née ShelfLife) shows you a visualization of books on a scrollable shelf, which we turn sideways so you can read the spines. It always shows you books in a context, on the grounds that no book stands alone. You can shift the context instantly, so that you can (for example) see a work on a shelf with all the other books classified under any of the categories professional cataloguers have assigned to it.
We also heatmap the books according to various usage metrics (“StackScore”), so you can get a sense of the work’s community relevance.
There are lots more features, and lots more to come.
We’ve released two versions today.
StackLife DPLA mashes up the books in the Digital Public Library of America’s collection (from the Biodiversity Heritage Library) with books from The Internet Archive’s Open Library and the Hathi Trust. These are all online, accessible books, so you can just click and read them. There are 1.7M in the StackLife DPLA metacollection. (Development was funded in part by a Sprint grant from the DPLA. Thank you, DPLA!)
StackLife Harvard lets you browse the 12.3M books and other items in the Harvard Library system’s 73 libraries and off-campus repository. This is much less about reading online (unfortunately) than about researching what’s available.