Joho the Blog » everythingIsMiscellaneous

December 23, 2011

[2b2k] Is HuffPo killing the news?

Mathew Ingram has a provocative post at Gigaom defending HuffingtonPost and its ilk from the charge that they over-aggregate news to the point of thievery. I’m not completely convinced by Mathew’s argument, but that’s because I’m not completely convinced by any argument about this.

It’s a very confusing issue if you think of it from the point of view of who owns what. So, take the best of cases, in which HuffPo aggregates from several sources and attributes the reportage appropriately. It’s important to take a best case since we’ll all agree that if HuffPo lifts an article in toto without attribution, it’s simple plagiarism. But that doesn’t tell us whether the best cases are also plagiarisms. To make it juicier, assume that in one of these best cases, HuffPo relies heavily on one particular source article. It’s still not a slam-dunk case of theft, because in this example HuffPo is doing what we teach every schoolchild to do: if you use a source, attribute it.

But, HuffPo isn’t a schoolchild. It’s a business. It’s making money from those aggregations. Ok, but we are fine in general with people selling works that aggregate and attribute. Non-fiction publishing houses that routinely sell books that have lots of footnotes are not thieves. And, as Mathew points out, HuffPo (in its best cases) is adding value to the sources it aggregates.

But, HuffPo’s policy even in its best case can enable it to serve as a substitute for the newspapers it’s aggregating. It thus may be harming the sources it’s using.

And here we get to what I think is the most important question. If you think about the issue in terms of theft, you’re thrown into a moral morass where the metaphors don’t work reliably. Worse, you may well mix in legal considerations that are not only hard to apply, but that we may not want to apply given the new-ness (itself arguable) of the situation.

But, I find that I am somewhat less conflicted about this if I think about it in terms of what direction we’d like to nudge our world. For example, when it comes to copyright I find it helpful to keep in mind that a world full of music and musicians is better than a world in which music is rationed. When it comes to news aggregation, many of us will agree that a world in which news is aggregated and linked widely through the ecosystem is better than one in which you—yes, you, since a rule against HuffPo aggregating sources wouldn’t apply just to HuffPo— have to refrain from citing a source for fear that you’ll cross some arbitrary limit. We are a healthier society if we are aggregating, re-aggregating, contextualizing, re-using, evaluating, and linking to as many sources as we want.

Now, beginning by thinking where we want the world to be —which, by the way, is what this country’s Founders did when they put copyright into the Constitution in the first place: “to promote the progress of science and useful arts”—is useful but limited, because to get the desired situation in which we can aggregate with abandon, we need the original journalistic sources to survive. If HuffPo and its ilk genuinely are substituting for newspapers economically, then it seems we can’t get to where we want without limiting the right to aggregate.

And that’s why I’m conflicted. I don’t believe that even if all rights to aggregate were removed (which no one is proposing), newspapers would bounce back. At this point, I’d guess that the Net generation is interested in news mainly insofar as it’s woven together and woven into the larger fabric. Traditional reportage is becoming valued more as an ingredient than a finished product. It’s the aggregators—the HuffingtonPosts of the world, but also the millions of bloggers, tweeters and retweeters, Facebook likers and Google plus-ers, redditors and slashdotters, BoingBoings and Ars Technicas— who are spreading the news by adding value to it. News now only moves if we’re interested enough in it to pass it along. So, I don’t know how to solve journalism’s deep problems with its business models, but I can’t imagine that limiting the circulation of ideas will help, since in this case, the circulatory flow is what’s keeping the heart beating.

 


[A few minutes later] Mathew has also posted what reads like a companion piece, about how Amazon’s Kindle Singles are supporting journalism.

4 Comments »

December 9, 2011

CBC interview with me about library stuff

The CBC has posted the full, unedited interview with me (15 mins) that Nora Young did last week. We talk about the Harvard Library Lab’s two big projects, ShelfLife and LibraryCloud. (At the end, we talk a little about Too Big To Know.) The edited interview will be on the Spark program.

1 Comment »

November 22, 2011

Physical libraries in a digital world

I’m at the final meeting of a Harvard course on the future of libraries, led by John Palfrey and Jeffrey Schnapp. They have three guests in to talk about physical library space.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

David Lamberth lays out an idea as a provocation. He begins by pointing out that until the beginning of the 20th century, a library was not a place but only a collection of books. He gives a quick history of Harvard Library. After the library burned down in 1764, the libraries lived in fear of fire until electric lights came in. The replacement library (Gore Hall) was built out of stone because brick structures need wood on the inside. But stone structures are dank, and many books had to be re-bound every 30 years. Once it filled up, more space was needed; some 25-30 of Harvard’s libraries derived from the search for fireproof buildings, which helps explain the wide distribution of libraries across campus. They also developed more than 40 different classification systems. At the beginning of the 20th C, Harvard’s collection was just over one million volumes. Now it adds up to around 18M. [David's presentation was not choppy, the way this paraphrase is.]

In the 1980s, there was continuing debate about what to do about the need for space. The big issue was open or closed stacks. The faculty wanted the books on site so they could be browsed. But stack space is expensive and you tend to outgrow it faster than you think. So, it was decided not to build any more stack space. There already was an offsite repository (New England Book Depository), but it was decided to build a high density storage facility to remove the non-active parts of the collection to a cheaper, off-site space: The Harvard Depository (HD).

Now more than 40% of the physical collections are at HD. The Faculty of Arts and Sciences started out hostile to the idea, but “soon became converted.” The notion faculty had of browsing the shelves was based on a fantasy: Harvard had never had all the books on a subject on a shelf in a single facility. E.g., search on “Shakespeare” in the Harvard library system: 18,000 hits. Widener Library is where you’d expect to find Shakespeare books. But 8,000 of the volumes aren’t in Widener. Of Widener’s 10K Shakespeare volumes, 4,500 are in HD. So, 25% of what you meant to browse is there. “Shelf browsing is a waste of time” if you’re trying to do thorough research. It’s a little better in the smaller libraries, but the future is not in shelf browsing. Open and closed stacks isn’t the question any more. “It’s just not possible any longer to do shelf browsing, unless we develop tools for browsing in a non-physical fashion.” E.g., catalog browsers, and ShelfLife (with StackView).

There’s nobody in the stacks any more. “It’s like the zombies have come and cleared people out.” People have new alternatives, and new habits. “But we have real challenges making sure they do as thorough research as possible, and that we leverage our collection.” About 12M of the 18M items are barcoded.

A task force saw that within 40 years, over 70% of the physical collection will be off site. HD was not designed to hold the part of the collection most people want to use. So, what can we do that will give us pedagogical and intellectual benefit and realize the incredible resource that our collection is?

Let me present one idea, says David. The Library Task Force said emphatically that Harvard’s collection should be seen as one collection. It makes sense intellectually and financially. But that idea is in contention with the 56 physical libraries at Harvard. Also, most of our collection doesn’t circulate. Only some of it is digitally browsable, and some of that won’t change for a long long long time. E.g., our Arabic journals in Widener aren’t indexed, don’t publish cumulative indexes, and are very hard to index. Thus scholars need to be able to pull them off the shelves. Likewise for big collections of manuscripts that haven’t even been sorted yet.

One idea would be to say: Let’s treat physical libraries as one place as well. Think of them as contiguous, even though they’re not. What if bar-coded books stayed in the library you returned them to? Not shelved by a taxonomy. Random access via the digital, and it tells you where the work is. And build perfect shelves for the works that need to be physically organized. Let’s build perfect Shakespeare shelves. Put them in one building. The other less-used works will be findable, but not browsable. This would require investing in better findability systems, but it would let us get past the arbitrariness of classification systems. Already David will usually go to Amazon to decide if he wants a book rather than take the 5 mins to walk to the library. By focusing on perfect shelves for what is most important to be browsable, resources would be freed up. This might make more space in the physical libraries, so “we could think about what the people in those buildings want to be doing,” so people would come in because there’s more going on. (David notes that this model will not go over well with many of his colleagues.)
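A minimal sketch of the return-where-you-like model David describes, in which a book's location is tracked per barcode rather than derived from a call-number shelf order. All barcodes, library names, and shelf labels here are made up for illustration.

```python
# Toy location tracker: the catalog records wherever a barcoded book
# was last shelved, instead of forcing it back to a fixed taxonomy slot.
locations = {}  # barcode -> (library, shelf)

def check_in(barcode, library, shelf):
    """Shelve a returned book wherever there is room; just record where it went."""
    locations[barcode] = (library, shelf)

def find(barcode):
    """Random access via the digital record: where is this work right now?"""
    return locations.get(barcode)

check_in("HX0001234", "Widener", "D-17")
print(find("HX0001234"))  # ('Widener', 'D-17')

# The same book can later be returned to a different library entirely.
check_in("HX0001234", "Lamont", "A-02")
print(find("HX0001234"))  # ('Lamont', 'A-02')
```

The point of the sketch is that classification and shelf location become independent variables: the classification lives in the catalog, while the shelf is just the last recorded drop-off point.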

53% of library space at Harvard is stack space. The other 47% is split between patron space and staff space. About 20-25% is staff space. Comparatively, Harvard has proportionally less patron space than is typical. The HD is holding half the collection in 20% of the space. It’s 4x as expensive to store a work in a stack on campus as off.

David responds to a question: The perfect shelves should be dynamic, not permanent. That will better serve the evolution of research. There are independent variables: Classification and shelf location. We certainly need classification, but it may not need to map to shelf locations. Widener has bibliographic lists and shelf lists. Barcodes give us more freedom; we don’t have to constantly return works to fixed locations.

Mike Barker: Students already build their own perfect shelves with carrels.

Q: What’s the case for ownership and retention if we’re only addressing temporal faculty needs?

A lot of the collecting in the first half of the 20th C was driven by faculty requests. Not now. The question of retention and purchase splits on the basis of how uncommon the piece of info is. If it’s being sold by Amazon, I don’t think it really matters if we retain it, because of the number of copies and the archival steps already in place. The more rare the work, the more we should think about purchase and retention. But under a third of the stack space on campus has ideal environmental conditions. We shouldn’t put works we buy into those circumstances unless they’re being used.

Q: At the Law Library, we’re trying to spread it out so that not everyone is buying the same stuff. E.g., we buy Peruvian materials because other libraries aren’t. And many law books are not available digitally, so we buy them … but we only buy one copy.

Yes, you’re making an assessment. In the Divinity library, Mike looked at the duplication rate. It was 53%. That is, 53% of our works are duplicated in other Harvard libraries.

Mike: How much do we spend on classification? To create call numbers? We annually spend about $1.5-2M on it, plus another million shelving it. So, $3M-3.5M total. (Mike warns that this is a “very squishy” number.) We circulate about 700,000 items a year. The total operating budget of the Library is about $152M. (He derived the classification figure by asking catalogers how long it takes to classify an item that arrives without a call number, divided into salary.)
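A back-of-envelope check on Mike's "very squishy" figures: roughly $3M-3.5M a year on classification plus shelving, against about 700,000 circulations a year, comes to a few dollars of classification-and-shelving overhead per circulated item. A quick arithmetic sketch:

```python
# Per-circulation cost of classification + shelving, using the rough
# totals quoted in the talk (explicitly labeled "very squishy" there).
low_total, high_total = 3.0e6, 3.5e6   # $/year on classification + shelving
circulations = 700_000                 # items circulated per year

low_per_item = low_total / circulations
high_per_item = high_total / circulations
print(f"${low_per_item:.2f}-${high_per_item:.2f} per circulated item")
```

So even on the low end, each circulation carries several dollars of classification and shelving cost, which is the backdrop for the "do we need call numbers to map to shelf locations?" question.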

David: Scanning in tables of contents, indexes, etc., lets people find things without having to anticipate what they’re going to be interested in.

Q: Where does serendipity fall in this? What about when you don’t know what you’re looking for?

David: I agree completely. My dissertation depended on a book that no one had checked out since 1910. I found it on the stacks. But it’s not on the shelves now. Suppose I could ask a research librarian to bring me two shelves worth of stuff because I’m beginning to explore some area.

Q: What you’re suggesting won’t work so well for students. How would not having stacks affect students?

David: I’m being provocative but concrete. The status quo is not delivering what we think it does, and it hasn’t for the past three decades.

Q: [jeff goldenson] Public librarians tell us that the book trucks of recently returned items are the most interesting place to go. We don’t really have the ability to see what’s moving in the Harvard system. Yes, there are privacy concerns, but just showing what books have been returned would be great.

Q: [palfrey] How much does the rise of the digital affect this idea? Also, you’ve said that the storage cost of a digital object may be more than that of physical objects. How does that affect this idea?

David: Copyright law is the big If. It’s not going away. But what kind of access do you have to digital objects that you own? That’s a huge variable. I’ve premised much of what I’ve said on the working notion that we will continue to build physical collections. We don’t know how much it will cost to keep a physical object for a long time. And computer scientists all say that digital objects are not durable. My working notion here is that the parts that are really crucial are the metadata pieces, which are more easily re-buildable if you have the physical objects. We’re not going to buy physical objects for all the digital items, so the selection principle goes back to how grey or black the items are. It depends on whether we get past the engineering question about digital durability — which depends a lot on electromagnetism as a storage medium, which may be a flash in the pan. We’re moving incrementally.

Q: [me] If we can identify the high value works that go on perfect shelves, why not just skip the physical shelves and increase the amount of metadata so that people can browse them looking for the sort of info they get from going to the physical shelf?

A: David: Money. We can’t spend too much on the present at the expense of the next century or two. There’s a threshold where you’d say that it’s worth digitizing them to the degree you’d need to replace physical inspection entirely. It’s a considered judgment, which we make, for example, when we decide to digitize exhibitions. You’d want to look at the opportunity costs.

David suggests that maybe the Divinity library (he’s in the Phil Dept.) should remove some stacks to make space for in-stack work and discussion areas. (He stresses that he’s just thinking out loud.)

Matthew Sheehy, who runs HD, says they’re thinking about how to keep books for 500 years. They spend $300K/year on electricity to create the right environment. They’ve invested in redundancy. But the walls of the HD will only last 100 years. [Nov. 25: I may have gotten the following wrong:] He thinks it costs about $1/year to store a book, not the usual figure of $0.45.

Jeffrey Schnapp: We’re building a library test kitchen. We’re interested in building physical shelves that have digital lives as well.

[Nov. 25: Changed Philosophy school to Divinity, in order to make it correct. Switched the remark about the cost of physical vs. digital in the interest of truth.]

4 Comments »

October 11, 2011

Classifying folktales

Via Metafilter:

The Aarne-Thompson Classification System

Originally published by Finnish folklorist Antti Aarne and expanded by American Stith Thompson and German Hans-Jörg Uther, the Aarne-Thompson Classification System classifies folktales based on their motifs.

Some examples:
Beauty and the Beast: Type 425C
Bluebeard: Type 312
The Devil Building a Bridge: Type 1191
The Foolish Use of Magic Wishes: Type 750A
Hansel and Gretel and other abandoned children: Type 327
Women forced to marry hogs: Type 441
The Runaway Pancake: Type 2025
Wikipedia has a complete breakdown and examples of most of the tale types.
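The examples above amount to a small lookup table from type code to tale. A sketch of that table as a Python dict (the type codes come from the list above; the lookup helper itself is just an illustration):

```python
# Aarne-Thompson type codes from the examples above, keyed for lookup.
AT_TYPES = {
    "425C": "Beauty and the Beast",
    "312": "Bluebeard",
    "1191": "The Devil Building a Bridge",
    "750A": "The Foolish Use of Magic Wishes",
    "327": "Hansel and Gretel and other abandoned children",
    "441": "Women forced to marry hogs",
    "2025": "The Runaway Pancake",
}

def tale_for(type_code):
    """Look up a tale family by its Aarne-Thompson type code."""
    return AT_TYPES.get(type_code, "unknown type")

print(tale_for("312"))  # Bluebeard
```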

3 Comments »

September 27, 2011

Libraries of the future

We’ve just posted the latest Library Innovation Lab podcast, this one with Karen Coyle, a leading expert in Linked Open Data. Will we have perpetual but interoperable disagreements about how to classify and categorize works and decide what is the “same” work?

And, if you care about libraries and are in the Cambridge (MA) area on Oct. 4, there’s a kickoff event at Sanders Theater at Harvard for a year of conversations about the future of libraries. Sounds great, although I unfortunately will be out of town :(

1 Comment »

June 24, 2011

Tagging the National Archives

The National Archives is going all tag-arrific on us:

The Online Public Access prototype (OPA) just got an exciting new feature — tagging! As you search the catalog, we now invite you to tag any archival description, as well as person and organization name records, with the keywords or labels that are meaningful to you. Our hope is that crowdsourcing tags will enhance the content of our online catalog and help you find the information you seek more quickly.

Nice! (Hat tip to Infodocket.)

Be the first to comment »

June 14, 2011

Linked Open Data take-aways

I just wrote up an informal trip report in the form of “take-aways” from the LOD-LAM conference I attended a couple of weeks ago. Here is a lightly edited version.

 


Because it was an unconference, it was too participatory to enable us to take systematic notes. I did, however, interview a number of attendees, and have posted the videos on the Library Innovation Lab blog site. I actually have a few more yet to post. In addition, during the course of one of the sessions (on “Explaining LOD-LAM”), a few of us began constructing a FAQ.

Here’s some of what I took away from the conference.

- There is considerable momentum around linked open data, starting with the sciences where there is particular research value in compiling huge data sets. Many libraries are joining in.

- LOD for libraries will enable a very fluid aggregation of information from multiple types of sources around any particular object. E.g., a page about a Hogarth illustration (or about Hogarth, or about 18th century London, etc.) could quite easily aggregate information from any data set that knows something about that illustration or about topics linked to that illustration. This information could be used to build a page or to do research.

- Making data and metadata available as LOD enables maximal re-use by others.

- Doing so requires expertise, but should be less massively difficult than supporting many other standards.

- For the foreseeable future, this will be something libraries do in addition to supporting more traditional data standards; it will be an additional expense and effort.

- Although there is continuing debate about exactly which license to use when publishing library data sets, it seems that usually putting any form of license on the data other than a public domain waiver of licenses is likely to be (a) futile and (b) so difficult to deal with that it will inhibit re-use of the data, depriving it of value. (See the 4-star license proposal that came out of this conference.)

- The key point of resistance against LOD among libraries, archives and museums is the justified fear that once the data is released into the world, the curating institutions can no longer ensure that the metadata about an object is correct; the users of LOD might pick up a false attribution, inaccurate description, etc. This is a genuine risk, since LOD permits irresponsible use of data. The risk can be mitigated but not removed.

1 Comment »

June 2, 2011

Schema.org

Bing, Google and Yahoo have announced schema.org, where you can find markup to embed in your HTML that will help those search engines figure out whether you’re talking about a movie, a person, a recipe, etc. The markup seems quite simple. But, more important, by using it your page is more likely to be returned when someone is looking for what your page talks about.

Having the Big Three search engines dictating the metadata form is likely to be a successful move. SEO is a powerful motivator.

4 Comments »

[lodlam] The rise of Linked Open Data

At the Linked Open Data in Libraries, Archives and Museums conf [LODLAM], Jonathan Rees casually offered what I thought was a useful distinction. (Also note that I am certainly getting this a little wrong, and could possibly be getting it entirely wrong.)


Background: RDF is the basic format of data in the Semantic Web and LOD; it consists of statements of the form “A is in some relation to B.”


My paraphrase: Before LOD, we were trying to build knowledge representations of the various realms of the world. Therefore, it was important that the RDF triples expressed were true statements about the world. In LOD, triples are taken as a way of expressing data; take your internal data, make it accessible as RDF, and let it go into the wild…or, more exactly, into the commons. You’re not trying to represent the world; you’re just trying to represent your data so that it can be reused. It’s a subtle but big difference.
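A minimal sketch of that shift: rather than asserting truths about the world, an institution just re-expresses its own records as triples and lets them circulate. Everything here is hypothetical for illustration (the `example.org` URIs, the record fields, and the simplified serializer), and real projects would use an RDF library and standard vocabularies rather than hand-rolled strings.

```python
# Re-express an internal catalog record as RDF-style triples and emit a
# simplified N-Triples serialization. This represents *our data*, not
# a claim about the world -- which is the distinction above.

def record_to_triples(record_uri, record):
    """Turn one internal record into (subject, predicate, object) triples."""
    base = "http://example.org/vocab/"
    return [(record_uri, base + key, value) for key, value in record.items()]

def to_ntriples(triples):
    """Serialize triples in (simplified) N-Triples syntax."""
    lines = []
    for s, p, o in triples:
        obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

record = {"title": "A Rake's Progress", "creator": "William Hogarth"}
triples = record_to_triples("http://example.org/item/42", record)
print(to_ntriples(triples))
```

Once published this way, anyone's aggregator can merge these triples with triples from other collections that mention the same URIs, which is where the reuse value comes from.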


I also like John Wilbanks’ provocative tweet-length explanation of LOD: “Linked open data is duct tape that some people mistake for infrastructure. Duct tape is awesome.”


Finally, it’s pretty awesome to be at a techie conference where about half the participants are women.

3 Comments »

May 27, 2011

A Declaration of Metadata Openness

Discovery, the metadata ecology for UK education and research, invites stakeholders to join us in adopting a set of principles to enhance the impact of our knowledge resources for the furtherance of scholarship and innovation…

What follows is a set of principles that are hard to disagree with.

1 Comment »
