Joho the Blog » library

May 13, 2014

Full-text searching Harvard Library: a hacky mashup

Harvard Library has 13M items in its collection. Harvard is digitizing many of them, but as of now you cannot do a full text search of them.

Google Books had 30M books digitized as of a year ago. You can do full-text searches of them.

So, I wrote a little app [Note: I've corrected this url.] that lets you search Google Books for text, and then matches up the results with books in Harvard Library. It’s a proof of concept, and I’m counting the concept as proved, or at least as promising. On the other hand, my API key for Google Books only allows 2,000 queries a day, so it’s not practical on the licensing front.
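For the curious, here is a minimal sketch of the kind of request involved. It is not the app's actual code. It assumes the public Google Books volumes endpoint and its q, maxResults and key parameters; the function name, the QueryBudget class, and the way the 2,000-a-day limit is tracked are hypothetical.

```python
from urllib.parse import urlencode

# Public Google Books volumes endpoint (assumed to be what such an app would call).
GOOGLE_BOOKS_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(text, api_key=None, max_results=10):
    """Build a search URL for a phrase, asking for ten results by default."""
    params = {"q": text, "maxResults": max_results}
    if api_key:
        params["key"] = api_key
    return GOOGLE_BOOKS_ENDPOINT + "?" + urlencode(params)


class QueryBudget:
    """Hypothetical counter for a per-day query allowance.

    Resetting the counter each day is left out of this sketch.
    """

    def __init__(self, limit=2000):
        self.limit = limit
        self.used = 0

    def try_consume(self):
        """Return True and count the query if budget remains, else False."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


if __name__ == "__main__":
    budget = QueryBudget()
    if budget.try_consume():
        print(build_search_url("nothing will come of nothing"))
```

At 2,000 queries a day, a budget like this runs out quickly once more than a handful of people use the page, which is the practical problem noted above.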

This project runs on top of LibraryCloud, an open source library metadata server created by the Harvard Library Innovation Lab that I co-direct (until Sept.). LibraryCloud provides an API to Harvard’s open library metadata and more. (We’re building a new, more scalable version now. It is, well, super-cool.)

But please note that this HOLLIS full-text search thingy is NOT a project done by our highly innovative and highly skilled developers. I did it, which means if you look at the code (github) you will have a good laugh. Also, this service will fail in dull and interesting ways. I am a horrible programmer. (But I enjoy it.)

Some details below the clickable screenshot…


[Screenshot: googleHollis screen capture]

Click here to go to the app.

The Google Books results are on the left (only ten for now), and HOLLIS on the right.

If a Google result is yellow, there’s a match with a book in HOLLIS. Gray means no match. HOLLIS book titles are prefaced by a number that refers to the Google results number. Clicking on the Google results number (in the circle) hides or shows those works in the stack on the right; this is because some Google books match lots of items in HOLLIS. (Harvard has a lot of copies of King Lear, for example.)

There are two types of matches. If an item matched on a firm identifier (ISBN, OCLC, LCCN), then there’s a checkmark before the title in the HOLLIS stack, and there’s a “Stacklife” button in the Google list. Clicking on the Stacklife button displays the book in Harvard StackLife, a very cool (and prize-winning!) library browser created by our Lab. The StackLife stack colorizes items based on how much they’re used by the Harvard community. The thickness of the book indicates its page count and its length indicates its actual physical height.

If there’s no match on the identifiers, then the page looks for a keyword match on the title and an exact match on the author’s last name. This can result in multiple results, not all of which may be right. So, on the Google result there’s a “Feeling lucky” button that will take you to the first match’s entry in StackLife.
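The two-tier matching described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code on GitHub: the record layout (dicts with "title", "author" and lists of "isbn", "oclc", "lccn" values), the function names, the stopword list, and the rule that every keyword in the Google title must appear in the HOLLIS title are all my own guesses at what "keyword match" might mean.

```python
import re

# Hypothetical stopword list so "The Tragedy of King Lear" still matches "King Lear".
STOPWORDS = {"a", "an", "the", "of", "and", "in", "on", "to"}
ID_FIELDS = ("isbn", "oclc", "lccn")


def normalize_id(value):
    """Strip punctuation and case so '0-14-071490-X' equals '014071490x'."""
    return re.sub(r"[^0-9A-Za-z]", "", value).upper()


def id_set(record):
    """Collect (field, normalized value) pairs for every firm identifier."""
    return {(f, normalize_id(v)) for f in ID_FIELDS for v in record.get(f, [])}


def title_keywords(title):
    """Lowercased title words minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def last_name(author):
    """Handle both 'Shakespeare, William' and 'William Shakespeare'."""
    author = author.strip()
    if "," in author:
        return author.split(",")[0].strip().lower()
    parts = author.split()
    return parts[-1].lower() if parts else ""


def match_book(google_book, catalog):
    """Return (kind, records), where kind is 'firm', 'fuzzy' or 'none'."""
    wanted = id_set(google_book)
    firm = [rec for rec in catalog if wanted & id_set(rec)]
    if firm:
        return "firm", firm
    keywords = title_keywords(google_book.get("title", ""))
    surname = last_name(google_book.get("author", ""))
    if not keywords or not surname:
        return "none", []
    fuzzy = [
        rec for rec in catalog
        if keywords <= title_keywords(rec.get("title", ""))
        and last_name(rec.get("author", "")) == surname
    ]
    return ("fuzzy", fuzzy) if fuzzy else ("none", [])


# Made-up sample records, for illustration only.
catalog = [
    {"id": "h1", "title": "King Lear", "author": "Shakespeare, William",
     "isbn": ["0-14-071490-X"]},
    {"id": "h2", "title": "The tragedy of King Lear", "author": "Shakespeare, William"},
    {"id": "h3", "title": "King Lear", "author": "Smith, Jane"},
]
with_isbn = {"title": "King Lear", "author": "William Shakespeare", "isbn": ["014071490X"]}
without_ids = {"title": "King Lear", "author": "William Shakespeare"}
no_match = {"title": "Moby Dick", "author": "Herman Melville"}
```

In this sketch a "firm" result would earn the checkmark, while a "fuzzy" result can return several records, and a "Feeling lucky" button would simply take the first one in the list.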

The “Google” button takes you to that item’s page at Google Books, filtered by your search terms for your full-texting convenience.

The “View” button pops up the Google Books viewer for that book, if it’s available.

The “Clear stack” button deselects all the items in the Google results, hiding all the items in the HOLLIS stack.

Let me know how this breaks or sucks, but don’t expect it ever to be a robust piece of software. Remember its source.


April 1, 2013

Podcast about the DPLA’s status and its relation to public libraries

The latest podcast in the Digital Campus series focuses solely on the current state of the Digital Public Library of America. The discussion includes Dan Cohen who has just accepted the position of Executive Director of the DPLA, which is just wonderful news. Not only does he have a rare combination of skills and experiences — ever hear of Zotero, hmm? — but he is also — and there’s no other way of putting this — nice.

Also on the podcast is Nicholas Carr, who wrote an excellent, skeptical (or at least questioning) article for MIT Tech Review on the DPLA a year ago. Also, Mills Kelly and Tom Scheinfeldt. And me.

Dan explains what the DPLA is. Nick wonders if the DPLA will hurt public libraries. I try to explain why I think it won’t. Amanda suggests the DPLA is the Mr. Potato Head of libraries. I thought it was a good discussion.


March 28, 2013

[annotation][2b2k] Philip Desenne

I’m at a workshop on annotation at Harvard. Philip Desenne is giving one of the keynotes.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

We’re here to talk about the Web 3.0, Phil says — making the Web more fully semantic.

Phil says that we need to re-write the definition of annotation. We should be talking about hyper-nota: digital media-rich annotations. Annotations are important, he says. Try to imagine social networks without the ratings, stars, comments, etc. Annotations also spawn new scholarship.

The new digital annotation paradigm is the gateway to Web 3.0: connecting knowledge through a common semantic language. There are many annotation tools out there. “All are very good in their own media…But none of them share a common model to interoperate.” That’s what we’re going to work on today. “The Open Annotation Framework” is the new digital paradigm. But it’s not a simple model because it’s a complex framework. Phil shows a pyramid: Create / Search / Seek patterns / Analyze / Publish / Share. [Each of these has multiple terms and ideas that I didn't have time to type out.]

Of course we need to abide by open standards. He points to W3C, Open Source and Creative Commons. And annotations need to include multimedia notes. We need to be able to see annotations relating to one another, building networks across the globe. [Knowledge networks FTW!] Hierarchies of meaning allow for richer connections. We can analyze text and other media and connect that metadata. We can look across regional and cultural patterns. We can publish, share and collaborate. All if we have a standard framework.

For this to happen we need a standardized referencing system for segments or fragments of a work. We also need to be able to export them into standard formats such as XML TEI.

Lots of work has been done on this: RDF Models and Ontologies, the Open Annotation Community Group, the Open Annotation Model. “The Open Annotation Model is the common language.”

If we don’t adopt standards for annotation we’ll have disassociated, stagnant info. We’ll decrease innovation in research, teaching, and learning. This is especially an issue when one thinks about MOOCs: a course with 150,000 students creating annotations.

Connective Collective Knowledge has existed for millennia, he says. As far back as Aristarchus, marginalia had symbols to allow pointing to different scrolls in the Library of Alexandria. Where are the connected collective knowledge systems today? Who is networking the commentaries on digital works? “Shouldn’t this be the mission of the 21st century library?”

Harvard has a portal for info about annotations: annotations.harvard.edu


June 6, 2012

1,000 downloads

I learned yesterday from Robin Wendler (who worked mightily on the project) that Harvard’s library catalog dataset of 12.3M records has been bulk downloaded a thousand times, excluding the Web spiderings. That seems like an awful lot to me, and makes me happy.

The library catalog dataset comprises bibliographic records of almost all of Harvard Library’s gigantic collection. It’s available under a CC 0 public domain license for bulk download, and can be accessed through an API via the DPLA’s prototype platform. More info here.


April 24, 2012

[2b2k][everythingismisc]“Big data for books”: Harvard puts metadata for 12M library items into the public domain

(Here’s a version of the text of a submission I just made to BoingBoing through their “Submitterator”)

Harvard University has today put into the public domain (CC0) full bibliographic information about virtually all the 12M works in its 73 libraries. This is (I believe) the largest and most comprehensive such contribution. The metadata, in the standard MARC21 format, is available for bulk download from Harvard. The University also provided the data to the Digital Public Library of America’s prototype platform for programmatic access via an API. The aim is to make rich data about this cultural heritage openly available to the Web ecosystem so that developers can innovate, and so that other sites can draw upon it.

This is part of Harvard’s new Open Metadata policy which is VERY COOL.

Speaking for myself (see disclosure), I think this is a big deal. Library metadata has been jammed up by licenses and fear. Not only does this make accessible a very high percentage of the most consulted library items, I hope it will help break the floodgates.

(Disclosures: 1. I work in the Harvard Library and have been a very minor player in this process. The credit goes to the Harvard Library’s leaders and the Office of Scholarly Communication, who made this happen. Also: Robin Wendler. (next day:) Also, John Palfrey who initiated this entire thing. 2. I am the interim head of the DPLA prototype platform development team. So, yeah, I’m conflicted out the wazoo on this. But my wazoo and all the rest of me is very very happy today.)

Finally, note that Harvard asks that you respect community norms, including attributing the source of the metadata as appropriate. This holds as well for the data that comes from the OCLC, which is a valuable part of this collection.


February 13, 2012

[2b2k] BibSoup is in beta

Congratulations to the Open Knowledge Foundation on the launch of BibSoup, a site where anyone can upload and share a bibliography. It’s a great idea, and an awesome addition to the developing knowledge ecosystem.


December 21, 2011

CBC Spark on ShelfLife and LibraryCloud

The CBC show Spark a couple of days ago ran an 8 minute piece about the two biggest projects coming out of the Harvard Library Innovation Lab, ShelfLife and LibraryCloud. It does a great job cutting together an interview of me with an illuminating narrative from Nora Young. (I co-direct the Lab, along with Kim Dulin, although credit for these apps goes to our team: Annie Jo Cain, Paul Deschner, Jeff Goldenson, Matt Phillips, and Andy Silva.)

Spark also has posted the full, uncut interview and a good blog post about it.


October 21, 2011

[dpla] second session

Maura Marx introduces Jill Cousins of Europeana who says that we all agree that we want to make the contents of libraries, museums and archives available for free. We agree on interoperability and open metadata. She encourages us to adopt the Europeana Data Model. Share our source code. Build our collections together. So, we’re starting with a virtual exhibition of migration of Europeans to America. The DPLA and Europeana will demonstrate the value of their combined collections (text and images) by digitizing material and making it available as an exhibition. (Maura thanks Bob Darnton for building European ties.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Maureen Sullivan, president-elect of the American Library Association, moderates a panel about visions of the DPLA. Each panelist gets 5-7 minutes.


John Palfrey: It’s a bridge we’re building as we walk over it. But it has 5 aspects. 1. Digitizing projects. It’ll be a collection of collections. We should be digitizing in common ways with common formats. But, DPLA will also be: 2. Code. SourceForge for Libraries. Anyone can take and reuse it, including public libraries. 3. Metadata. That’s what makes info findable and usable. It’s the special sauce of librarians. But we haven’t done it yet. We need open access to metadata. 4. Tools and services that ride on top of a common platform. E.g., extraMuros, Scannebagos. 5. Community.


Peggy Rudd, Texas State Library and Archives Commission. We want to see someone walking down the street with a cellphone who says, “I’m going to DPLA it.” We should take as a guiding idea that all people in the country ought to have access to the infrastructure of ideas. We have to think about access. Those of us in public libraries are going to be the digital literacy corps. Public libraries are going to be the institutions that can ensure that people can discover things and will help people evaluate what they find, ensuring what they find is relevant, and help people get the most out of the DPLA.


Brewster Kahle, The Internet Archive. I grew up in a paper world. But I believe the Archivist is right: If it’s not online, it doesn’t exist. There are now two large scale digital library projects in the US. Ten million books are available from a commercial source, and 2M that are public (at OpenLibrary.org). But let’s step back and see where we want to be: Lots of publishers and authors who are paid; a diversity of libraries; everyone can be a reader, no matter what language, proclivities, disabilities. Let’s go and get 10M ebooks. 2M public domain (free), 7M out of print (digitized to be lent), 1M in print (buy ebooks and lend them). Libraries ought to buy ebooks and circulate them, one loan at a time per one book. DPLA ought to help libraries buy new eBooks to lend them, as well as scanning the core 10M book collection, and enable all libraries to get the digital collections. At this point, a 10M ebook collection requires about $30K of computers, which is within the budget of many libraries. For this, we would get universal access to all knowledge. How do we stay on track? Follow the money: is the money being well spent? And follow the bits: the bits should be put in many places. “Together we can build a digital America that is free to all.”


Amanda French begins with John Donne, “Sunrising.” [I am here heavily paraphrasing!] For most, the sun rising is a beginning, but for lovers it is an ending. The unruly sun of the digital text is rising, calling us to work, whereas I would rather snuggle in bed with a book. Love can exist in a commercial relationship, but that’s not ideal. I would like a library that supports me in all my moods, from contemplation to raucous sociality. We need proof of love. Physical libraries manifest that love. The DPLA must manifest itself as more than a web site, many quiet and generous services to readers, developers…technical and social. I agree that if it isn’t online, it doesn’t exist, but if it’s only online, it only half exists. And I want a physical building. Not just a server center. [Again: I've poorly paraphrased.]


Jill Cousins, Europeana. We want the DPLA because we get access to your stuff. [Laughter] But DPLA can improve on Europeana with open data, Open Source, Open Licensing. Also, we should be interoperable. Our new strategic plan has four aspects. 1. Aggregating content as a trusted source. 2. Facilitating, supporting cultural heritage. 3. Distributing: Wherever people are. 4. Engaging: New ways to participate in cultural heritage. Europeana currently has 20M items, multiple languages. I’m particularly interested in the APIs so material can be distributed to where people will use it. (She points to content about the US that is in their distributed collection.) To facilitate: Labeling content so users know it’s in the public domain. What’s in the PD in analog form ought to stay in the PD in digital form. Engage: Cultivate new ways for users to participate in their cultural heritage. One project: People are asked to bring their memorabilia from WWI. So, why DPLA: We are the generation that can give access to the analog past. If we don’t digitize it and put it online, will our kids?


Carl Malamud. When I think of the DPLA, I think of the Hoover Dam and the Golden Gate Bridge. There’s a tremendous reservoir of knowledge waiting to be tapped. Our Internet is flooded with only certain types of knowledge, and other types are not available to all. E.g., our law and policies (the operating system of our society) are not openly available because private fences have enclosed them. E.g., if you’re a creator, you draw on imagery that has accumulated over thousands of years. Creative workers must stand on the shoulders of giants. But much of that imagery is locked up in for-profit corps that have built walls around public domain material. Even the Smithsonian only allows its images to be used by paying for them. We already have beautiful museums and bottomless libraries. What if the DPLA created a common reservoir that we could tap into? What if the Hathi Trust put everything they have into a common pool? Another metaphor: A bridge that connects our capital to the rest of the country. DC is a vast storehouse. Most of the resources are hidden. We need public works projects for knowledge. A national digitization project, a decade long. Deploy the Internet Corps of Engineers. “If a self-appointed librarian in an old church can publish 2M books, why can’t our government do more?”


[I had to see a man about a dog, and missed a couple of questions.]


Q: How do we transform the use of public libraries?
Peggy: They have to evolve, and many are evolving already. E.g., user-created content. 46% of low-income families don’t have computers or Internet access.


Q: Bandwidth is a critical issue, particularly in rural areas. I hope that the DPLA realizes it’s going to have data-heavy materials. How are we going to build bandwidth to the public libraries?
Peggy: I’m happy to see the Gates Foundation here. They’ve worked with local libraries to provide and maintain bandwidth. 5mb is not enough when kids swarm in after school.


Q: Imagine an Ecuadoran American mother who is a part time student. She belongs to a lot of communities. I want to make sure that the coding of the DPLA recognizes that we each live in multiple communities.
Peggy: We all agree.


Q: First, in 1991 a White House conf was talking about not just scanning, but enabling people to send in their materials (e.g., super8 family movies) that could be digitized. Second, DPLA has a huge potential for freeing up resources at the local library so it can spend its resources on customizing content to what that community needs, or let the person customize the library for herself.


Q: How does an ordinary person get involved in DPLA right now. Lobbying?
John: Lots of ways. Mobilization counts. The effect on local libraries needs to be explained; no one here thinks or wants the DPLA to hurt local public libraries. That’s a crazy thought. But that needs to be explained. I would be so sorry if this project led to the closing of a single library. And, yes, I think we should have a way for individuals to donate. How can you get involved in the setting up of this project: Deciding what the DPLA is is an open process. There are six workstreams. Today is meant in part as an invitation to join in those workstreams. There will be meetings over the next 18 months; the meetings will be open. Come. We need people to build with what we create. We need people to think of new use cases. In April 2013 when we come together for the launch, if there are ten more people attending, that will be a sign of success.


Q: What do you have in the collection for children, 0-8? Why will a parent want to use the DPLA?
John: The DPLA needs to create a common infrastructure so people can create libraries and services out of the combined collection. But as a parent of a six- and a nine-year-old, we’ll keep buying paper books and reading to our kids. The DPLA is not a replacement.
Peggy: Univ. of Texas in Arlington did a study of what engages students in the study of the history of Texas. Students perform better on tests if they had a greater interaction with real documents. We’re bringing history to the classrooms.
Carl: The Encyclopedia of Life has pictures of bugs, etc. And the Smithsonian has a great online resource [didn't catch it], and the next thing the kid will want to do is visit the Smithsonian.
Amanda: If it isn’t online people don’t know it exists. If they know …[Ack. Lost the rest of this post. Noooooo]


[dpla] DPLA plenary

I’m at what is in effect the public launch of the Digital Public Library of America (“in effect” because the DPLA has been open to all from the beginning). But today we’re in the theater of the National Archives and have just been greeted by the Archivist of the United States, David Ferriero.

I spent yesterday at the “workstream” meetings of the DPLA. The openness of the DPLA has meant that there has been no moment at which all have agreed on precisely what the DPLA should be. Yesterday could have been a day that had people walking apart from one another or walking toward a center as yet to be fully located. It was a day of walking toward that emergent center. Given the continuing significant differences in the group, my sense is that the convergence was enabled by a shared sense of the value of what we could build, by shared interests and backgrounds (a bunch of librarians and admirers of librarians), and by the careful crafting of the day’s events and processes. (That last goes to the credit of the Berkman Center.)

I am very excited. (I’m also at maximum stress because I am giving an 8.5-minute demo this afternoon…talking to a screencast I did in my hotel room last night, leaving no room for temporal variance. You can see the live prototype here.)

Doron Weber of the Sloan Foundation is now briefly recounting the history of the DPLA, which started with a workshop a year ago. Doron today announced the beginning of a “two year grass roots effort” to build the DPLA. The DPLA is intended to be a platform for discovering our rich shared cultural heritage, he says (approximately). He sketches a very broad agenda, including discovering collections, building them, partnering with other nations, sharing metadata, and exploring doing some form of collective licensing of in-copyright material. (Excellent. I personally don’t want this to become the Digital Public Library of Jane Austen.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Doron announces that Sloan and Arcadia are each contributing $2.5M to support the DPLA over the next 18 months. Woohoo! Peter Baldwin from Arcadia gives a gracious short talk.


September 23, 2011

Tim Spalding on what libraries can learn from LibraryThing

I’m a huge admirer of LibraryThing for its innovative spirit, ability to scale social interactions, and its adding value to books. So, I was very happy to have a chance to interview Tim Spalding, its founder, for a Library Lab podcast, which is now posted.

