Joho the Blog » library

September 23, 2011

Tim Spalding on what libraries can learn from LibraryThing

I’m a huge admirer of LibraryThing for its innovative spirit, its ability to scale social interactions, and the value it adds to books. So, I was very happy to have a chance to interview Tim Spalding, its founder, for a Library Lab podcast, which is now posted.

Categories: libraries, too big to know Tagged with: books • library • librarything • podcast Date: September 23rd, 2011 dw

1 Comment »

June 2, 2011

OCLC to release 1 million book records

At the LODLAM conference, Roy Tennant said that OCLC will be releasing the bibliographic info about the top million most popular books. It will be released in a linked data format, under an Open Database license. This is a very useful move, although we need to know what the license is. We can hope that it does not require attribution, and does not come with any further license restrictions. But Roy was talking in the course of a timed two-minute talk, so he didn’t have a lot of time for details.

This is at least a good step and maybe more than that.

Categories: everythingIsMiscellaneous, libraries, open access, too big to know Tagged with: library • metadata • oclc • open access Date: June 2nd, 2011 dw

2 Comments »

May 17, 2011

[dpla] Amsterdam, Monday morning session

John Palfrey: The DPLA is ambitious and in the early stages. We are just getting our ideas and our team together. We are here to listen. And we aspire to connect across the ocean. In the U.S. we haven’t coordinated our metadata efforts well enough.

One of the core principles is interoperability across systems and nations. It also means interoperability at the human and institutional layers. “We should start with the presumption of a high level of interoperability.” We should start with that as a premise “in our dna.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Dan Brickley is asked to give us an on-the-spot, impromptu history of linked data. He begins with a diagram from Tim Berners-Lee (w3c.org/history/1989) that showed the utility of a cloud of linked documents and things. [It is the typed links of Enquire blown out to a web of info.] At an early Web conf in 1994 TBL suggested a dynamic of linked documents and of linked things. One could then ask questions of this network: What systems depend on this device? Where is the doc being used? RDF (1997) lets you answer such questions. It grew out of PICS, an early attempt to classify and rate Web objects. Research funding arrived around 2000. TBL introduced the semantic web. Conferences and journals emerged, frustrating hackers who thought RDF was about solving problems. The Semantic Web people seemed to like complex “knowledge representation” systems. The RDF folks were more like “Just put the data on the Web.”

For example, FOAF (friend of a friend) identified people by pointing to various aspects of the person. TBL in 2005 critiqued that, saying that it should instead point to URIs. So, to refer to a person, you’d put in a URL to info that talks about them. Librarians were used to using URLs as pointers, not information. TBL further said that the URI should point to more URIs, e.g., the URL for the school that the person went to. TBL’s 4 rules: 1. Use URIs as names for things. 2. Make sure HTTP can fetch them. 3. Make sure what you fetch is machine-friendly. 4. Make sure the links use URIs. This spreads the work of describing a resource around the Web.
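The four rules can be made concrete with a toy example. What follows is a minimal sketch in Python, not anything shown in the talk: the example.org URIs, the property names, and the fetch and follow helpers are all invented for illustration, and a dictionary stands in for real HTTP requests.

```python
# Hypothetical mini "web": every thing is named by a URI (rule 1), and each
# URI maps to a machine-readable description. All names here are invented.
WEB = {
    "http://example.org/people/alice": {
        "name": "Alice",
        "attended": "http://example.org/schools/state-u",
    },
    "http://example.org/schools/state-u": {
        "name": "State University",
        "locatedIn": "http://example.org/places/springfield",
    },
    "http://example.org/places/springfield": {"name": "Springfield"},
}


def fetch(uri):
    """Stand-in for an HTTP GET that returns structured data (rules 2 and 3)."""
    return WEB[uri]


def follow(uri, *properties):
    """Start at a URI and hop across properties whose values are URIs (rule 4)."""
    for prop in properties:
        uri = fetch(uri)[prop]
    return fetch(uri)


school = follow("http://example.org/people/alice", "attended")
print(school["name"])
```

The description of the person holds only a link to the school, so whoever publishes the school's URI does the work of describing the school.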

Linked Data often takes a database-centric view of the world: building useful databases out of swarms of linked data.

Q: [me] What about ontologies?
A: When RDF began, an RDF schema defined the pieces and their relationships. OWL and ontologies let you make some additional useful restrictions. Linked data people tend to care about particularities. So, how do you get interoperability? You can do it. But the machine stuff isn’t subtle enough to be able to solve all these complex problems.

Europeana

Paul Keller says that copyright is supposed to protect works, but not the data they express. Cultural heritage orgs generally don’t have copyright on their material, but they insist on copyrighting the metadata they’ve generated. Paul is encouraging them to release their metadata into the public domain. The orgs are all about minimizing risk. Paul thinks the risks are not the point. They ought to just go ahead and establish themselves as the preservers and sources of historical content. But the boards tend to be conservative and risk-averse.

Q: US law allows copyright of the arrangement of public domain content. And do any of the collecting societies assert copyright?
A: The OCLC operates the same way in Europe. There’s a proposed agreement that would authorize the aggregators to provide their aggregators under a CC0 public domain license.

Q: Some organizations limit images to low resolution to avoid copyright issues. Can you do the same for data?
A: A high res description has lots of information about how it derived the info.

Antoine Isaac (Vrije Universiteit Amsterdam) has worked on the data model for Europeana. ESE (Europeana Semantic Elements) are like a Dublin Core for objects: a lowest common denominator. They are looking at a richer model, Europeana Data Model. Problems: Ingesting refs to digitized material, ingesting descriptive metadata from many institutions, building generic services to enhance access to objects.

Fine-grained data: Merging multiple records can lead to self-contradiction. Have to remember which data came from which source. Must support objects that are composed of other objects. Support for contextual resources (e.g., descriptions of persons, objects, etc.) including concepts, at various levels of detail.
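One way to picture the provenance requirement is a rough sketch, assuming records arrive as simple field/value pairs. The provider names, fields, and values are made up, and this is not Europeana's actual data model.

```python
from collections import defaultdict


def merge_records(records):
    """Merge field values from several providers, keeping who said what.

    `records` is a list of (provider, fields) pairs. The result maps each
    field to {value: [providers asserting that value]}.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for provider, fields in records:
        for field, value in fields.items():
            merged[field][value].append(provider)
    return {field: dict(values) for field, values in merged.items()}


def conflicts(merged):
    """Return the fields on which providers disagree, in sorted order."""
    return sorted(field for field, values in merged.items() if len(values) > 1)


# Hypothetical descriptions of the same object from two providers.
RECORDS = [
    ("museum-a", {"title": "View of Delft", "date": "1660"}),
    ("library-b", {"title": "View of Delft", "date": "1661"}),
]

MERGED = merge_records(RECORDS)
print(conflicts(MERGED))
print(MERGED["date"])
```

Because every value keeps its list of providers, a contradiction shows up as a field with more than one value rather than being silently overwritten.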

Europeana is aiming at interoperability through links (connecting resources), through semantics (complex data semantically interoperable with simpler objects), and through re-use of vocabularies (e.g., OAI-ORE, Dublin Core, SKOS, etc.). They create a proxy object for the actual object, so they don’t have to mix with the data that the provider is providing. (Antoine stresses that the work on the data model has been highly collaborative.)

Q: Do we end up with what we have in looking up flight info? Or can we have single search?
A: Most important we’re working on the back end, not yet working on the front end.
The Lin

Q: Will you provide resolution services, providing all the identifiers that might go with an object?
A: Yes.

Stefan Gradmann also points to the TBL diagram with typed links. Linked Data extends this in type (RDF) and scope. RDF uses triples (subject-predicate-object). He refers to TBL’s four rules. Stefan says we may be at the point of having too many triples. The LinkingOpenData group wants to build a data commons. (See Tom Heath and Chris Bizer.) It is currently discussing how to switch from volume aggregation to quality. Quality is about “matching, mapping, and referring things to each other.”
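For readers new to triples, here is a minimal sketch of the subject-predicate-object idea, using plain Python tuples rather than any real RDF library. The "ex:" names are invented, and the query is the kind of question Dan Brickley mentioned earlier ("What systems depend on this device?").

```python
# Each triple is (subject, predicate, object). All "ex:" names are invented.
TRIPLES = [
    ("ex:billing-system", "ex:dependsOn", "ex:server-42"),
    ("ex:payroll-system", "ex:dependsOn", "ex:server-42"),
    ("ex:server-42", "ex:locatedIn", "ex:building-7"),
    ("ex:manual-17", "ex:describes", "ex:server-42"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the given parts; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "What systems depend on this device?"
dependents = [
    s for s, _, _ in match(TRIPLES, predicate="ex:dependsOn", obj="ex:server-42")
]
print(dependents)
```

A real triple store adds URIs, typed literals, and a query language on top, but the pattern-with-wildcards idea is the same.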

The LOD project is different. It’s a large-scale integration project, running through Aug 2014. It’s building technology around the cloud of linked open data. It includes the Comprehensive Knowledge Archive Network (CKAN) and DBpedia extraction from Wikipedia.

Would linked data work if it were not open? Technically, it’s feasible. But it’s very expensive, since you have to authorize the de-referencing of URIs. Or you could do it behind a proxy, so you use the work of others but do not contribute. Europeana is going for openness, under CC0: http://bit.ly/fe637P You cannot control how open data is used, you can’t make money from it, and you need attractive services to be built on top of it, including commercial services. Europeana does not exclude commercial reuse of linked open data. Finally, we need to be able to articulate what the value of this linked data is.

Q: How do we keep links from rotting?
A: The Web doesn’t understand versioning. One option is to use the ORE resource maps, versioning aggregations.

Q: Some curators do not want to make sketchy metadata public.
A: The metadata ought to state that the metadata is sketchy, and ask the user to improve it. We need to track the meta-metadata.

Stefan: We only provide top-level classifications and encourage providers to add the more fine-grained.

Q: How do we establish the links among the bubbles? Most are linked to DBpedia, not to one another?
A: You can link on schema or instance level. The work doesn’t have to be done solely by Europeana.

Q: The World Intellectual Property Organization is meeting in the fall. A library federation is proposing an ambitious international policy on copyright. Perhaps there should be a declaration of a right to open metadata.
A: There are database rights in Europe, but generally not outside of it. CC0 would normalize the situation. We think you don’t have to require attribution and provenance because norms will handle that, and requiring it would slow development.

Q: You are not specifying below a high level of classification. Does that then fragment the data?
A: We allow our partners to come together with shared profiles. And, yes, we get some fragmentation. Or, we get diversity that corresponds to diversity in the real world. We can share contextualization policies: which are our primary goals when contextualizing goals, e.g., we use VIAF rather than FOAF when contextualizing a person. Sort of a folksonomic process: a contributor will see that others have used a particular vocabulary.

Q: Persistence. How about if you didn’t have a central portal and made the data available to individual partners. E.g., I’m surprised that Europeana’s data is not available through a data dump.
A: The license rights prevent us from providing the data dump. One interesting direction: move forward from the identifiers the institutions already have. Institutions usually have persistent identifiers, even though they’re particular to that institution. It’d be good to leverage them.
A: Europeana started before linked open data was prominent. Initially it was an attempt to build a very big silo. Now we try to link up with the LoD cloud. Perhaps we should be thinking of it as a cloud of distributed collections linked together by linked data.

Q: We provide bibliographic data to Europeana. I don’t see attribution as a barrier. We’d like some attribution of our contribution. As Europeana bundles it, how does that get maintained?
A: Europeana is structurally required to provide attribution of all the contributors in the chain.

Q: Attribution, even share-alike, can be very attractive for people providing data into the commons. Linux, Open Street Map, and Wikipedia all have share-alike.
A: The immediate question is whether non-commercial is allowed or not.

Q: Suppose a library wanted to make its metadata openly available?
A: SECAN.

Categories: culture, libraries Tagged with: dpla • library Date: May 17th, 2011 dw

1 Comment »

May 11, 2011

James Bridle – first Library Innovation Lab podcast

James Bridle is the interviewee in the first in a series of podcasts I’m doing for the Harvard Library Innovation Lab.

I met James at a conference in Israel a few weeks ago, and had the great pleasure of getting to hang out with him. He’s a British book-lover and provocateur, who expresses his deep insights through his wicked sense of humor.

Thanks to Daniel Dennis “Magnificent” Jones [twitter:blanket] for producing the series, doing the intros, choosing the music, writing the page…

Categories: culture, libraries Tagged with: books • library • library innovation lab • lil • podcast Date: May 11th, 2011 dw

7 Comments »

May 10, 2011

[berkman] Culturomics: Quantitative analysis of culture using millions of digitized books

Erez Lieberman Aiden and Jean-Baptiste Michel (both of Harvard, currently visiting faculty at Google) are giving a Berkman lunchtime talk about “culturomics”: the quantitative analysis of culture, in this case using the Google Books corpus of text.

The traditional library behavior is to read a few books very carefully, they say. That’s fine, but you’ll never get through the library that way. Or you could read all the books, very, very not carefully. That’s what they’re doing, with interesting results. For example, it seems that irregular verbs become regular over time. E.g., “shrank” will become “shrinked.” They can track these changes. They followed 177 irregular verbs, and found that 98 are still irregular. They built a table, looking at how rare the words are. “Regularization follows a simple trend: If a verb is 100 times less frequent, it regularizes 10 times as fast.” Plus you can make nice pictures of it:

Usage is indicated by font size, so that it’s harder for the more used words to get through to the regularized side.
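The quoted trend is consistent with a square-root law: a verb's rate of regularization scales as one over the square root of its frequency. That reading is an assumption drawn from the single figure quoted above, not a formula given in the talk, and the function name below is invented. A minimal sketch:

```python
def relative_regularization_rate(frequency_ratio):
    """How much faster a verb regularizes than a reference verb.

    `frequency_ratio` is the verb's frequency divided by the reference
    verb's frequency. Assumes rate is proportional to frequency ** -0.5.
    """
    if frequency_ratio <= 0:
        raise ValueError("frequency_ratio must be positive")
    return frequency_ratio ** -0.5


# A verb used 100 times less often than the reference verb:
print(relative_regularization_rate(1 / 100))
```

Under that assumption a verb 100 times rarer comes out about 10 times faster, matching the quote, and a verb 10,000 times rarer would come out about 100 times faster.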

The Google Books corpus of digitized text provides a practical way to be awesome. Erez and Jean-Baptiste got permission from Google to trawl through that corpus. (It is not public because of the fear of copyright lawsuits.) They produced the n-gram browser. They constructed a table of phrases, 2B lines long.

129M books have been published. 18M have been scanned. They’ve analysed 5M of them, creating a table with 2 billion rows. (In some cases, the metadata wasn’t good enough. In others, the scan wasn’t good enough.)
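To make the idea of a table of phrases concrete, here is a toy sketch that counts n-grams per publication year. The layout, with one count per (phrase, year) pair, is an assumption for illustration; the talk as reported here does not describe the real table's columns, and the two "books" are invented.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return its lowercase n-word phrases."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_table(books, max_n=2):
    """Count every 1..max_n-gram per year.

    `books` is an iterable of (year, text) pairs. Returns a Counter keyed
    by (phrase, year).
    """
    table = Counter()
    for year, text in books:
        for n in range(1, max_n + 1):
            table.update((phrase, year) for phrase in ngrams(text, n))
    return table


# Two invented one-line "books".
BOOKS = [
    (1900, "The plant throve"),
    (2000, "The plant thrived and thrived"),
]

TABLE = build_table(BOOKS)
print(TABLE[("throve", 1900)], TABLE[("thrived", 2000)])
```

Dividing each count by the total number of words published that year would turn the raw counts into frequencies that can be compared across years.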

They show some examples of the evolution of phrases, e.g. thrived vs. throve. As a control, they looked at 43 Heads of State and found that in the year they took power, usage of “head of state” zoomed (which confirmed that the n-gram tool was working).

They like irregular verbs in part because they work out well with the ngram viewer, and because there was an existing question about the correlation of irregular and high-frequency verbs. (It’d be harder to track the use of, say, tables. [Too bad! I’d be interested in that as a way of watching the development of the concept of information.]) Also, irregular verbs manifest a rule.

They talk about chode’s change to chided in just 200 yrs. The US is the leading exporter of irregular verbs: burnt and learnt have become regular faster than others, leading the British’s usage.

They also measure some vague ideas. For example, no one talked about 1950 until the late 1940s, and it really spiked in 1950. We talked about 1950 a lot more than we did, say, 1910. The fall-off rate indicates that “we lose interest in the past faster and faster in each passing year.” They can also measure how quickly inventions enter culture; that’s speeding up over time.

“How to get famous?” They looked at the 50 most famous people born in 1871, including Orville Wright, Ernest Rutherford, Marcel Proust. As soon as these names passed the initial threshold (getting mentioned in the corpus as frequently as the least-used words in the dictionary) their mentions rise quickly, and then slowly go down. The class of 1871 got famous at age 34; their fame doubled every four years; they peaked at 73, and then mentions go down. The class of 1921’s rise was faster, and they became famous before they became 30. If you want to become famous fast, you should become an actor (because they become famous in the mid to late 20s), or wait until your mid 30s and become a writer. Writers don’t peak as quickly. The best way to become famous is to become a politician, although you have to wait until you’re 50+. You should not become an artist, physicist, chemist or mathematician.
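To get a feel for "doubled every four years," here is a quick back-of-the-envelope calculation, assuming steady exponential growth. The talk as reported here doesn't say how long the doubling phase lasted, so the hypothetical function below only computes the growth factor over a span you choose.

```python
def fame_multiplier(years_elapsed, doubling_time=4):
    """Growth factor after `years_elapsed` years of steady doubling."""
    return 2 ** (years_elapsed / doubling_time)


# Twelve years of doubling every four years is three doublings.
print(fame_multiplier(12))
```

Three doublings give a factor of eight, so under this assumption a name would be mentioned eight times as often twelve years after crossing the threshold.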

They show the frequency charts for Marc Chagall, US vs. German. His German fame dipped to nothing during the Nazi regime, which suppressed him because he was a Jew. Likewise with Jesse Owens. Likewise with Russian and Chinese dissidents. Likewise for the Hollywood Ten during the Red Scare of the 1950s. [All of this of course equates fame with mentions in books.] They show how Elia Kazan and Albert Maltz’s fame took different paths after Kazan testified to a House committee investigating “Reds” and Maltz did not.

They took the Nazi blacklists (people whose works should be pulled out of libraries, etc.) and watched how they affected the mentions of people on them. Of course they went down during the Nazi years. But the names of Nazis went up 500%. (Philosophy and religion was suppressed 76%, the most of all.)

This led Erez and Jean-Baptiste to think that they ought to be able to detect suppression without knowing about it beforehand. E.g., Henri Matisse was suppressed during WWII.
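One simple way to flag a candidate for suppression is to compare how often a name appears during a period with how often it appears outside it. The scoring rule below is an assumption for illustration, not necessarily the speakers' method, and the yearly frequencies are invented.

```python
def suppression_ratio(freq_by_year, start, end):
    """Mean frequency inside [start, end] divided by mean frequency outside.

    Values well below 1 suggest the name was mentioned unusually rarely
    during the period; values well above 1 suggest the opposite.
    """
    inside = [f for y, f in freq_by_year.items() if start <= y <= end]
    outside = [f for y, f in freq_by_year.items() if not start <= y <= end]
    if not inside or not outside:
        raise ValueError("need data both inside and outside the period")
    mean_outside = sum(outside) / len(outside)
    if mean_outside == 0:
        raise ValueError("name never appears outside the period")
    return (sum(inside) / len(inside)) / mean_outside


# Invented mentions-per-million figures for one name.
FREQ = {1930: 10, 1931: 10, 1935: 1, 1940: 1, 1950: 10}

print(suppression_ratio(FREQ, 1933, 1945))
```

Running a score like this over every name in the corpus, then sorting, would surface dips without knowing in advance whom to look for, though a low score alone does not show why the mentions fell.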

They posted their ngrams viewer for public access. From the viewer you can see the actual scanned text. “This is the front end for a digital library.” They’re working with the Harvard Library [not our group!] on this. In the first day, over a million queries were run against it. They are giving “ngrammies” for the best queries: best vs. beft (due to a character recognition error); fortnight; think outside the box vs. incentivize vs. strategize; argh vs aargh vs argh vs aaaargh. [They quickly go through some other fun word analyses, but I can’t keep up.]

“Culturomics is the application of high throughput data collection and analysis to the study of culture.” Books are just the start. As more gets digitized, there will be more we can do. “We don’t have to wait for the copyright laws to change before we can use them.”

Q: Can you predict culture?
A: You should be able to make some sorts of predictions, but you have to be careful.

Q: Any examples of historians getting something wrong? [I think I missed the import of this]
A: Not much.

Q: Can you test the prediction ability with the presidential campaigns starting up?
A: Interesting.

Q: How about voice data? Music?
A: We’ve thought about it. It’d be a problem for copyright: if you transcribe a score, you have a copyright on it. This loads up the field with claimants. Also, it’s harder to detect single-note errors than single-letter errors.

Q: Do you have metadata to differentiate fiction from nonfiction, and genres?
A: Google has this metadata, but it comes from many providers and is full of conflicts. The ngram corpus is unclean. But the Harvard metadata is clean and we’re working with them.

Q: What are the IP implications?
A: There are many books Google cannot make available except through the ngram viewer. This gives digitizers a reason to digitize works they might otherwise leave alone.

Q: In China people use code words to talk about banned topics. This suppresses trending.
A: And that takes away some of the incentive to talk about it. It cuts off the feedback loop.

Q: [me] Is the corpus marked up with structural info that you can analyze against, e.g., subheadings, captions, tables, quotations?
A: We could but it’s a very hard problem. [Apparently the corpus is not marked up with this data already.]

Q: Might you be able to go from words to metatags: if you have cairo, sphinx, and egypt, you can induce “egypt.” This could have an effect on censorship since you can talk about someone without using her/his name.
A: The suppression of names may not be the complete suppression of mentions, yes. And, yes, that’s an important direction for us.

Categories: berkman, copyright, too big to know Tagged with: 2b2k • berkman • google • irregular verbs • library Date: May 10th, 2011 dw

2 Comments »

March 1, 2011

Digital Public Library of America

I’m at the first workshop of the Digital Public Library of America, which is studying how we might build such a thing. Fascinating meeting so far. But it’s under Chatham House rules, which means that there’s no attribution of ideas and quotes. So, I’m tweeting it without attributions. Hashtag: #dpla. John Palfrey is liveblogging it.

Categories: libraries, open access, too big to know Tagged with: dpla • library • open access Date: March 1st, 2011 dw

1 Comment »

January 26, 2011

McLuhan in his own voice

As a gift on the centenary of Marshall McLuhan’s birth, a site has gone up with videos of him explaining his famous sayings. Some of them still have me scratching my head, but other clips are just, well, startling. For example, this description of the future of books is from 1966.

Categories: libraries, media Tagged with: books • library • mcluhan Date: January 26th, 2011 dw

2 Comments »

July 12, 2008

Mr. Dewey, tear down that wall!

Tim Spalding, founder of the estimable LibraryThing, is calling on us all to create an open shelves classification project to replace Dewey and his pals. LibraryThing is a brilliant implementation of what a library built on a social network of readers can be, so I’m excited about Tim’s new idea.

[Tags: library taxonomies tim_spalding librarything everything_is_miscellaneous ]

Categories: Uncategorized Tagged with: everything_is_miscellaneous • library • librarything • taxonomies • tim_spalding • uncat Date: July 12th, 2008 dw

Be the first to comment »
