
May 17, 2011

[dpla] Amsterdam afternoon

I moderated a panel in the afternoon on open bibliographic data. I couldn’t also live blog it.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Paul Keller talks about Europeana’s way of handling public domain material. They have non-binding guidelines, explaining the legalities as well as setting a set of norms (“Be culturally aware,” etc.). Europeana lets you filter based on rights restrictions. He shows a public domain calculator that follows a complex decision chart to decide whether something is in the public domain, based on the copyright rules of thirty countries.
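The calculator is essentially a per-jurisdiction decision tree over the facts of a work. Here is a minimal sketch of that idea in Python; the jurisdictions, term lengths, and the single life-plus-N rule are illustrative stand-ins, not Europeana’s actual rules, which handle many more cases (anonymous works, wartime extensions, and so on).

```python
from datetime import date

# Illustrative post-mortem copyright terms (years after the author's death).
# Real per-jurisdiction rules are far more involved: anonymous works, wartime
# extensions, unpublished works, etc.
TERMS_AFTER_DEATH = {"NL": 70, "DE": 70, "FR": 70, "UK": 70}

def appears_public_domain(jurisdiction: str, author_death_year: int) -> bool:
    """Very rough check: has the post-mortem term expired in this jurisdiction?"""
    term = TERMS_AFTER_DEATH.get(jurisdiction)
    if term is None:
        raise ValueError(f"no rule table for {jurisdiction}")
    return date.today().year > author_death_year + term

# An author who died in 1930 is out of copyright under a life-plus-70 rule.
print(appears_public_domain("NL", 1930))  # True
```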

Q: Our biggest problem is having the providers give us the license data in the first place.
A: Europeana ingested rights info from the beginning (from the dc:rights field).

Q: What claims are Europeana making about what’s contributed to it? Are you assuming any liability? And are you asserting any moral rights?
A: Europeana doesn’t host the content, so it does not assert any rights. The public domain calculator does notice jurisdictions where moral rights are asserted; at the end of the process it warns you that there may be a claim of moral rights.

John Weise of U. of Michigan and Hathi Trust on “determining rights and opening access in Hathi Trust.” He manages the digital library production service at U. of Mich. Hathi Trust has 8.6M volumes, 2.2M in public domain, 4.7M book titles, and 210,000 serial titles. It has a steep and steady growth rate. They’ve had 5,000 rights holders agree to open up their works, and very, very few have registered take-down notices. They have 18 staff members reviewing books published between 1923 and 1963. They’ve reviewed 135K, and found half to be in the public domain. He urges libraries to make full use of Fair Use.

Hathi Trust is starting a project to identify orphaned works (in copyright but rights-holders can’t be reached). They are establishing best practices, and also trying to find the rights-holders for works published between 1923 and 1963.

Paola Mazzucchi from ARROWS Rights talks about ARROW. ARROW “is a comprehensive system for facilitating rights information management in any digitization program supporting the diligent search process” for the rights-holders of orphan works. To manage licenses, you have to manage rights. To manage rights, she says, you need to involve the entire value chain and to bridge all the gaps: cultural gaps among stakeholders, interoperability gaps, etc. “If you want digital libraries without black holes, you have to manage the rights info.”

Lucie Guibault says that the most important point is the “human factor.” Europe does not have a Fair Use exemption, so they’re looking to Scandinavia’s extended collective licenses. Such a license also covers works by rights-holders who are not members of the collective, so long as they can opt out. [I hope I got that right.] The toughest issue is getting the license accepted across borders.

Urs Gasser from the Berkman Center. Legal interoperability is important to libraries. The problem is not just copyright law, but also the private contractual agreements libraries enter into with content providers. Two important words: Transparency. Collaborative processes. He offers some observations. First, it’s important to look at history, but also not to learn the wrong lessons. Second, the participants in the DPLA have many different, conflicting interests. Finally, we need to be able to answer precisely the question about the value DPLA has brought, and we need to be communicating well, starting now.


[dpla] Europeana

About fifteen of us are meeting with Europeana in their headquarters in The Hague.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Harry Verwayen (business development director) gives us some background. Europeana started in 2005, in the wake of Google’s digitization of books. In 2008, the program itself began. It is a catalog that collects metadata, a small image, and a pointer. By 2011, they had 18,745,000 objects from thousands of partner institutions. It has been about getting knowledge into one place (Giovanni Pico della Mirandola). They believe all metadata should be widely and freely available for all use, and all public domain material should be freely available for all.

What are the value propositions for its constituencies? For end users, it’s a trusted source. For providers, it’s visibility; there is tension here because the providers want to measure visibility by hits on the portal, but Europeana wants to make the material available anywhere through linked open data. For policy makers, it’s inclusion. For the market, it’s growth. Four functions:

1. Aggregate: Council of content providers and aggregators. They want to always get more and better content. And they want to improve data quality.

2. Facilitate: Share knowledge. Strengthen advocacy of openness. Foster R&D.

3. Distribute: Making it available. Develop partnerships.

4. Engage: Virtual exhibits, social media; e.g., collect user-generated content about WWI.

From all knowledge in one place, to all knowledge everywhere.

Q: If you were starting out now, would you go down the same path?
A: It’s important to have a clear focus. E.g., the funding politicians like to have a single portal page, but don’t focus on that. You need to have one, but 80% of our visitors come in from Google. The chances that users will go to DPLA via the portal are small. You need it, but it shouldn’t be the focus of your efforts.

Q: What is your differentiator?
A: Secure material from institutions, and openness.

Q: What are your use cases?
A: It’s the only place where you can search across libraries and museums. We have been aggregating content. Things are now available without having to search thousands of sites.

Q: Next stage?
A: We’re flipping from supply to demand side. Make it available openly to see what other people can do with it. Right now the API is open to partners, but we plan on opening it up.

Q: How many users?
A: About 5M portal and API visitors last year.

Q: Your team?
A: Main office is 5M euros, 40 people [I think].

Q: What’s your brand?
A: You come here if you want to do some research on Modigliani and want to find materials across museums and libraries. It’s digitized cultural heritage. But that’s widely defined. We have archives of press photography, a large collection of advertising posters, etc. But we’re not about providing access to popular works, e.g., recent novels.

Q: Any partners see a pickup in traffic since joining Europeana?
A: Yes. Not earth-shaking but noticeable.

Q: What’s the biggest criticism?
A: Some partners feel that we’re pushing them into openness.

Q: What level of services? Just a catalog, or, e.g., creating your own viewers?
A: First, be a good catalog. Over the next five years, we’ll develop more. We do provide a search engine that you can use on your Web site.

Jan Molendijk talks on the tech/ops side. He says people see Europeana in many different ways: web portal, search engine, metadata repository, network organization, and “great fun.” The participating organizations love to work with Europeana.

The tech challenges: There are four domains (libraries, archives, museums, audiovisual), each with its own metadata standards. 26 languages. Distributed development. The metadata comes in its original languages. There’s too much to crowd-source. Also, there’s a difference between metadata search and full-text search, of course. We represent metadata as traditional docs and index them. The metadata fields allow greater precision. But full-text search engines expect docs to have thousands of words, while these metadata docs have dozens of words; the fewer the words, the less well the search engines work; e.g., a short doc has fewer matches and scores lower on relevancy. Also, with a small team, much of the work gets farmed out.
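One common way around the short-document problem is to lean on the structure of the metadata rather than on raw term statistics: weight matches by field. A toy sketch of that idea follows; the field names, boosts, and sample record are made up, not Europeana’s actual ranking.

```python
# Toy fielded scoring: with only dozens of words per record, term statistics
# carry little signal, so matches are weighted by metadata field instead.
# Field names, boosts, and the sample record are illustrative.
FIELD_BOOSTS = {"title": 3.0, "creator": 2.0, "subject": 1.5, "description": 1.0}

def score(record: dict, query_terms: set) -> float:
    total = 0.0
    for field, boost in FIELD_BOOSTS.items():
        words = set(record.get(field, "").lower().split())
        total += boost * len(words & query_terms)
    return total

record = {"title": "Portrait of a Lady", "creator": "Modigliani", "subject": "painting"}
print(score(record, {"modigliani", "portrait"}))  # 5.0: one title hit + one creator hit
```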

15% in French, 14% in German, 11% in English. On the distribution curve, the most-viewed objects account for less than 0.1% of views. Most get viewed once a year or less. Our distribution curve starts low and flattens slowly. A highly viewed object is viewed perhaps 1,500 times in a month, and it’s usually tied to a promotion.

Q: What type of group structures do you have? You could translate at that level and the rest would inherit.
A: We are not going to translate at the item level.

Q: Collection models?
A: Originally there wasn’t even nesting. Now we use EDM, which can arbitrarily connect pieces as extensions, but we’re not doing that yet.

Europeana is designed to be scalable and robust. All layers can be executed on separate machines, and on multiple machines. They have four portal servers, two Solr services, and two image servers. Solr is good at indexing and pointing to an object, but not good at serving the object itself.

They don’t host the content itself.

They use stateless protocols and very long URLs.

Data providers give them the original metadata plus a mapping file. They map to EDM. They have a staff of three that handles the data ingestion. The processes have to be lightweight and automated, but 40-50% of development time will still go to metadata input: ingestion, enrichment, harvesting.
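Conceptually, the mapping file says which provider field lands in which target field. A minimal sketch of applying such a mapping; the provider record and mapping format are hypothetical, and only the Dublin Core target names are real terms.

```python
# Hypothetical provider record and mapping file: each provider field is mapped
# onto a target field during ingestion.
provider_record = {"titel": "Portret van een dame", "maker": "Modigliani", "jaar": "1917"}

mapping = {            # provider field -> target field (illustrative)
    "titel": "dc:title",
    "maker": "dc:creator",
    "jaar": "dcterms:issued",
}

def apply_mapping(record: dict, mapping: dict) -> dict:
    """Return a new record keyed by the target fields defined in the mapping."""
    return {target: record[source] for source, target in mapping.items() if source in record}

print(apply_mapping(provider_record, mapping))
# {'dc:title': 'Portret van een dame', 'dc:creator': 'Modigliani', 'dcterms:issued': '1917'}
```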

They publish through their portal, linked open data, OAI-PMH, APIs, widgets, and apps.

Annette Friberg talks about aggregation projects. Europeana is pan-European and across domains. Europeana would like to work with all the content providers, but there are only 40 people on staff, so they instead work with a relatively small number of aggregators. Those represent thousands and thousands of content providers. They have a Council of Content Providers and Aggregators.

Q: What should we avoid?
A: The largest challenge is the role of the content providers.

Q: Does clicking on a listing always take you out to the owner’s site?
A: Yes, almost always. And that’s a problem for providing a consistent user experience.

Valentina talks about the ingestion inflow [link]. If you want to provide content, you can go to a form that asks some basic questions about copyright, topic, and a link to the object. It’s reviewed by the staff; they reject content that is not from a trustworthy source. Then you get a technical questionnaire: the quantity and type of materials, the format of the metadata, the frequency of updates, etc. They harvest metadata in the ESE format (Europeana Semantic Elements). They use OAI-PMH for harvesting. They enrich it with some data, do some quality checking, and upload it. They also cache the thumbnail. At the moment they are not doing incremental harvesting, so an update requires reimporting the entire collection, but they’re working on it.
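OAI-PMH itself is a simple HTTP protocol: a ListRecords request plus resumption tokens for paging. A bare-bones harvester might look like the sketch below; the endpoint URL is hypothetical, and a real Europeana ingest would ask for the provider’s ESE metadata prefix rather than the generic oai_dc used here.

```python
import requests
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    """Minimal OAI-PMH ListRecords harvest that follows resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(base_url, params=params, timeout=30).content)
        for record in root.iter(OAI_NS + "record"):
            yield record
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # A resumed request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Hypothetical endpoint:
# for rec in harvest("https://example.org/oai"):
#     print(rec.find(OAI_NS + "header/" + OAI_NS + "identifier").text)
```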

They have started requiring contributors to fill in a few fields of basic metadata, including the work’s title and a link to an image to be thumbnailed. But it’s still very minimal, in order to lower the hurdle.

Q: [me] In the US, it would be flooded with bogus institutions eager to have their work displayed: porn, racist and extremist groups, etc.
A: We check to see if it’s legit. Is it a member of professional orgs? What do their peers say? We make a decision.


[dpla] Amsterdam, Monday morning session

John Palfrey: The DPLA is ambitious and in the early stages. We are just getting our ideas and our team together. We are here to listen. And we aspire to connect across the ocean. In the U.S. we haven’t coordinated our metadata efforts well enough.


One of the core principles is interoperability across systems and nations. It also means interoperability at the human and institutional layers. “We should start with the presumption of a high level of interoperability.” We should start with that as a premise “in our DNA.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Dan Brickley is asked to give us an on-the-spot, impromptu history of linked data. He begins with a diagram from Tim Berners-Lee (w3c.org/history/1989) that showed the utility of a cloud of linked documents and things. [It is the typed links of Enquire blown out to a web of info.] At an early Web conference in 1994, TBL suggested a dynamic of linked documents and of linked things. One could then ask questions of this network: What systems depend on this device? Where is the doc being used? RDF (1997) lets you answer such questions. It grew out of PICS, an early attempt to classify and rate Web objects. Research funding arrived around 2000. TBL introduced the Semantic Web. Conferences and journals emerged, frustrating hackers who thought RDF was about solving problems. The Semantic Web people seemed to like complex “knowledge representation” systems. The RDF folks were more like “Just put the data on the Web.”


For example, FOAF (friend of a friend) identified people by pointing to various aspects of the person. TBL in 2005 critiqued that, saying that it should instead point to URIs. So, to refer to a person, you’d point to a URI for info that talks about them. Librarians were used to using URLs as pointers, not information. TBL further said that the URI should point to more URIs, e.g., the URI for the school that the person went to. TBL’s four rules: 1. Use URIs as names for things. 2. Make sure HTTP can fetch them. 3. Make sure what you fetch is machine-friendly. 4. Make sure the links use URIs. This spreads the work of describing a resource around the Web.
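In practice, following those four rules means you can hand a URI to an RDF library, let it fetch a machine-readable description, and then walk the outbound links. A small illustration using Python’s rdflib; the DBpedia URI is just a convenient, commonly cited example, and what it returns depends on the remote service.

```python
from rdflib import Graph, URIRef

# Rule 1: the thing has an HTTP URI.
person = URIRef("http://dbpedia.org/resource/Tim_Berners-Lee")

# Rules 2 and 3: fetching the URI should return machine-readable RDF.
g = Graph()
g.parse(str(person))  # rdflib content-negotiates for an RDF serialization

# Rule 4: the description links out to further URIs, which can be fetched in turn.
for _, pred, obj in g.triples((person, None, None)):
    if isinstance(obj, URIRef):
        print(pred, "->", obj)
```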


Linked Data often takes a database-centric view of the world; building useful databases out of swarms of linked data.


Q: [me] What about ontologies?
A: When RDF began, an RDF schema defined the pieces and their relationships. OWL and ontologies let you make some additional useful restrictions. Linked data people tend to care about particularities. So, how do you get interoperability? You can do it. But the machine stuff isn’t subtle enough to be able to solve all these complex problems.

Europeana

Paul Keller says that copyright is supposed to protect works, but not the data they express. Cultural heritage orgs generally don’t have copyright on their material, but they insist on copyrighting the metadata they’ve generated. Paul is encouraging them to release their metadata into the public domain. The orgs are all about minimizing risk. Paul thinks the risks are not the point. They ought to just go ahead and establish themselves as the preservers and sources of historical content. But the boards tend to be conservative and risk-averse.


Q: US law allows copyright of the arrangement of public domain content. And do any of the collecting societies assert copyright?
A: The OCLC operates the same way in Europe. There’s a proposed agreement that would authorize the aggregators to provide their aggregations under a CC0 public domain license.


Q: Some organizations limit images to low resolution to avoid copyright issues. Can you do the same for data?
A: A high-res description has lots of information about how it derived the info.


Antoine Isaac (Vrije Universiteit Amsterdam) has worked on the data model for Europeana. ESE (Europeana Semantic Elements) is like a Dublin Core for objects: a lowest common denominator. They are looking at a richer model, the Europeana Data Model (EDM). Problems: ingesting references to digitized material, ingesting descriptive metadata from many institutions, and building generic services to enhance access to objects.


Fine-grained data: Merging multiple records can lead to self-contradiction. You have to remember which data came from which source. You must support objects that are composed of other objects, and support contextual resources (e.g., descriptions of persons, objects, etc.), including concepts, at various levels of detail.


Europeana is aiming at interoperability through links (connecting resources), through semantics (complex data semantically interoperable with simpler objects), and through re-use of vocabularies (e.g., OAI-ORE, Dublin Core, SKOS, etc.). They create a proxy object for the actual object, so they don’t have to mix their data with the data that the provider is providing. (Antoine stresses that the work on the data model has been highly collaborative.)
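The proxy mechanism can be pictured as a few extra RDF resources sitting between the provider’s description and the object itself. A rough sketch using rdflib; the class and property names follow the published EDM and OAI-ORE vocabularies, but the URIs and title are invented and this is not Europeana’s actual code.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
cho = URIRef("http://example.org/object/123")               # the provided object (invented URI)
aggregation = URIRef("http://example.org/aggregation/123")   # groups the object and its views
provider_proxy = URIRef("http://example.org/proxy/provider/123")

# The proxy carries the provider's statements about the object, so Europeana's
# own enrichments can sit on a separate proxy without mixing the two.
g.add((cho, RDF.type, EDM.ProvidedCHO))
g.add((aggregation, RDF.type, ORE.Aggregation))
g.add((aggregation, EDM.aggregatedCHO, cho))
g.add((provider_proxy, RDF.type, ORE.Proxy))
g.add((provider_proxy, ORE.proxyFor, cho))
g.add((provider_proxy, ORE.proxyIn, aggregation))
g.add((provider_proxy, DC.title, Literal("Portrait of a Lady")))

print(g.serialize(format="turtle"))
```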


Q: Do we end up with what we have when looking up flight info? Or can we have a single search?
A: Most importantly, we’re working on the back end, not yet on the front end.


Q: Will you provide resolution services, providing all the identifiers that might go with an object?
A: Yes.


Stefan Gradmann also points to the TBL diagram with typed links. Linked Data extends this in type (RDF) and scope. RDF triples (subject-predicate-object). He refers to TBL’s four rules. Stefan says we may be at the point of having too many triples. The LinkingOpenData group wants to build a data commons. (See Tom Heath and Chris Bizer.) It is currently discussing how to switch from volume aggregation to quality. Quality is about “matching, mapping, and referring things to each other.”


The LOD project is different. It’s a large-scale integration project, running through Aug 2014. It’s building technology around the cloud of linked open data. It includes the Comprehensive Knowledge Archive Network (CKAN) and DBpedia extraction from Wikipedia.


Would linked data work if it were not open? Technically, it’s feasible. But it’s very expensive, since you have to authorize the de-referencing of URIs. Or you could do it behind a proxy, so you use the work of others but do not contribute. Europeana is going for openness, under CC0: http://bit.ly/fe637P You cannot control how open data is used, you can’t make money from it, and you need attractive services to be built on top of it, including commercial services. Europeana does not exclude commercial reuse of linked open data. Finally, we need to be able to articulate what the value of this linked data is.


Q: How do we keep links from rotting?
A: The Web doesn’t understand versioning. One option is to use the ORE resource maps, versioning aggregations.


Q: Some curators do not want to make sketchy metadata public.
A: The metadata ought to state that it is sketchy, and ask the user to improve it. We need to track the meta-metadata.


Stefan: We only provide top-level classifications and encourage providers to add the more fine-grained.


Q: How do we establish the links among the bubbles? Most are linked to DBpedia, not to one another?
A: You can link on schema or instance level. The work doesn’t have to be done solely by Europeana.


Q: The World Intellectual Property Organization is meeting in the fall. A library federation is proposing an ambitious international policy on copyright. Perhaps there should be a declaration of a right to open metadata.
A: There are database rights in Europe, but generally not outside of it. CC0 would normalize the situation. We think you don’t have to require attribution and provenance because norms will handle that, and requiring it would slow development.

Q: You are not specifying below a high level of classification. Does that then fragment the data?
A: We allow our partners to come together with shared profiles. And, yes, we get some fragmentation. Or, we get diversity that corresponds to diversity in the real world. We can share contextualization policies: which vocabularies are our primary choices when contextualizing, e.g., we use VIAF rather than FOAF when contextualizing a person. It’s sort of a folksonomic process: a contributor will see that others have used a particular vocabulary.


Q: Persistence. How about if you didn’t have a central portal and made the data available to individual partners? E.g., I’m surprised that Europeana’s data is not available through a data dump.
A: The license rights prevent us from providing the data dump. One interesting direction: move forward from the identifiers the institutions already have. Institutions usually have persistent identifiers, even though they’re particular to that institution. It’d be good to leverage them.
A: Europeana started before linked open data was prominent. Initially it was an attempt to build a very big silo. Now we try to link up with the LoD cloud. Perhaps we should be thinking of it as a cloud of distributed collections linked together by linked data.


Q: We provide bibliographic data to Europeana. I don’t see attribution as a barrier. We’d like to see some attribution of our contribution. As Europeana bundles it, how does that get maintained?
A: Europeana is structurally required to provide attribution of all the contributors in the chain.


Q: Attribution, or even share-alike, can be very attractive for people providing data into the commons. Linux, Open Street Map, and Wikipedia all have share-alike.
A: The immediate question is whether non-commercial restrictions are allowed or not.


Q: Suppose a library wanted to make its metadata openly available?
A: SECAN.


May 16, 2011

Ethan on serendipity and cosmopolitanism

Ethan Zuckerman blogs the brilliant and delightful “extended dance mix” of his talk on serendipity at CHI 2011.


He begins by wondering why people migrate to cities, even when those cities have been vastly unappealing, as per the stink of London in the mid 19th century. “You came to the city to become a cosmopolitan, a citizen of the world.” You may still have encountered only a tiny stretch of humanity that way, but you’d at least be in a position to receive information about the rest of the world. “To the extent that a city is a communications technology, it may not be a surprise that early literary portrayals of the internet seized on the city as a metaphor.”


Ethan wonders if cities actually do work as “serendipity engines,” as we hope they do. Nathan Eagle “estimates that he can predict the location of ‘low-entropy individuals’ with 90-95% accuracy” based on aggregated mobile phone records. [Marta C. Gonzalez, Cesar A. Hidalgo & Albert-Laszlo Barabasi recently made a related claim in Nature.] We are not as mobile as we think, and our patterns are more routinized than we’d like to believe. Even in cities we manage to mainly hang out with people like ourselves.


Likewise on the Net, Ethan says. He’s analyzed the media preferences of 33 nations, and found that countries that have 40+ million Net users tend to strongly prefer local news sources. The result is “we miss important stories.” Even if we are well-plugged in to a social network, we’re not going to learn about that which our friends do not know. Ethan reminds us that we need to worry about “filter bubbles,” as Eli Pariser calls them. While social filters are powerful, if they only filter your own network, they are likely to hide more than they show.


Against this Ethan recommends serendipity, which requires “an open and prepared mind.” We should learn from cities when designing Web spaces that enable and encourage serendipity. “What makes cities livable, creative, vital, and ultimately, safe is the street-level random encounter that [Jane] Jacobs documented in her corner of Greenwich Village.” Design to “minimize isolation.”


Ethan then talks about some of the ways we get guided serendipity in cities — friends showing you around, local favorites, treating a city like a board game via geocaching, etc. As always, Ethan has some amazing examples. (He even points to the Library Innovation Lab‘s ShelfLife project, where I work; I promise I didn’t realize that until I’d already started blogging about his post.)


I’d started blogging about Ethan’s post because I love what he says even though I have a knee-jerk negative reaction to much of what people say about serendipity on the Net. Ethan is different. His post represents a full-bodied conceptualization. I read it and I nod, smile at the next insight, then nod again. So, what follows is not a commentary on Ethan’s post. It’s actually all about my normal knee-jerk reaction. (Oh, bloggers, what _isn’t_ all about you?) I’m trying to understand why serendipity doesn’t square with the hole in my own personal pegboard.


Perhaps the problem is that I think of serendipity as a sub-class of distraction: Serendipity occurs when something that hijacks our attention (= a distraction) is worthwhile in some sense. We now have social networks that are superb at sharing serendipitous findings. So, why don’t we pass around more stuff that would make us more cosmopolitan? Fundamentally, I think it’s because interest is a peculiar beast. We generally don’t find something interesting unless it helps us understand what we already care about. But the Other — the foreign — is pretty much defined as that to which we see no connection. It is Other because it does not matter to us. Or, more exactly, we cannot see why or how it matters.

Things can matter to us in all sorts of ways, from casting a contrasting intellectual light on our everyday assumptions to opening up sluices of tears or laughter. But cosmopolitanism requires some level of understanding since it is (as I understand it) an appreciation of differences. That is, we can (and should) be filled with sorrow when we see a hauntingly disturbing photo of a suffering human in a culture about which we know nothing; that’s a connection based on the fundament of shared humanity, but it’s not yet cosmopolitanism. For that, we also have to appreciate the differences among us. Of course, appreciating differences also means finding the similarities. It is a dialectic for sure, and one so very easy to get wrong and impossible to get perfectly right: We misunderstand the Other by interpreting it too much in our own terms, or we write it off because it is so outside our own terms. Understanding always proceeds from a basic situatedness from which we make sense of our world, so cosmopolitan understanding is always going to be a difficult, imperfect dance of incorporating into the familiar that which is outside our usual ken.

This is why I don’t frame the failure of cosmopolitanism primarily in terms of serendipity. Serendipity sometimes — not in Ethan’s case — is proposed as a solution as if we can take our interest in the Other for granted: Just sneak some interesting African videos into our usual stream of youtubes of cute cats and people falling off of trampolines, and we will become more cosmopolitan. But, of course we will fast forward over those African videos as quickly as we used to turn the pages in newspapers that reported on Africa. The problem isn’t serendipity. It’s that we don’t care.

But, we can be brought to care. We know this because there are lots of examples (and Ethan recounts just a handful of the trove at his command) of our attention being arrested by cosmopolitan content. To generalize with a breadth that is sure to render the generalization vapid, cosmopolitan content that works — that gets us interested in something we hadn’t realized we cared about — seems to have two elements. First, it tells us what we need to know in order to let the otherness matter to us. Second, it is really well done. Both of these are difficult, and there is not a known formula for either of them. But there are also lots of known ways to try; Ethan gives us bunches of examples. Creating cosmopolitan content that works requires craft and, if it is to be transformative, art. It can range from the occasional Hollywood movie, to New Yorker articles, to blog posts, to Anthony Bourdain, to Ethan Zuckerman. Content that creates interest in itself may be extraordinarily difficult to craft, but it is a precursor to the possibility of serendipity.

Take the wildly successful TED Talks as an example. They satisfy a need the “market” didn’t know it had, and if asked would probably deny: “Hey, do you have a burning interest in questioning the assumptions of bio-engineering?” TED Talks ripple through the social networks of serendipity because they create interest where formerly there wasn’t any. That’s how social serendipity works: It begins with works that through skill, craft, and art generate their own motive power. TED shows us that if we are trying to remedy the dearth of intellectually stimulating materials passing through social networks, we should worry first about creating materials that compel interest. Compelling materials create social serendipity. And the corollary: Nothing is interesting to us until it makes itself interesting to us.

But perhaps it simply comes down to this. Perhaps I don’t frame the failure of cosmopolitanism primarily as a problem with the lack of serendipity because I personally approach the world as a writer, and thus focus on the challenge of generating interest among readers. When I see people passing over a topic, I think, “Oh, it must not have been written well enough.” And on that idiosyncratic worldview, I would not seriously base an analysis of a topic as vast and important as the one that Ethan Zuckerman continues to address so profoundly.


May 13, 2011

Berkman Buzz

This week’s Berkman Buzz:

  • Wendy Seltzer [twitter:wseltzer] inspects son-of-COICA:
    link

  • OpenNet Initiative reports on the Syrian Electronic Army, and Facebook:
    link

  • Media Cloud investigates Russian blogs, media and agenda-setting:
    link

  • Ethan Zuckerman [twitter:ethanz] keynotes CHI 2011 — parataxis, cities, serendipity, design:
    link

  • David Weinberger discusses e-books and much more with James Bridle:
    link

  • Citizen Media Law Project [twitter:citmedialaw] introduces us to the OpenCourt project:
    link

  • Weekly Global Voices [twitter:globalvoices] : “Uganda: Museveni’s Swearing in Overshadowed by Rival’s Return”
    link


May 11, 2011

James Bridle – first Library Innovation Lab podcast

James Bridle is the interviewee in the first in a series of podcasts I’m doing for the Harvard Library Innovation Lab.

I met James at a conference in Israel a few weeks ago, and had the great pleasure of getting to hang out with him. He’s a British book-lover and provocateur, who expresses his deep insights through his wicked sense of humor.

Thanks to Daniel Dennis “Magnificent” Jones [twitter:blanket] for producing the series, doing the intros, choosing the music, writing the page…


May 10, 2011

[berkman] Culturomics: Quantitative analysis of culture using millions of digitized books

Erez Lieberman Aiden and Jean-Baptiste Michel (both of Harvard, currently visiting faculty at Google) are giving a Berkman lunchtime talk about “culturomics“: the quantitative analysis of culture, in this case using the Google Books corpus of text.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

The traditional library behavior is to read a few books very carefully, they say. That’s fine, but you’ll never get through the library that way. Or you could read all the books, very, very not carefully. That’s what they’re doing, with interesting results. For example, it seems that irregular verbs become regular over time. E.g., “shrank” will become “shrinked.” They can track these changes. They followed 177 irregular verbs, and found that 98 are still irregular. They built a table, looking at how rare the words are. “Regularization follows a simple trend: If a verb is 100 times less frequent, it regularizes 10 times as fast.” Plus you can make nice pictures of it:


Usage is indicated by font size, so that it’s harder for the more used words to get through to the regularized side.
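That “simple trend” is a square-root relationship: regularization speed scales roughly as one over the square root of frequency. A two-line check of the arithmetic (just restating the talk’s rule of thumb, not their actual model):

```python
import math

# Rule of thumb from the talk: regularization speed scales as 1/sqrt(frequency),
# so a verb that is 100x less frequent regularizes about 10x as fast.
def speedup(times_less_frequent: float) -> float:
    return math.sqrt(times_less_frequent)

for factor in (100, 10_000):
    print(f"{factor}x rarer -> regularizes ~{speedup(factor):.0f}x as fast")
```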


The Google Books corpus of digitized text provides a practical way to be awesome. Erez and Jean-Baptiste got permission from Google to trawl through that corpus. (It is not public because of the fear of copyright lawsuits.) They produced the n-gram browser. They constructed a table of phrases, 2B lines long.


129M books have been published. 18M have been scanned. They’ve analysed 5M of them, creating a table with 2 billion rows. (In some cases, the metadata wasn’t good enough. In others, the scan wasn’t good enough.)

They show some examples of the evolution of phrases, e.g., thrived vs. throve. As a control, they looked at 43 Heads of State and found that in the year they took power, usage of “head of state” zoomed (which confirmed that the n-gram tool was working).
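Under the hood, a query like “thrived vs. throve” is just a lookup in a year-by-phrase count table, normalized by the total counts for each year. A toy version with invented numbers, only to show the shape of the computation:

```python
from collections import Counter

# Invented year-by-year 1-gram counts, only to illustrate the computation behind
# a "thrived vs. throve" plot: count of the phrase / total counts for that year.
counts = {
    1820: Counter({"throve": 120, "thrived": 40, "the": 900_000}),
    1900: Counter({"throve": 90, "thrived": 110, "the": 2_000_000}),
    1980: Counter({"throve": 15, "thrived": 300, "the": 5_000_000}),
}

def relative_frequency(word: str, year: int) -> float:
    return counts[year][word] / sum(counts[year].values())

for year in sorted(counts):
    print(year,
          f"throve={relative_frequency('throve', year):.2e}",
          f"thrived={relative_frequency('thrived', year):.2e}")
```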


They like irregular verbs in part because they work out well with the ngram viewer, and because there was an existing question about the correlation of irregular and high-frequency verbs. (It’d be harder to track the use of, say, tables. [Too bad! I'd be interested in that as a way of watching the development of the concept of information.]) Also, irregular verbs manifest a rule.


They talk about the change from “chode” to “chided” in just 200 years. The US is the leading exporter of irregular verbs: burnt and learnt have become regular faster than others, leading British usage.


They also measure some vague ideas. For example, no one talked about 1950 until the late 1940s, and it really spiked in 1950. We talked about 1950 a lot more than we did, say, 1910. The fall-off rate indicates that “we lose interest in the past faster and faster in each passing year.” They can also measure how quickly inventions enter culture; that’s speeding up over time.


“How to get famous?” They looked at the 50 most famous people born in 1871, including Orville Wright, Ernest Rutherford, and Marcel Proust. As soon as these names passed the initial threshold (getting mentioned in the corpus as frequently as the least-used words in the dictionary), their mentions rise quickly, and then slowly go down. The class of 1871 got famous at age 34; their fame doubled every four years; they peaked at 73, and then mentions go down. The class of 1921’s rise was faster, and they became famous before they turned 30. If you want to become famous fast, you should become an actor (because they become famous in their mid-to-late 20s), or wait until your mid 30s and become a writer. Writers don’t peak as quickly. The best way to become famous is to become a politician, although you have to wait until you’re 50+. You should not become an artist, physicist, chemist, or mathematician.


They show the frequency charts for Marc Chagall, US vs. German. His German fame dipped to nothing under the Nazi regime, which suppressed him because he was a Jew. Likewise with Jesse Owens. Likewise with Russian and Chinese dissidents. Likewise for the Hollywood Ten during the Red Scare of the 1950s. [All of this of course equates fame with mentions in books.] They show how Elia Kazan’s and Albert Maltz’s fame took different paths after Kazan testified to a House committee investigating “Reds” and Maltz did not.


They took the Nazi blacklists (people whose works should be pulled out of libraries, etc.) and watched how they affected the mentions of people on them. Of course they went down during the Nazi years. But the names of Nazis went up 500%. (Philosophy and religion was suppressed 76%, the most of all.)


This led Erez and Jean-Baptiste to think that they ought to be able to detect suppression without knowing about it beforehand. E.g., Henri Matisse was suppressed during WWII.


They posted their ngrams viewer for public access. From the viewer you can see the actual scanned text. “This is the front end for a digital library.” They’re working with the Harvard Library [not our group!] on this. In the first day, over a million queries were run against it. They are giving “ngrammies” for the best queries: best vs. beft (due to a character recognition error); fortnight; think outside the box vs. incentivize vs. strategize; argh vs aargh vs argh vs aaaargh. [They quickly go through some other fun word analyses, but I can’t keep up.]


“Culturomics is the application of high throughput data collection and analysis to the study of culture.” Books are just the start. As more gets digitized, there will be more we can do. “We don’t have to wait for the copyright laws to change before we can use them.”


Q: Can you predict culture?
A: You should be able to make some sorts of predictions, but you have to be careful.


Q: Any examples of historians getting something wrong? [I think I missed the import of this]
A: Not much.


Q: Can you test the prediction ability with the presidential campaigns starting up?
A: Interesting.


Q: How about voice data? Music?
A: We’ve thought about it. It’d be a problem for copyright: if you transcribe a score, you have a copyright on it. This loads up the field with claimants. Also, it’s harder to detect single-note errors than single-letter errors.


Q: Do you have metadata to differentiate fiction from nonfiction, and genres?
A: Google has this metadata, but it comes from many providers and is full of conflicts. The ngram corpus is unclean. But the Harvard metadata is clean and we’re working with them.


Q: What are the IP implications?
A: There are many books Google cannot make available except through the ngram viewer. This gives digitizers a reason to digitize works they might otherwise leave alone.


Q: In China people use code words to talk about banned topics. This suppresses trending.
A: And that takes away some of the incentive to talk about it. It cuts off the feedback loop.


Q: [me] Is the corpus marked up with structural info that you can analyze against, e.g., subheadings, captions, tables, quotations?
A: We could but it’s a very hard problem. [Apparently the corpus is not marked up with this data already.]

Q: Might you be able to go from words to metatags: if you have cairo, sphinx, and egypt, you can induce “egypt.” This could have an effect on censorship since you can talk about someone without using her/his name.
A: The suppression of names may not be the complete suppression of mentions, yes. And, yes, that’s an important direction for us.


May 7, 2011

World War II as a camping trip.

I’ve been re-reading a 1944 collection of amusing anecdotes assembled by Bennett Cerf, called Try and Stop Me. I’d read it as a child (I was born in 1950), and the celebrities in it belonged to my parents’ world — people like Herbert Bayard Swope, Alexander Woollcott, and Monty Woolley. Most of those names, huge in the 1930s, are completely unknown to the current generation, of course. Indeed, many are on the fringes of my own consciousness, or are beyond my recall entirely.

I’m finding it fascinating. Cerf was a television celebrity in the 1950s and 1960s, always with an amusing story. We are even on the verge of losing the word so often used to describe him: a raconteur. The anecdotes in Try and Stop Me concern authors, playwrights, poets, intellectuals, and actors. You do come away thinking that celebrity has taken a long walk downhill since then.

The attitudes and values the anecdotes betray are sometimes quite surprising. But here’s one that really floored me (which I’m presenting unedited):

Astute diagnosing by John Gunther [an important, popular historian] in his latest book, D Day: “The worst thing about war is that so many men like it … It relieves them of personal responsibilities…There is no worry about frictions at home or the dull necessity of earning a living. Military life is like a perpetual camping trip. I heard one officer say, ‘How nice all this would be if only you could eliminate the bloodshed and the killing.’” “Perhaps,” adds Orville Prescott [NY Times book critic], “peace planners who debate problems of frontiers and economics had better give a little more attention to eliminating the pleasures of soldierly comradeship and vast cooperative endeavor, the drama and excitement and the fun of war also.”

Can you imagine an historian saying the same thing about, say, the Afghanistan War, or Vietnam, for that matter? Did the American people really know so little about the horrors of WWII that they could believe that it was “like a perpetual camping trip” and oh so much fun? This seems to me to be beyond propaganda, but maybe I’m just underestimating how much propaganda can get away with.


(Tennozan: The Battle of Okinawa and the Atomic Bomb by George Feifer is a horrifying oral history of that particular “camping trip.”)


May 6, 2011

News is a wave

By coincidence, here are two related posts.

Gilad and Devin at Social Flow track the enormous kinetic energy of a single twitterer who figured out shortly before President Obama’s announcement that Osama Bin Laden had been killed. But my way of putting this — kinetic energy — is entirely wrong, since it was the energy stored within the Net that propelled that single tweet, from a person with about a thousand followers, across the webiverse. And the energy stored within the Net is actually the power of interest, the power of what we care about.

Meanwhile, The Berkman Center today announced the public availability of Media Cloud, a project Ethan Zuckerman and Hal Roberts led. Ethan explains it in a blog post that begins:

Today, the Berkman Center is relaunching Media Cloud, a platform designed to let scholars, journalists and anyone interested in the world of media ask and answer quantitative questions about media attention. For more than a year, we’ve been collecting roughly 50,000 English-language stories a day from 17,000 media sources, including major mainstream media outlets, left and right-leaning American political blogs, as well as from 1000 popular general interest blogs. (For much more about what Media Cloud does and how it does it, please see this post on the system from our lead architect, Hal Roberts.)

We’ve used what we’ve discovered from this data to analyze the differences in coverage of international crises in professional and citizen media and to study the rapid shifts in media attention that have accompanied the flood of breaking news that’s characterized early 2011. In the next weeks, we’ll be publishing some new research that uses Media Cloud to help us understand the structure of professional and citizen media in Russia and in Egypt.

Now Media Cloud is going to be a very useful tool. And it was not trivial to build. Congratulations to the team. And thank you.


Acting Shakespeare

I’ve been reading John Barton’s book Playing Shakespeare, which pretty much transcribes a series of televised master classes. It’s a pretty amazing book, in which Barton claims that Shakespeare’s lines provide clues to how they should be read — the irregular stresses in the verse, the changes from prose to verse and back again.

I googled around trying to find the original series, but found these instead. Here are two ten-minute segments. Spanning the two is David Suchet reading Sonnet 138 several times, receiving direction. (I think I personally prefer his second reading.)



