Joho the Blog » crowdsourcing

September 24, 2013

[berkman][misc] Curated by the crowd

I’m at a Berkman lunchtime talk on crowdsourcing curation. Jeffrey Schnapp, Matthew Battles [twitter:matthewBattles], and Pablo Barria Urenda are leading the discussion. They’re from the Harvard metaLab.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Matthew Battles begins by inviting us all to visit the Harvard center for Renaissance studies in Florence, Italy. [Don't toy with us, Matthew!] There’s a collection there, curated by Bernard Berenson, of 16,000 photos documenting art that can’t be located, which Berenson called “Homeless Paintings of the Italian Renaissance.” A few years ago, Mellon sponsored the digitization of this collection, to be made openly available. One young man, Chris Daley [sp?] has since found about 120 of the works. [This is blogged at the metaLab site.]

These 16,000 images are available at Harvard’s VIA image manager [I think]. VIA is showing its age. It doesn’t support annotation, etc. There are some cultural crowdsourcing projects already underway, e.g., Zooniverse’s Ancient Lives project for transcribing ancient manuscripts. metaLab is building a different platform: Curarium.com.

Matthew hands off to Jeffrey Schnapp. He says Curarium will allow a diverse set of communities (archivist, librarian, educator, the public, etc.) to animate digital collections by providing tools for doing a multiplicity of things with those collections. We’re good at making collections, he says, but not as good at making those collections matter. Curarium should help take advantage of the expertise of distributed communities.

What sort of things will Curarium allow us to do? (A beta should be up in about a month.) Add metadata, add meaning to items…but also work with collections as aggregates. VIA doesn’t show relations among items. Curarium wants to make collections visible and usable at the macro and micro levels, and to tell stories (“spotlights”).

Jeffrey hands off to Pablo, who walks us through the wireframes. Curarium will ingest records and make them interoperable. They take in records in JSON format and extract the metadata they want. (They save the originals.) They’re working on how to give an overview of the collection; “When you have 11,000 records, thumbnails don’t help.” So, you’ll see a description and visualizations of the cloud of topic tags and items. (The “Homeless” collection has 2,000 tags.)
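[A rough sketch of what "take in JSON, extract the metadata you want, keep the original" could look like. This is my own illustration, not Curarium's code: the field names, the sample record, and the `ingest` function are all made up for the example.]

```python
import json

# Hypothetical field names; the talk did not say which metadata gets extracted.
WANTED_FIELDS = ("title", "creator", "tags")

def ingest(raw_json: str) -> dict:
    """Parse one JSON record, keep the chosen fields, and retain the original text."""
    record = json.loads(raw_json)
    # Missing fields come back as None rather than raising an error.
    extracted = {field: record.get(field) for field in WANTED_FIELDS}
    return {"metadata": extracted, "original": raw_json}

# An invented sample record, for illustration only.
sample = '{"title": "Madonna and Child", "creator": "Unknown", "tags": ["madonna"], "shelfmark": "X-12"}'
item = ingest(sample)
print(item["metadata"])
```

[Keeping the untouched source string alongside the extracted fields means the extraction rules can change later without going back to the contributing institution.]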

At the item level, you can annotate, create displays of selected content (“‘Spotlights’ are selections of records organized as thematized content”) in various formats (e.g., slideshow, more academic style, etc.). There will be a rich way of navigating and visualizing. There will be tools for the public, researchers, and teachers.

Q&A

Q: [me] How will you make the enhanced value available outside of Curarium? And, have you considered using Linked Data?

A: We’re looking into access. The data we have is coming from other places that have their own APIs, but we’re interested in this.

Q: You could take the Amazon route by having your own system use APIs, and then make those APIs open.

Q: How important is the community building? E.g., Zooniverse succeeds because people have incentives to participate.

A: Community-building is hugely important to us. We’ll be focusing on that over the next few months as we talk with people about what they want from this.

A: We want to expand the scope of conversation around cultural history. We’re just beginning. We’d love teachers in various areas — everything from art history to history of materials — to start experimenting with it as a teaching tool.

Q: The spotlight concept is powerful. Can it be used to tell the story of an individual object? E.g., suppose an object has been used in 200 different spotlights, and there might be a story in this fact.

A: Great question. Some of the richness of the prospect is perhaps addressed by expectations we have for managing spotlights in the context of classrooms or networked teaching.

Q: To what extent are you thinking differently than a standard visual library?

A: On the design side, what’s crucial about our approach is the provision for a wide variety of activities, within the platform itself: curate, annotate, tell a story, present it… It’s a CMS or blogging platform as well. The annotation process includes bringing in content from outside of the environment. It’s a porous platform.

Q: To what extent can users suggest changes to the data model? E.g., Europeana has a very rigid data model.

A: We’d like a significant user contribution to metadata. [Linked Data!]

Q: Are we headed for a bifurcation of knowledge? Dedicated experts and episodic amateurs. Will there be a curator of curation? Am I unduly pessimistic?

A: I don’t know. If we can develop a system, maybe with Linked Data, we can have a more self-organizing space that is somewhere in between harmony and chaos. E.g., Wikimedia Loves Monuments is a wonderful crowd curatorial project.

Q: Is there anything this won’t do? What’s out of scope?

A: We’re not providing tools for creating animated gifs. We don’t want to become a platform for high-level presentations. [metaLab's Zeega project does that.] And there’s a spectrum of media we’ll leave alone (e.g., audio) because integrating them with other media is difficult.

Q: How about shared search, i.e., searching other collections?

A: Great idea. We haven’t pursued this yet.

Q: Custodianship is not the same as meta-curation. Chris Daly could become a meta-curator. Also, there’s a lot of great art curation at Pinterest. Maybe you should be doing this on top of Pinterest? Maybe build spotlight tools for Pinteresters?

A: Great idea. We already do some work along those lines. This project happens to emerge from contact with a particular collection, one that doesn’t have an API.

Q: The fact that people are re-uploading the same images to Pinterest is due to the lack of standards.

Q: Are you going to be working on the vocabulary, or let someone else worry about that?

A: So far, we’re avoiding those questions…although it’s already a problem with the tags in this collection.

[Looks really interesting. I'd love to see it integrate with the work the Harvard Library Interoperability Initiative is doing.]


September 4, 2012

[2b2k] Crowdsourcing transcription

[This article is also posted at Digital Scholarship@Harvard.]

Marc Parry has an excellent article at the Chronicle of Higher Ed about using crowdsourcing to make archives more digitally useful:

Many people have taken part in crowdsourced science research, volunteering to classify galaxies, fold proteins, or transcribe old weather information from wartime ship logs for use in climate modeling. These days humanists are increasingly throwing open the digital gates, too. Civil War-era diaries, historical menus, the papers of the English philosopher Jeremy Bentham—all have been made available to volunteer transcribers in recent years. In January the National Archives released its own cache of documents to the crowd via its Citizen Archivist Dashboard, a collection that includes letters to a Civil War spy, suffrage petitions, and fugitive-slave case files.

Marc cites an article [full text] in Literary & Linguistic Computing that found that team members could have completed the transcription of works by Jeremy Bentham faster if they had devoted themselves to that task instead of managing the crowd of volunteer transcribers. Here are some more details about the project and its negative finding, based on the article in L&LC.

The project was supported by a grant of £262,673 from the Arts and Humanities Research Council, for 12 months, which included the cost of digitizing the material and creating the transcription tools. The end result was text marked up with TEI-compliant XML that can be easily interpreted and rendered by other apps.

During a six-month period, 1,207 volunteers registered, who together transcribed 1,009 manuscripts. 21% of those registered users actually did some transcribing. 2.7% of the transcribers produced 70% of all the transcribed manuscripts. (These numbers refer to the period before the New York Times publicized the project.)

Of the manuscripts transcribed, 56% were “deemed to be complete.” But the team was quite happy with the progress the volunteers made:

Over the testing period as a whole, volunteers transcribed an average of thirty-five manuscripts each week; if this rate were to be maintained, then 1,820 transcripts would be produced every twelve months. Taking Bentham’s difficult handwriting, the complexity and length of the manuscripts, and the text-encoding into consideration, the volume of work carried out by Transcribe Bentham volunteers is quite remarkable.


Still, as Marc points out, two Research Associates spent considerable time moderating the volunteers and providing the quality control required before certifying a document as done. The L&LC article estimates that the RAs could have produced 400 transcripts per month, 2.5x faster than the pace of the volunteers. But the volunteers got better as they gained experience, and improvements to the transcription software might make quality control less of an issue.
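Here is a back-of-the-envelope check of those rates, using only the figures quoted above. The 52-week year and the even spread across 12 months are my assumptions. The ratio comes out a bit above 2.5, which may just reflect rounding or a slightly different baseline in the article.

```python
WEEKS_PER_YEAR = 52      # assumption: a full 52-week year
MONTHS_PER_YEAR = 12

volunteer_per_week = 35  # average weekly volunteer output, as quoted
ra_per_month = 400       # the article's estimate for the Research Associates

# Annualize the volunteers' pace, then express it per month.
volunteer_per_year = volunteer_per_week * WEEKS_PER_YEAR
volunteer_per_month = volunteer_per_year / MONTHS_PER_YEAR

# How many times faster the RA estimate is than the volunteers' pace.
speedup = ra_per_month / volunteer_per_month

print(volunteer_per_year)
print(round(volunteer_per_month, 1))
print(round(speedup, 2))
```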

The L&LC article suggests two additional reasons why the project might be considered a success. First, it generated lots of publicity about the Bentham collection. Second, “no funding body would ever provide a grant for mere transcription alone.” But both of these reasons depend upon crowdsourcing being a novelty. At some point, it will not be.

Based on the Bentham project’s experience, it seems to me there are a few plausible possibilities for crowdsourcing transcription to become practical: First, as the article notes, if the project had continued, the volunteers might have gotten substantially more productive and more accurate. Second, better software might drive down the need for extensive moderation, as the article suggests. Third, there may be a better way to structure the crowd’s participation. For example, it might be practical to use Amazon Mechanical Turk to pay the crowd to do two or three independent passes over the content, which can then be compared for accuracy. Fourth, algorithmic transcription might get good enough that there’s less for humans to do. Fifth, someone might invent something incredibly clever that increases the accuracy of the crowdsourced transcriptions. In fact, someone already has: reCAPTCHA transcribes tens of millions of words every day. So you never know what our clever species will come up with.
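To make the "two or three independent passes" idea concrete, here is a minimal sketch of comparing passes by majority vote. It is my own illustration, not anything the Bentham team or Mechanical Turk provides. The `reconcile` function and the sample lines are hypothetical, and it assumes every pass splits the manuscript into the same number of lines.

```python
from collections import Counter

def reconcile(passes: list[list[str]]) -> list[tuple[str, bool]]:
    """For each line, return the most common reading and whether a strict majority agreed."""
    results = []
    # zip(*passes) groups together the readings of the same line from every pass.
    for readings in zip(*passes):
        text, votes = Counter(readings).most_common(1)[0]
        results.append((text, votes > len(readings) / 2))
    return results

# Three invented passes over the same two lines; the second pass has a typo.
passes = [
    ["The greatest happiness", "of the greatest number"],
    ["The greatest happiness", "of the greatest numher"],
    ["The greatest happiness", "of the greatest number"],
]
merged = reconcile(passes)
for text, agreed in merged:
    print(text, "" if agreed else "(needs review)")
```

Lines without a strict majority are the ones a moderator would still need to look at, so the point of the scheme is to shrink that pile, not to eliminate it.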

For now, though, the results of the Bentham project cannot be encouraging for those looking for a pragmatic way to generate high-quality transcriptions rapidly.


July 6, 2011

Hey, kids, let’s play Spot the Lobbyist!

The Sunlight Foundation is crowdsourcing the identification of lobbyists from photos of them at Congressional hearings and other such events. The aim is to get a better sense of the lay of the land. The first round is a hearing on the merger of AT&T and T-Mobile (because lord knows all that competition is driving telecommunications prices into the ground!). Go here for the photo and instructions. (And a tip of the hat to BoingBoing.)


February 1, 2011

What crowdsourcing looks like

Watch volunteers jump into and around the Google spreadsheet that’s coordinating the transcribing and translating of Egyptian voice-to-tweet msgs. Not exactly a Jerry Bruckheimer video, but the awesomeness of what we’re seeing crept up on me. (Check the link to the hi-rez version after you’ve read the TheNextWeb post; otherwise you can’t really see what’s going on.)


November 17, 2010

[defrag] Scott Porad on how we filter 0,000 user submissions per day

Scott Porad is from the Cheezburger Network, a network of humor and entertainment Web sites, including I Can Has Cheezburger, Memebase, The Daily What, and Failblog.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

It’s mainly user-moderated. As an example, Scott takes us through the steps for the Cheezburger site.

First, the home tab where you can submit content. The LOL builder makes it easy for users to add captions to images. They get 300,000-500,000 submissions to their network every month, but they only publish 1-2 percent. How do they cull? There’s no secret sauce, no magic algorithms. It’s a four-step human process.

Step 1: All submissions are screened by an editor, looking for image quality (not taken on a cellphone at night, etc.), appropriateness (no nudity, violence, racism), germaneness (a dog photo submitted to the cat site?), and keeping photos of humans out. Most of what gets submitted is junk, and gets screened out.

Step 2: Using the second tab, users vote or add a submission to their favorites. They also look at which content has been shared on social networks.

Step 3: User screening for offensiveness and copyright violations.

Step 4: Editorial curation.

They tried outsourcing it, but there’s too much that’s specific to our culture, and it requires too much editorial judgment.

Scott shows us the favorite photos in his own account profile. [Some very funny ones.]


October 11, 2009

Net uncovers new type of cloud

There are reports of a new type of cloud, one that is not currently in the official International Cloud Atlas. Or, possibly, it is a formation that’s been around forever, but the scattered reports are only now coalescing thanks to the Net.

According to Amazon’s review of Richard Hamblyn’s The Invention of Clouds, we only began thinking clouds could be categorized in 1802 when Luke Howard started giving public lectures. The very idea that clouds — the paradigm of uncatchable — could be divided into groups was (apparently) fascinating and thrilling. (Lamarck had also categorized clouds, but it didn’t catch on.)

A quick googly scan makes it seem that the cloud taxonomy is pretty messy. For example, the University of Illinois’ “cloud types” page lists four broad categories, and a list of miscellaneous clouds, each of which is categorized under one of the four basic types, evoking a “Huh?” reaction from at least one of us. The cloud taxonomy page at Univ. Missouri-Columbia lists eight types. Do you categorize by what they look like, how high they are, what they do (rain or not?), which celebrity profiles they resemble …? Categorizing clouds is truly a Borgesian task.

And, dammit, wouldn’t you know? Here’s a poem by Jorge Luis Borges called: “Clouds (II)” (with the line-endings probably removed):

Placid mountains meander through the air, or tragic cordilleras cast a pall, overshadowing the day. They are what we call clouds. And their shapes are often strange and rare. Shakespeare observed one once. It seemed to be a dragon. That one cloud of an afternoon still kindles in his words and blazes down, so that we go on seeing it today. What are the clouds? An architecture of chance? Perhaps they are the necessary things from which God weaves his vast imaginings, threads of a web of infinite expanse. Maybe the cloud is emptiness returning, just like the man who watches it this morning.

(translated by Richard Barnes. B; Robert Mezey; Richard Barnes. “Clouds (II). (poem).” The American Poetry Review. World Poetry, Inc. 1996. HighBeam Research. 11 Oct. 2009)

More Borges poems


August 26, 2009

Encyclopedia of Life – Now by Humans!

The Encyclopedia of Life is encouraging citizen contributions to its experts-vetted pages, so far with what seem like excellent results. There’s a good article about this at Science Daily. After two years, they’ve got 150,000 species pages underway, with 1.4 million stubs awaiting drafting.


July 1, 2009

Crowd-sourcing photos

Steve Myers at Poynter has a good story about NPR’s crowd-sourcing Dollar Politics project. One element of it was a request for help identifying 200 people who attended a Senate hearing, some percentage of whom were lobbyists.


May 13, 2009

TED translates

TED has started a great new project: Distributed translations of TED Talks. Taking a page from Global Voices, it’s crowd-sourcing translations.

This is exactly what should happen and is a great solution for relatively scarce resources such as TED talks. Figure out how to scale this and get yourself a Nobel prize.

By the way, TED has also introduced interactive transcripts: Click on a phrase in the transcript and the video skips to that spot. Very useful. And with a little specialized text editor, we could have the edit-video-by-editing-text app that I’ve been looking for.


February 27, 2009

MIT Museum crowd-sources exhibition

MIT will be 150 years old in two years. So, the MIT Museum (where you can see Judith Donath’s arresting and provocative info-overwhelm installation, which opened last night) is asking the public to nominate objects to put on display. The nominations themselves will remain online forever after as a very different sort of permanent display.
