Joho the Blog » 2010 » November

November 30, 2010

[bigdata] Panel: A Thousand Points of Data

Paul Ohm (law prof at U of Colorado Law School — here’s a paper of his) moderates a panel among those with lots of data. Panelists: Jessica Staddon (research scientist, Google), Thomas Lento (Facebook), Arvind Narayanan (post-doc, Stanford), and Dan Levin (grad student, U of Mich).

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Dan Levin asks what Big Data could look like in the context of law. He shows a citation network for a Supreme Court decision. “The common law is a network,” he says. He shows a movie of the citation network of the first thirty years of the Supreme Court. Fascinating. Marbury remains an edge node for a long time. In 1818, the net of internal references blooms explosively. “We could have a legalistic genome project,” he says. [Watch the video here.]

What will we be able to do with big data?

Thomas Lento (Facebook): Google flu tracking. Predicting via search terms.

Jessica Staddon (Google): Flu tracking works pretty well. We’ll see more personalization to deliver more relevant info. Maybe even tailor privacy and security settings.

Dan: If someone comes to you as a lawyer and asks whether she has a case, you’ll do a better job deciding if you can algorithmically scour the PACER database of court records. We are heading for a legal informatics revolution.

Thomas: Imagine someone could tell you everything about yourself, and cross ref you with other people, say you’re like those people, and broadcast it to the world. There’d be a high potential for abuse. That’s something to worry about. Further, as data gets bigger, the granularity and accuracy of predictions gets better. E.g., we were able to beat the polls by doing sentiment analysis of msgs on Facebook that mention Obama or McCain. If I know who your friends are and what they like, I don’t actually have to know that much about you to predict what sort of ads to show you. As the computational power gets to the point where anyone can run these processes, it’ll be a big challenge…

Jessica: Companies have a heck of a lot to lose if they abuse privacy.

Helen Nissenbaum: The harm isn’t always to the individual. It can be harm to the democratic system. It’s not about the harm of getting targeted ads. It’s about the institutions that can be harmed. Could someone explain to me why, to get the benefits of something like Flu Trends, you have to be targeted down to the individual level?

Jessica: We don’t always need the raw data for doing many types of trend analysis. We need the raw data for lots of other things.

Arvind: There are misaligned incentives everywhere. For the companies, it’s collect data first and ask questions later; you never know what you’ll need.

Thomas: It’s hard to understand the costs and benefits at the individual level. We’re all looking to build the next great iteration or the next great product. The benefits of collecting all that data are not clearly defined. The cost to the user is unclear, especially down the line.

Jessica: Yes, we don’t really understand the incentives when it comes to privacy. We don’t know if giving users more control over privacy will actually cost us data.

Arvind describes some of his work on re-identification, i.e., taking anonymized data and de-anonymizing it. (Arvind worked on the de-anonymization of Netflix records.) Aggregation is a much better way of doing things, although we have to be careful about it.

Q: In other fields, we hear about distributed innovation. Does big data require companies to centralize it? And how about giving users more visibility into the data they’ve contributed — e.g., Judith Donath’s data mirrors? Can we give more access to individuals without compromising privacy?

Thomas: You can do that already at FB and Google. You can see what your data looks like to an outside person. But it’s very hard to make those controls understandable. And big data processing requires capital expenditures, so it’ll be hard for individuals, although distributed processing might work.

Paul: Help us understand how to balance the costs and benefits? And how about the effect on innovation? E.g., I’m sorry that Netflix canceled round 2 of its contest because of the re-identification issue Arvind brought to light.

Arvind: No silver bullets. It can help to have a middleman, which helps with the misaligned incentives. This would be its own business: a platform that enables the analysis of data in a privacy-enabled environment. Data comes in one side. Analysis is done in the middle. There’s auditing and review.

Paul: Will the market do this?

Jessica: We should be thinking about systems like that, but also about the impact of giving the user more controls and transparency.

Paul: Big Data promises vague benefits — we’ll build something spectacular — but that’s a lot to ask for the privacy costs.

Paul: How much has the IRB (institutional review board) internalized the dangers of Big Data and privacy?

Dan: I’d like to see more transparency. I’d like to know what the process is.

Arvind: The IRB is not always well suited to the concerns of computer scientists. Maybe the current monolithic structure is not the best way.

Paul: What mode of solution of privacy concerns gives you the most hope? Law? Self-regulation? Consent? What?

Jessica: The one getting the least attention is the data itself. At the root of a lot of privacy problems is the need to detect anomalies. Large data sets help with this detection. We should put more effort into turning the data around to use it for privacy protection.

Paul: Is there an incentive in the corporate environment?

Jessica: Google has taken some small steps in this direction. E.g., Google’s “Got the wrong Bob” tool for Gmail warns you if you seem to have included the wrong person in a multi-recipient email. [It's a useful tool. I send more email to the Annie I work with than to the Annie I'm married to, so my autocomplete keeps wanting to send the Annie I work with information about my family. Got the wrong Bob catches those errors.]

Dan: It’s hard to come up with general solutions. The solutions tend to be highly specific.

Arvind: Consent. People think it doesn’t work, but we could reboot it. M. Ryan Calo at Stanford is working on “visceral notice,” rather than burying consent at the end of a long legal notice.

Thomas: Half of our users have used privacy controls, despite what people think. Yes, our controls could be simpler, but we’ve been working on it. We also need to educate people.

Q: FB keeps shifting the defaults more toward disclosure, so users have to go in and set them back.
Thomas: There were a couple of privacy migrations. Transitioning users is painful, and we let them adjust their privacy controls. There is a continuum between the value of the service and privacy: with total privacy, the service would have no value. It also wouldn’t work if everything were open: people will share more if they feel they control who sees it. We think we’ve stabilized it and are working on simplification and education.

Paul: I’d pick a different metaphor: The birds flying south in a “privacy migration”…

Thomas: In FB, you have to manage all these pieces of content that are floating around; you can’t just put them in your “house” for them to be private. We’ve made mistakes but have worked on correcting them. It’s a struggle to work out a mode of control over info and privacy that is still very new.


[bigdata] Ensuring Future Access to History

Brewster Kahle, Victoria Stodden, and Richard Cox are on a panel, chaired by the National Archives’ Director of Litigation, Jason Baron. The conference is being put on by Princeton’s Center for Information Technology Policy.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Brewster goes first. He’s going to talk about “public policy in the age of digital reproduction.” “We are in a jam,” he says, because of how we have viewed our world as our tech has changed. Brewster founded the Internet Archive, a non-profit library. The aim is to make freely accessible everything ever published, from the Sumerian texts on. “Everyone everywhere ought to have access to it” — that’s a challenge worthy of our generation, he says.

He says the time is ripe for this. The Internet is becoming ubiquitous. If there aren’t laptops, there are Internet cafes. And there are mobiles. Plus, storage is getting cheaper and smaller. You can record “100 channel years” of HD TV in a petabyte for about $200,000, and store it in a small cabinet. For about $1,200, you could store all of the text in the Library of Congress. Google’s copy of the WWW is about a petabyte. The Wayback Machine uses 3 petabytes and has about 150 billion pages. It’s used by 1.5M people/day. A small organization, like the Internet Archive, can take this task on.
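[A quick back-of-the-envelope check — mine, not Brewster’s: 100 channel-years is roughly 876,000 hours, so a petabyte works out to about 1.1 GB per hour of video, i.e., a ~2.5 Mbps stream, which is plausible for compressed HD.]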

This archive is dynamic, he says. The average Web page has 15 links. The average Web page changes every 100 days.

There are downsides to the archive. E.g., the Wayback Machine gets used to enable lawsuits. We don’t want people to pull out of the public sphere. “Get archived, go to jail” is not a useful headline. Brewster says that they once got an FBI letter asking for info, which they successfully fought (via the EFF). The Archive gets lots of lawyer letters. They get about 50 requests per week to have material taken out of the Archive. Rarely do people ask for other people’s stuff to be taken down. Once, the Scientologists wanted some copyright-infringing material taken down from someone else’s archived site; the Archive finally agreed to this. The Archive held a conference and came up with the Oakland Archive Policy for issues such as these.

Brewster points out that Jon Postel’s taxonomy is sticking: .com, .org, .gov, .edu, .mil … Perhaps we need separate policies for each of these, he says. And how do we take policy ideas and make them effective? E.g., if you put up a robots.txt exclusion, you will nevertheless get spidered by lots of people.
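[For reference, and this gloss is mine, not Brewster’s: a robots.txt exclusion is just a plain-text file at a site’s root. The Internet Archive’s crawler has historically honored the ia_archiver user-agent, so a site asking it to stay away would say something like:

    User-agent: ia_archiver
    Disallow: /

Brewster’s point is that plenty of other spiders simply ignore the file.]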

“We can build the Library of Alexandria,” he concludes, “but it might be problematic.”

Q: I’ve heard people say they don’t need to archive their sites because you will.
A: Please archive your own. More copies make us safe.

Q: What do you think about the Right to Oblivion movement that says that some types of content we want to self-destruct on some schedule, e.g. Facebook.
A: I have no idea. It’s really tough. Personal info is so damn useful. I wish we could keep our computers from being used against us in court — if we defined the 5th Amendment so that who “we” are includes our computers…


Richard Cox says if you golf, you know about info overload. It used to be that you had one choice of golf ball, Top-Flite. Now they have twenty varieties.

Archives are full of stories waiting to be told, he says. “When I think about Big Data…most archivists would think we’re talking about big science, the corporate world, and government.” Most archivists work in small cultural, public institutions. Richard is going to talk about the shifting role of archivists.

As early as the 1940s, archivists were talking about machine-readable records. The debates and experiments have been going on for many decades. One early approach was to declare that electronic records were not archives, because the archives couldn’t deal with them. (Archivists and records managers have always been at odds, he says, because RM is about retention schedules, i.e., deleting records.) Over time, archivists came up to speed. By 2000, some were dealing with electronic records. In 2010, many do, but many do not. There is a continuing debate. Archivists have spent too long debating among themselves when they need to be talking with others. But, “archivists tend not to be outgoing folks.” (Archivists have had issues with the National Archives because their methods don’t “scale down.”)

There are many projects these days. E.g., we now have citizen archivists who maintain their own archives and who may contribute to public archives. Who are today’s archivists? Archival educators are redefining the role. Richard believes archives will continue, but the profession may not. He recommends reading the CLIR report [presumably from the Council on Library and Information Resources; I couldn't catch the name for certain] on audio-visual archives. “I read it and I wept.” It says that we need people who understand the analog systems so that they can be preserved, but there’s no funding.


Victoria Stodden’s talk has the gloomy title “The Coming Dark Ages in Scientific Knowledge.”

She begins by pointing to the pervasive use of computers and computational methods in the sciences, and even in the humanities and law schools. E.g., Northwestern is looking at the word counts in Shakespearean works. It’s changing the type of scientific analysis we’re doing. We can do very complicated simulations that give us a new way of understanding our world. E.g., we do simulations of math proofs, quite different from the traditional deductive processes.

This means that what we’re doing as scientists is being stored in scripts, code, data, etc. But science only is science when it’s communicated. If the data and scripts are not shared, the results are not reproducible. We need to act as scientists to make sure that these data etc. are shared. How do we communicate results based on enormous data sets? We have to give access to those data sets. And what happens when those data sets change (corrected or updated)? What happens to results based on the earlier sets? We need to preserve the prior versions of the data. How do we version it? How do we share it? E.g., there’s an experiment at NSF: all proposals have to include a data management plan. The funders and journals have a strong role to play here.

Sharing scientific knowledge is harder than it sounds, but is vital. E.g., a recent study showed that a cancer therapy will be particularly effective based on individual genomes. But it was extremely hard to trace back the data and code used to get this answer. Victoria notes that peer reviewers do not check the data and algorithms.

Why a dark age? Because “without reproducibility, knowledge cannot be recreated or understood.” We need ways and processes of sharing. Without this, we only have scientists making proclamations.

She gives some recommendations: (1) Assessment of the expense of data/code archiving. (2) Enforcement of funding agency guidelines. (3) Publication requirements. (4) Standards for scientific tools. (5) Versioning as a scientific principle. (6) Licensing to realign scientific intellectual property with longstanding scientific norms (Reproducible Research Standard). [verbatim from her slide] Victoria stresses the need to get past the hurdles copyright puts in the way.

Q: Are you a pessimist?
A: I’m an optimist. The scientific community is aware of these issues and is addressing them.

Q: Do we need an IRS for the peer review process?
A: Even just the possibility that someone could look at your code and data is enough to make scientists very aware of what they’re doing. I don’t advocate code checking as part of peer review because it takes too long. Instead, throw your paper out into the public while it’s still being reviewed and let other scientists have at it.

Q: [rick] Every age has lost more info than it has preserved. This is not a new problem. Every archivist from the beginning of time has had to cope with this.


Jason Baron of the National Archives (who is not speaking officially) points to the volume of data the National Archives (NARA) has to deal with. E.g., in 2001, 32 million emails were transferred to NARA; in 2009, 250+ million were. He predicts there will be a billion presidential emails held at NARA by 2017. The first lawsuit over email was filed in 1989 (the PROFS case). Right now, the official policy of 300 govt agencies is to print email out for archiving. We can no longer deal with the info flow with manual processes. Processing of printed pages occurs when there’s a lawsuit or a FOIA request. Jason is pushing the value of search as a way of encouraging systematic intake of digital records. He dreams of search algorithms that retrieve all relevant materials. There are clustering algorithms emerging within law that hold hope. He also wants to retrieve docs other than via keywords. Visual analytics can help.

There are three languages we need: Legal, Records Management, and IT. How do we make the old ways work in the new? We need new filtering techniques, but also traditional notions of appraisal. “The neutral archivist may serve as an unbiased resource for the filtering of information in an increasingly partisan (untrustworthy) world” [from the slide].


November 29, 2010

Erred watching

Erred watching is like bird watching: You get excited and happy when you spot a new, rare error message.

I’ve spotted two in the past twenty-four hours.

One was a problem with PDAnet (which turned out to be trivial to solve) that put a line into the Console for which there were zero Google hits: “CDaemonCon 2 exits”

This morning, my Mac refused to boot. Instead of showing me a gray apple on a gray background, it displayed a circle with a line through it (“prohibitory sign”) on the gray background. It did this even when I tried to boot from an external disk. Apple seems to think this is a software problem, although I would have thought that it would have booted from the external disk. But maybe there’s something wrong with the external one.

Anyway, I’m very excited to have spotted these two rarities in their native habitats. Of course, when I have to reinstall all my software and realize all the stuff I had not backed up, I may be warbling a different tune. (It’s a new machine supplied by work, and I have been keeping all my files in the cloud. I think.)


If laws are outlawed, then only the outlaws will have the law

I cribbed that headline, approximately, from a comment at Slashdot about an article at The Escapist. It seems that the United States Copyright Group, a law firm that extorts (note: seemingly legally) settlements from people it thinks have infringed copyrights, is suing a lawyer who posted $19.95 instructions on how to represent yourself if the USCG comes after you. (The USCG itself has allegedly been caught ripping off its Web page design.)

The lawyer with the temerity to make it easier for people to respond to the USCG’s shakedowns, Graham Syfert, includes the forms you need in his twenty buck bundle. The thought that someone might not just roll over and cough up the $2,500 settlement so infuriated the USCG that they are suing Syfert for the $5,000 in lost time they’ve had to spend actually litigating the suits. The nerve of that Syfert guy!

If only there were a word for someone who sails on up to you, demands $2,500, and opens fire on anyone who dares to help you defend yourself. Oh yeah, there is: Pirate.


November 28, 2010

Annals of D’oh: Getting PDAnet to work on my Mac

I spent about an hour this morning trying to get PDAnet to work on my MacBook Pro the way it once did. Had I not magically missed the question in the FAQ that addresses the issue, I could have saved myself 58 minutes of a Sunday morning.

PDAnet allows you to use your mobile phone as a modem. So, when there’s no wifi, or when you don’t want to pay a hotel $25/day to use their stinking wifi, you can plug your mobile into your laptop via USB (PDAnet also supports Bluetooth, but as with every other Bluetooth app, it is as close to impossible to get to work as its evil masterminds could achieve) and surf the Web via your cellphone connection. Whatever data charges your cellphone company inflicts will apply.

PDAnet had been working like a dream for me. I had sprung for the paid version so that I could access secure sites. I was in tethered heaven. Then it stopped.

I uninstalled, redownloaded, rebooted, and tried again … and again and again. I wouldn’t get error msgs anywhere except in my browser when I tried to go to a site. The ping command in my terminal told me it couldn’t resolve the host. Weirdly, Console told me “before listening on daemon socket” and “CDaemonCon 2 exits.” These were only weird because the first has a single unhelpful hit at Google and the second has none. None! It’s an error message from another planet!

Then I re-checked the PDAnet help page. And there it says quite plainly: “Mac When connecting the PdaNet menubar icon keeps blinking and I have no Internet.” The explanation:

This is because the Network Interface create by PdaNet is not added on Mac automatically for some reason.

Try to open Networks Preferences and click on “+”, then select the new Network Interface the biggest “en” number such as en2 or en3 and click “Apply” with DHCP selected.

These instructions are almost correct. So, go to System Preferences, and then to Network. Click on “+” at the bottom of the list of connections on the left. Choose “Internet adaptor” from the list. “Internet adaptor” will have some number after it. After accepting, that connection will show up in the list on the left. Make sure that the choice to the right of “Configure IPv4” says “DHCP.” You should be good to go.
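If you’d rather do it from the Terminal, something like the following should be roughly equivalent. (This is my own untested sketch, not PDAnet’s instructions; your interface may be en2, en3, etc., and “PdaNet” is just a label I made up.)

    # List hardware ports so you can spot the new PdaNet interface (e.g., en2)
    networksetup -listallhardwareports

    # Create a service on that interface and make sure it uses DHCP
    sudo networksetup -createnetworkservice PdaNet en2
    sudo networksetup -setdhcp PdaNet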

PDAnet rocks when you remember to read the effing FAQ.

D’oh.


November 27, 2010

Rio’s violence bloggified

Debora Baldelli has a thought-provoking post at Global Voices about the reaction in social media to the recent violence in Rio de Janeiro.

My interest was initially piqued because I was in Rio a few weeks ago for a library conference, and found the city fascinating, regretting that I had given myself only one meagre afternoon free. The beaches were empty, and the tourist industry was just groggily waking itself up. Above the eerily unused festive booths, the poor look down, quite literally, from favelas wrapping the bases of the sudden peaks emblematic of the city. The mountains then continue up in inhuman, humbling, vertical lines.

Some cities a casual visitor for a day can fool himself into thinking he understands. Not Rio.

So, I was very interested to read Debora’s round-up of what the local social media had to say about the police reaction to a wave of violence in the city. For example:

The need to know what is true or false, and which areas were or were not being attacked, made @casodepolicia launch two hashtags #everdade (#truth) and #eboato (#rumor), through which information revealed on the web was verified in real time. The tweet reached 10,000 followers on the fifth day of the terror in the city.

Debora is positive about the overall contribution of social media:

… a good portion of the violence reported after this series of attacks was already common before. The sounds of shooting are not exactly anything new in Rio de Janeiro. What is different this time, however, is that everything is happening at the same time, and everything is being spoken of, reported and investigated as part of the same giant problem. The population of the city is being tempted to speak out and be heard (whether through the Disque Denúncia [hotline] or whether on Twitter), and being taken seriously by the authorities. When a person reports via tweet, sees their report being investigated, and hears of police action, this not only stimulates the participation of residents but also gives credibility to the police. Everybody wins.

Of course, the voices being heard in the social media do not come from the favelas, at least in Debora’s report. Matters will be different yet again when we can hear those voices, instead of just feeling their gaze.


November 26, 2010

From facts to data to commons

I’m keynoting a conference on Big Data at the Princeton Center for Information Technology Policy on Tuesday. No, I don’t know why they asked me either. Here’s an outline of what I plan on saying, although the talk is far from gelled.

“How sweet is the perception of a new natural fact,” Thoreau rhapsodized, and Darwin spent seven years trying to pin down one fact about barnacles. We were enamored with facts in the 19th century, when facts were used to nail down social policy debates. We have hoped and wished ever since that facts would provide a bottom to arguments: We can dig our way down and come to agreement. Indeed, in 1963, Bernard Forscher wrote a letter to Science, “Chaos in the Brickyard,” complaining that young scientists were generating too many facts and not enough theories; too many facts lead to chaos.

This is part and parcel of the traditional Western strategy for knowing our world. In a world too big to know™, our basic strategy has been to filter, reduce, and fragment knowledge. This was true all the way through the Information Age. Our fear of information overload now seems antiquated. Not only is there “no such thing as information overload, only filter failure” (Clay Shirky, natch); in the digital age, the nature of filters changes. On the Net, we do not filter out. We filter forward. That is, on the Net, a filter merely shortens the number of clicks it takes to get to an object; all the other objects remain accessible.

This changes the role and nature of expertise, and of knowledge itself. For traditional knowledge is a system of stopping points for inquiry. This is very efficient, but it’s based on the limitations of paper as knowledge’s medium. Indeed, science itself has viewed itself as a type of publishing: It is done in private and not made public until it’s certain-ish.

But the networking of knowledge gives us a new strategy. We will continue to use the old one where appropriate. Networked knowledge is abundant, unsettled, never done, public, imperfect, and contains disagreements within itself.

So, let’s go back to facts. [Work on that transition!] In the Age of Facts, we thought that facts provided a picture of the world — the Book of Nature was written in facts. Now we are in the Age of Big Data. This is different from the Info Age, when data actually was fairly scarce. The new data is super-abundant, linked, fallible, and often recognizably touched by frail humans. Unlike facts, these data are [Note to self: Remember to use the plural...the sign of quality!] often used to unnail, rather than to nail things down. While individual arguments, of course, use data to support hypotheses or to nail down conclusions, the system we’ve built for ourselves overall is more like a resource to support exploration, divergence, and innovation. Despite Bernard Forscher, too many facts, in the form of data, do not lead to chaos but to a commons.


November 24, 2010

Rich Net users are different

A new Pew Internet survey confirms some obvious assumptions, as well as some not-so-obvious ones, about differences in how the Net is used by those with more money and those with less.

For example, U.S. households with an income of $75K or more tend to have faster connections and more Net devices. But also:

  • “Even among those who use the internet, the well off are more likely than those with less income to use technology.”

  • The richer are more likely to get their news online.

  • “Some 86% of internet users in higher-income households go online daily, compared with 54% in the lowest income bracket.”

  • “79% of the internet users in the higher earning bracket have visited a government website at the local, state or federal level versus 56% of those who fall into the lowest-income group”

Obviously, there may well be other correlations going on here. But it’s an interesting report, and one that confirms for those who need it that the Net is different depending on the circumstances within which it is embedded.


November 23, 2010

Radio Berkman: Wikipedia

The latest Radio Berkman podcast is up. This time, it’s with Joseph Reagle, author of Good Faith Collaboration, about the culture of Wikipedia. And as a special bonus, if you act now (or later), there’s a bonus interview with Zack Exley, Chief Community Officer for the Wikimedia Foundation.


Three predictions

Someone on a mailing list I’m on asked the list to come up with predictions for 2011. Here are mine:

1. The Atlantic runs a cover story showing that the entire Cognitive Surplus (as per Clay Shirky) created by the Net has been squandered by a massive increase in masturbation.

2. Arrival of Forbin Project/Skynet comes one step closer when airport X-ray scanners are discovered to have spontaneously reinvented ChatRoulette for themselves. (Bonus prediction: 2b. At least one Republican Senator and/or Televangelist is “outed” when he is recorded telling a TSA employee “I think you missed a spot.”)

3. AOL wakes up, feels its long beard, and is “wondrous amazed” at all that has changed in the past ten years.



