Joho the Blog » big data

October 11, 2016

[liveblog] Vinny Senguttuvan on Predicting Customers

Vinny Senguttuvan is Senior Data Scientist at METIS. Before that, he was at the Facebook-based gaming company High 5 Games, which had 10M users. His talk at PAPIs: “Predicting Customers.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

The main challenge: Most of the players play for free. Only 2% ever spend money on the site, buying extra money to play. (It’s not gambling because you never cash out). 2% of those 2% contribute the majority of the revenue.

All proposed changes go through A/B testing. E.g., should we change the “Buy credits” button from blue to red? This is classic hypothesis testing. So you put up both options and see which gets the best results. It’s important to remember that there’s a cost to the change, so the A/B preference needs to be substantial enough. But often the differences are marginal. So you can increase the sample size. This complicates the process. “A long list of changes means not enough time per change.” And you want to be sure that the change affects the paying customers positively, which means taking even longer.

When they don’t have enough samples, they can bring down the confidence level required to make the change. Or they could bias one side of the hypothesis. And you can assume the variables are independent and run simultaneous A/B tests on various variables. High 5 does all three. It’s not perfect but it works.
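To make the hypothesis-testing step concrete, here is a minimal sketch, assuming a two-proportion z-test (the talk did not say which test High 5 uses). The function name and the button counts are made up for illustration. The `one_sided` flag corresponds to biasing one side of the hypothesis, and comparing the p-value to 0.05 rather than 0.01 corresponds to lowering the required confidence level.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b, one_sided=False):
    """Return (z, p_value) for H0: variants A and B convert at the same rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se

    def normal_cdf(x):
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    if one_sided:
        p_value = 1 - normal_cdf(z)            # H1: B converts better than A
    else:
        p_value = 2 * (1 - normal_cdf(abs(z)))  # H1: the rates differ
    return z, p_value

# Hypothetical counts: 200 of 10,000 users bought with the blue button,
# 245 of 10,000 with the red one.
z, p_two = two_proportion_z_test(200, 10_000, 245, 10_000)
_, p_one = two_proportion_z_test(200, 10_000, 245, 10_000, one_sided=True)

significant_at_95 = p_two < 0.05
significant_at_99 = p_two < 0.01
print(f"z={z:.2f} two-sided p={p_two:.3f} one-sided p={p_one:.3f}")
```

With these invented numbers the difference clears a 95% threshold but not a 99% one, which is the kind of marginal result the talk describes.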

Second, there is a popularity metric by which they rank or classify their 100 games. They constantly add games: it went from 15 to 100 in two years. This continuously changes the ranking of the games. Plus, some are launched locked. This complicates things. Vinny’s boss came up with a model of an n-dimensional casino, but it was too complex. Instead, they take 2 simple approaches: 1. An average-weighted spin. 2. Bayesian. Both predicted well but had flaws, so they used a type of average of both.
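The talk did not spell out the Bayesian approach, so the following is only a sketch of one common way to rank items with very different amounts of history: shrink each game's spins-per-day toward the catalog-wide rate (the posterior mean of a gamma-Poisson model). The function name, the `prior_days` strength, and the game figures are all assumptions, not High 5's method.

```python
def bayesian_popularity(spins, days_live, prior_rate, prior_days=14):
    """Posterior-mean spins per day, with a prior worth `prior_days` of data."""
    return (spins + prior_rate * prior_days) / (days_live + prior_days)

# Hypothetical games: (total spins, days since launch).
games = {
    "old_favorite": (90_000, 300),
    "new_hit": (2_000, 2),
    "new_dud": (100, 2),
}

# Use the catalog-wide spins per day as the prior rate.
total_spins = sum(s for s, _ in games.values())
total_days = sum(d for _, d in games.values())
prior = total_spins / total_days

scores = {name: bayesian_popularity(s, d, prior) for name, (s, d) in games.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A game with two days of data is pulled strongly toward the catalog average, so a brand-new title cannot jump to an extreme rank on a handful of spins, while a long-running game's score barely moves.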

Third: Survival analysis. They wanted to know how many users are still active a given time after they created their account, and when a user is at risk of discontinuing use. First, they grouped users into cohorts (people who joined within a couple of weeks of each other) and plotted survival rates over time. They also observed return rates of users after each additional day of absence. They also implemented a Cox survival model. They found that newer users were more likely to decline in their use of the product; early users are more committed. This pattern is widespread. That means they have to continuously acquire new players. They also alert users when they reach the elbow of disuse.
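As a rough illustration of the survival curves described above, here is a sketch using the Kaplan-Meier estimator. That is my choice of technique; the talk only named cohort survival plots and a Cox model. The cohort data is invented: each user has a number of days observed and a flag saying whether they actually quit (True) or were simply still active when the data was pulled (False).

```python
def kaplan_meier(durations, churned):
    """Return [(day, survival_probability)] at each day on which someone quit."""
    events = sorted(zip(durations, churned))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        day = events[i][0]
        quits = 0
        leaving = 0
        # Group everyone whose observation ends on the same day.
        while i < len(events) and events[i][0] == day:
            quits += int(events[i][1])
            leaving += 1
            i += 1
        if quits:
            survival *= 1 - quits / at_risk
            curve.append((day, survival))
        # Quitters and still-active (censored) users both leave the risk set.
        at_risk -= leaving
    return curve

# Hypothetical cohort of eight users who joined in the same two weeks.
durations = [3, 5, 5, 8, 12, 12, 20, 30]
churned = [True, True, False, True, True, True, False, False]

curve = kaplan_meier(durations, churned)
for day, s in curve:
    print(f"day {day:>2}: {s:.3f} of the cohort still active")
```

Users who are still active are not counted as quitting, but they do shrink the pool of users "at risk" afterward, which is what keeps the curve from overstating churn.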

Fourth: Predictive lifetime value. Lifetime value = total revenue from a user over the entire time they use the product. This is significant because of costs: 10-15% of the rev goes into ads to acquire customers. Their 365 day prediction model should be a time series, but they needed results faster, so they flipped it into a regression problem, predicting the 365 day revenue based on the user’s first month data: how they spent, purchase count, days of play, player level achievement, and the date joined. [He talks about regression problems, but I can’t keep up.] At that point it cost $2 to acquire a customer from an FB ad, and $6 from mobile apps. But when they tested, the mobile acquisitions were more profitable than those that came through FB. It turned out that FB was counting as new users any player who hadn’t played in 30 days, and was re-charging them for it. [I hope I got that right.]
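Here is a minimal sketch of what "flipping it into a regression problem" can look like, assuming plain ordinary least squares on three of the first-month features mentioned above. The talk did not say which regression model they used. The data is synthetic (generated from a known formula so the fit is easy to check) and the function names are mine.

```python
def fit_ols(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Synthetic first-month features: (dollars spent, purchase count, days played).
first_month = [(10, 1, 5), (0, 0, 2), (50, 4, 20), (25, 2, 12), (5, 1, 3), (80, 6, 25)]
# Synthetic 365-day revenue, built as 2 + 3*spend + 4*purchases + 0.5*days.
revenue_365 = [38.5, 3.0, 178.0, 91.0, 22.5, 278.5]

coefs = fit_ols(first_month, revenue_365)
new_user_ltv = predict(coefs, (20, 2, 10))
print(f"predicted 365-day revenue: ${new_user_ltv:.2f}")
```

A prediction like this is what you would compare against the per-user acquisition cost (the $2 and $6 figures above) to decide whether a channel pays for itself.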

Fifth: Recommendation systems. Pandora notes the features of songs and uses this to recommend similarities. YouTube makes recommendations based on relations among users. Non-matrix factorization [I’m pretty sure he just made this up] gives you the ability to predict the score for a video that you know nothing about in terms of content. But what if the ratings are not clearly defined? At High 5, there are no explicit ratings. They calculated a rating based on how often a player plays it, how long the session, etc. And what do you do about missing values: use averages. But there are too many zeroes in the system, so they use sparse matrix solvers. Plus, there is a semi-order to the games, so they used some human input. [Useful for library Stackscores]
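A small sketch of the implicit-rating idea, under my own assumptions: the 0-5 scale, the 50/50 weighting of play count and session length, and the caps are invented, since the talk gave no formula. Ratings are stored sparsely (only observed player/game pairs), and a missing pair falls back to that game's average, as described above.

```python
from collections import defaultdict

def implicit_rating(plays, avg_session_minutes, max_plays=50, max_minutes=60):
    """Map play behavior onto a 0-5 score; weights and caps are assumptions."""
    play_part = min(plays, max_plays) / max_plays
    time_part = min(avg_session_minutes, max_minutes) / max_minutes
    return 5 * (0.5 * play_part + 0.5 * time_part)

# Hypothetical log rows: (player, game, plays, average session minutes).
log = [
    ("ann", "slots_a", 50, 60),
    ("ann", "slots_b", 10, 30),
    ("bob", "slots_a", 25, 30),
    ("cat", "slots_b", 20, 15),
]

# Sparse storage: keep only the (player, game) pairs that were observed.
ratings = {(p, g): implicit_rating(n, m) for p, g, n, m in log}

by_game = defaultdict(list)
for (_, game), r in ratings.items():
    by_game[game].append(r)
game_mean = {g: sum(rs) / len(rs) for g, rs in by_game.items()}

def rating_or_average(player, game):
    """Observed rating if there is one, otherwise the game's average rating."""
    return ratings.get((player, game), game_mean[game])

print(rating_or_average("bob", "slots_b"))
```

A dictionary keyed by pair is the simplest sparse representation; a real system with millions of players would more likely hand a sparse matrix to a factorization routine.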


September 30, 2015

The miracle of the one network

The Open University of Catalonia just posted a very brief article of mine about the importance of the fact that Big Data is also Networked Big Data. Upon reading it in “print” I see that I buried the lede.

The amazing thing is that the same network that connects our machines also connects us. This enables a seamless conversation: if you can get at the data, you can get at people talking about the data.

Not only does the same network connect the data and the people making sense of the data, but layers of interoperability have grown on top of it. Increasingly the data is accessible in ways that make it easier and easier for humans to mash it up. And, of course, the sense that humans make of those mashups gets expressed in ways that are interoperable for humans: in language, with links.

That we take this awesomeness for granted makes that awesomeness awesome.


January 2, 2014

[2b2k] Social Science in the Age of Too Big to Know

Gary King [twitter:kinggarry], Director of Harvard’s Institute for Quantitative Social Science, has published an article (Open Access!) on the current status of this branch of science. Here’s the abstract:

The social sciences are undergoing a dramatic transformation from studying problems to solving them; from making do with a small number of sparse data sets to analyzing increasing quantities of diverse, highly informative data; from isolated scholars toiling away on their own to larger scale, collaborative, interdisciplinary, lab-style research teams; and from a purely academic pursuit focused inward to having a major impact on public policy, commerce and industry, other academic fields, and some of the major problems that affect individuals and societies. In the midst of all this productive chaos, we have been building the Institute for Quantitative Social Science at Harvard, a new type of center intended to help foster and respond to these broader developments. We offer here some suggestions from our experiences for the increasing number of other universities that have begun to build similar institutions and for how we might work together to advance social science more generally.

In the article, Gary argues that Big Data requires Big Collaboration to be understood:

Social scientists are now transitioning from working primarily on their own, alone in their offices, a style that dates back to when the offices were in monasteries, to working in highly collaborative, interdisciplinary, larger scale, lab-style research teams. The knowledge and skills necessary to access and use these new data sources and methods often do not exist within any one of the traditionally defined social science disciplines and are too complicated for any one scholar to accomplish alone.

He begins by giving three excellent examples of how quantitative social science is opening up new possibilities for research.

1. Latanya Sweeney [twitter:LatanyaSweeney] found “clear evidence of racial discrimination” in the ads served up by newspaper websites.

2. A study of all 187M registered voters in the US showed that a third of those listed as “inactive” in fact cast ballots, “and the problem is not politically neutral.”

3. A study of 11M social media posts from China showed that the Chinese government is not censoring speech but is censoring “attempts at collective action, whether for or against the government…”

Studies such as these “depended on IQSS infrastructure, including access to experts in statistics, the social sciences, engineering, computer science, and American and Chinese area studies.”

Gary also points to “the coming end of the quantitative-qualitative divide” in the social sciences, as new techniques enable massive amounts of qualitative data to be quantified, enriching purely quantitative data and extracting additional information from the qualitative reports.

Instead of quantitative researchers trying to build fully automated methods and qualitative researchers trying to make do with traditional human-only methods, now both are heading toward using or developing computer-assisted methods that empower both groups.

We are seeing a redefinition of social science, he argues:

We instead use the term “social science” more generally to refer to areas of scholarship dedicated to understanding, or improving the well-being of, human populations, using data at the level of (or informative about) individual people or groups of people.

This definition covers the traditional social science departments in faculties of schools of arts and science, but it also includes most research conducted at schools of public policy, business, and education. Social science is referred to by other names in other areas but the definition is wider than use of the term. It includes what law school faculty call “empirical research,” and many aspects of research in other areas, such as health policy at schools of medicine. It also includes research conducted by faculty in schools of public health, although they have different names for these activities, such as epidemiology, demography, and outcomes research.

The rest of the article reflects on pragmatic issues, including what this means for the sorts of social science centers to build, since community is “by far the most important component leading to success…” “If academic research became part of the X-games, our competitive event would be ‘extreme cooperation’.”


December 24, 2013

Schema.org…now for datasets!

I had a chance to talk with Dan Brickley today, a semanticizer of the Web whom I greatly admire. He’s often referred to as a co-creator of FOAF, but these days he’s at Google working on Schema.org. He pointed me to the work Schema has been doing with online datasets, which I hadn’t been aware of. Very interesting. Schema.org, as you probably know, provides a set of terms you can hide inside the HTML of your page that annotate what the visible contents are about. The major search engines (Google, Bing, Yahoo, Yandex) notice this markup and use it to provide more precise search results, and also to display results in ways that present the information more usefully. For example, if a recipe on a page is marked up with Schema.org terms, the search engine can identify the list of ingredients and let you search on them (“Please find all recipes that use butter but not garlic”) and display them in a more readable way. And of course it’s not just the search engines that can do this; any app that is looking at the HTML of a page can also read the Schema markup. There are schemas for an ever-expanding list of types of information…and now datasets.

If you go to the Schema.org Dataset page and scroll to the bottom where it says “Properties from Dataset,” you’ll see the terms you can insert into a page that talk specifically about the dataset referenced. It’s quite simple at this point, which is an advantage of Schema.org overall. But you can see some of the power of even this minimal set of terms over at Google’s experimental Schema Labs page where there are two examples.
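To give a feel for what marking up a dataset involves, here is a sketch that builds a Dataset description and wraps it for embedding in a page. It assumes JSON-LD, which is one of the syntaxes Schema.org terms can be written in; the post itself describes terms placed inside the page’s HTML. The dataset, its URLs (all on example.org), and its values are made up.

```python
import json

# Hypothetical dataset described with Schema.org Dataset terms.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "City parking meter locations",
    "description": "Locations and hourly rates of public parking meters.",
    "url": "https://example.org/datasets/parking-meters",
    "keywords": ["parking", "transportation"],
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/datasets/parking-meters.csv",
        "encodingFormat": "text/csv",
    },
}

# A script tag of this type is invisible to readers but readable by crawlers.
markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset, indent=2)
    + "\n</script>"
)
print(markup)
```

Any program that fetches the page can pull the same structure back out with a JSON parser, which is the point made above about apps other than search engines.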

The first example (click on the “view” button) does a specialized Google search looking for pages that have been marked up with Schema’s Dataset terms. In the search box, try “parking,” or perhaps “military.” Clicking on a return takes you to the original page that provides access to the dataset.

The second demo lets you search for databases related to education via the work done by LRMI (Learning Resource Metadata Initiative); the LRMI work has been accepted (except for the term useRightsUrl) as part of Schema.org. Click on the “view” button and you’ll be taken to a page with a search box, and a menu that lets you search the entire Web or a curated list. Choose “entire Web” and type in a search term such as “calculus.”

This is such a nice extension of Schema.org. Schema was designed initially to let computers parse information on human-readable pages (“Aha! ‘Butter’ on this page is being used as a recipe ingredient and on that page as a movie title”), but now it can be used to enable computers to pull together human-readable lists of available datasets.

I continue to be a fan of Schema because of its simplicity and pragmatism, and, because the major search engines look for Schema markup, people have a compelling reason to add markup to their pages. Obviously Schema is far from the only metadata scheme we need, nor does it pretend to be. But for fans of loose, messy, imperfect projects that actually get stuff done, Schema is a real step forward that keeps taking more steps forward.


November 15, 2013

[liveblog][2b2k] Saskia Sassen

The sociologist Saskia Sassen is giving a plenary talk at Engaging Data 2013. [I had a little trouble hearing some of it. Sorry. And in the press of time I haven’t had a chance to vet this for even obvious typos, etc.]

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

1. The term Big Data is ambiguous. “Big Data” implies we’re in a technical zone. It becomes a “technical problem” as when morally challenging technologies are developed by scientists who think they are just dealing with a technical issue. Big Data comes with a neutral charge. “Surveillance” brings in the state, the logics of power, how citizens are affected.

Until recently, citizens could not relate to a map that came out in 2010 that shows how much surveillance there is in the US. It was published by the Washington Post, but it didn’t register. 1,271 govt orgs and 1,931 private companies work on programs related to counterterrorism, homeland security and intelligence. There are more than 1 million people with top-secret clearance, and maybe a third are private contractors. In DC and environs, 33 building complexes are under construction or have been built for top-secret intelligence since 9/11. Together they are 22x the size of Congress. Inside these environments, the govt regulates everything. By 2010, DC had 4,000 corporate office buildings that handle classified info, all subject to govt regulation. “We’re dealing with a massive material apparatus.” We should not be distracted by the small individual devices.

Cisco lost 28% of its sales, in part as a result of its being tainted by the NSA taking of its data. This is alienating citizens and foreign govts. How do we stop this? We’re dealing with a kind of assemblage of technical capabilities, tech firms that sell the notion that for security we all have to be surveilled, and people. How do we get a handle on this? I ask: Are there spaces where we can forget about them? Our messy, nice complex cities are such spaces. All that data cannot be analyzed. (She notes that she did a panel that included the brother of a Muslim who has been indefinitely detained, so now her name is associated with him.)

3. How can I activate large, diverse spaces in cities? How can we activate local knowledges? We can “outsource the neighborhood.” The language of “neighborhood” brings me pleasure, she says.

If you think of institutions, they are codified, and they notice when there are violations. Every neighborhood has knowledge about the city that is different from the knowledge at the center. The homeless know more about rats than the center. Make open access networks available to them into a reverse wiki so that local knowledge can find a place. Leak that knowledge into those codified systems. That’s the beginning of activating a city. From this you’d get a Big Data set, capturing the particularities of each neighborhood. [A knowledge network. I agree! :)]

The next step is activism, a movement. In my fantasy, at one end it’s big city life and at the other it’s neighborhood residents enabled to feel that their knowledge matters.


Q: If local data is being aggregated, could that become Big Data that’s used against the neighborhoods?

A: Yes, that’s why we need neighborhood activism. The politicizing of the neighborhoods shapes the way the knowledge is used.

Q: Disempowered neighborhoods would be even less able to contribute this type of knowledge.

A: The problem is to value them. The neighborhood has knowledge at ground level. That’s a first step of enabling a devalued subject. The effect of digital networks on formal knowledge creates an informal network. Velocity itself has the effect of informalizing knowledge. I’ve compared environmental activists and financial traders. The environmentalists pick up knowledge on the ground. So, the neighborhoods may be powerless, but they have knowledge. Digital interactive open access makes it possible to bring together those bits of knowledge.

Q: Those who control the pipes seem to control the power. How does Big Data avoid the world being dominated by brainy people?

A: The brainy people at, say, Goldman Sachs are part of a larger institution. These institutions have so much power that they don’t know how to govern it. The US govt has been the most powerful in the world, with the result that it doesn’t know how to govern its own power. It has engaged in disastrous wars. So “brainy people” running the world through the Ciscos, etc., I’m not sure. I’m talking about a different idea of Big Data sets: distributed knowledges. E.g., Forest Watch uses indigenous people who can’t write, but they can tell before the trained biologists when there is something wrong in the ecosystem. There’s lots of data embedded in lots of places.

[She’s aggregating questions] Q1: Marginalized neighborhoods live being surveilled: stop and frisk, background checks, etc. Why did it take tapping Angela Merkel’s telephone to bring awareness? Q2: How do you convince policy makers to incorporate citizen data? Q3: There are strong disincentives to being out of the mainstream, so how can we incentivize difference?

A: How do we get the experts to use the knowledge? For me that’s not the most important aim. More important is activating the residents. What matters is that they become part of a conversation. A: About difference: Neighborhoods are pretty average places, unlike forest watchers. And even they’re not part of the knowledge-making circuit. We should bring them in. A: The participation of the neighborhoods isn’t just a utility for the central govt but is a first step toward mobilizing people who have been reduced to thinking that they don’t count. I think this is one of the most effective ways to contest the huge apparatus with the 10,000 buildings.


[liveblog] Noam Chomsky and Bart Gellman at Engaging Data

I’m at the Engaging Data 2013 conference where Noam Chomsky and Pulitzer Prize winner (twice!) Barton Gellman are going to talk about Big Data in the Snowden Age, moderated by Ludwig Siegele of the Economist. (Gellman is one of the three people Snowden vouchsafed his documents with.) The conference aims at having us rethink how we use Big Data and how it’s used.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

LS: Prof. Chomsky, what’s your next book about?

NC: Philosophy of mind and language. I’ve been writing articles that are pretty skeptical about Big Data. [Please read the orange disclaimer: I’m paraphrasing and making errors of every sort.]

LS: You’ve said that Big Data is for people who want to do the easy stuff. But shouldn’t you be thrilled as a linguist?

NC: When I got to MIT in 1955, I was hired to work on a machine translation program. But I refused to work on it. “The only way to deal with machine translation at the current stage of understanding was by brute force, which after 30-40 years is how it’s being done.” A principled understanding based on human cognition is far off. Machine translation is useful but you learn precisely nothing about human thought, cognition, language, anything else from it. I use the Internet. Glad to have it. It’s easier to push some buttons on your desk than to walk across the street to use the library. But the transition from no libraries to libraries was vastly greater than the transition from libraries to Internet. [Cool idea and great phrase! But I think I disagree. It depends.] We can find lots of data; the problem is understanding it. And a lot of data around us go through a filter so it doesn’t reach us. E.g., the foreign press reports that Wikileaks released a chapter about the secret TPP (Trans Pacific Partnership). It was front page news in Australia and Europe. You can learn about it on the Net but it’s not news. The chapter was on Intellectual Property rights, which means higher prices for less access to pharmaceuticals, and rams through what SOPA tried to do, restricting use of the Net and access to data.

LS: For you Big Data is useless?

NC: Big data is very useful. If you want to find out about biology, e.g. But why no news about TPP? As Sam Huntington said, power remains strongest in the dark. [approximate] We should be aware of the long history of surveillance.

LS: Bart, as a journalist what do you make of Big Data?

BG: It’s extraordinarily valuable, especially in combination with shoe-leather, person-to-person reporting. E.g., a colleague used traditional reporting skills to get the entire data set of applicants for presidential pardons. Took a sample. More reporting. Used standard analytics techniques to find that white people are 4x more likely to get pardons, that campaign contributors are also more likely. It would be likely in urban planning [which is Senseable City Labs’ remit]. But all this leads to more surveillance. E.g., I could make the case that if I had full data about everyone’s calls, I could do some significant reporting, but that wouldn’t justify it. We’ve failed to have the debate we need because of the claim of secrecy by the institutions in power. We become more transparent to the gov’t and to commercial entities while they become more opaque to us.

LS: Does the availability of Big Data and the Internet automatically mean we’ll get surveillance? Were you surprised by the Snowden revelations?

NC: I was surprised at the scale, but it’s been going on for 100 years. We need to read history. E.g., the counter-insurgency “pacification” of the Philippines by the US. See the book by McCoy [maybe this]. The operation used the most sophisticated tech at the time to get info about the population to control and undermine them. That tech was immediately used by the US and Britain to control their own populations, e.g., Woodrow Wilson’s Red Scare. Any system of power (the state, Google, Amazon) will use the best available tech to control, dominate, and maximize their power. And they’ll want to do it in secret. Assange, Snowden and Manning, and Ellsberg before them, are doing the duty of citizens.

BG: I’m surprised how far you can get into this discussion without assuming bad faith on the part of the government. For the most part what’s happening is that these security institutions genuinely believe most of the time that what they’re doing is protecting us from big threats that we don’t understand. The opposition comes when they don’t want you to know what they’re doing because they’re afraid you’d call it off if you knew. Keith Alexander said that he wishes that he could bring all Americans into this huddle, but then all the bad guys would know. True, but he’s also worried that we won’t like the plays he’s calling.

LS: Bruce Schneier says that the NSA is copying what Google and Yahoo, etc. are doing. If the tech leads to snooping, what can we do about it?

NC: Govts have been doing this for a century, using the best tech they had. I’m sure Gen. Alexander believes what he’s saying, but if you interviewed the Stasi, they would have said the same thing. Russian archives show that these monstrous thugs were talking very passionately to one another about defending democracy in Eastern Europe from the fascist threat coming from the West. Forty years ago, RAND released Japanese docs about the invasion of China, showing that the Japanese had heavenly intentions. They believed everything they were saying. I believe these are universals. We’d probably find it for Genghis Khan as well. I have yet to find any system of power that thought it was doing the wrong thing. They justify what they’re doing for the noblest of objectives, and they believe it. The CEOs of corporations as well. People find ways of justifying things. That’s why you should be extremely cautious when you hear an appeal to security. It literally carries no information, even in the technical sense: it’s completely predictable and thus carries no info. I don’t doubt that the US security folks believe it, but it is without meaning. The Nazis had their own internal justifications.

BG: The capacity to rationalize may be universal, but you’ll take the conversation off track if you compare what’s happening here to the Stasi. The Stasi were blackmailing people, jailing them, preventing dissent. As a journalist I’d be very happy to find that our govt is spying on NGOs or using this power for corrupt self-enriching purposes.

NC: I completely agree with that, but that’s not the point: The same appeal is made in the most monstrous of circumstances. The freedom we’ve won sharply restricts state power to control and dominate, but they’ll do whatever they can, and they’ll use the same appeals that monstrous systems do.

LS: Aren’t we all complicit? We use the same tech. E.g., Prof. Chomsky, you’re the father of natural language processing, which is used by the NSA.

NC: We’re more complicit because we let them do it. In this country we’re very free, so we have more responsibility to try to control our govt. If we do not expose the plea of security and separate out the parts that might be valid from the vast amount that’s not valid, then we’re complicit because we have the oppty and the freedom.

LS: Does it bug you that the NSA uses your research?

NC: To some extent, but you can’t control that. Systems of power will use whatever is available to them. E.g., they use the Internet, much of which was developed right here at MIT by scientists who wanted to communicate freely. You can’t prevent the powers from using it for bad goals.

BG: Yes, if you use a free online service, you’re the product. But if you use a for-pay service, you’re still the product. My phone tracks me and my social network. I’m paying Verizon about $1,000/year for the service, and VZ is now collecting and selling my info. The NSA couldn’t do its job as well if the commercial entities weren’t collecting and selling personal data. The NSA has been tapping into the links between their data centers. Google is racing to fix this, but a cynical way of putting this is that Google is saying “No one gets to spy on our customers except us.”

LS: Is there a way to solve this?

BG: I have great faith that transparency will enable the development of good policy. The more we know, the more we can design policies to keep power in place. Before this, you couldn’t shop for privacy. Now a free market for privacy is developing as the providers now are telling us more about what they’re doing. Transparency allows legislation and regulation to be debated. The House Repubs came within 8 votes of prohibiting call data collection, which would have been unthinkable before Snowden. And there’s hope in the judiciary.

NC: We can do much more than transparency. We can make use of the available info to prevent surveillance. E.g., we can demand the defeat of TPP. And now hardware in computers is being designed to detect your every keystroke, leading some Americans to be wary of Chinese-made computers, but the US manufacturers are probably doing it better. And manufacturers for years have been trying to design fly-sized drones to collect info; that’ll be around soon. Drones are a perfect device for terrorists. We can learn about this and do something about it. We don’t have to wait until it’s exposed by Wikileaks. It’s right there in mainstream journals.

LS: Are you calling for a political movement?

NC: Yes. We’re going to need mass action.

BG: A few months ago I noticed a small gray box with an EPA logo on it outside my apartment in NYC. It monitors energy usage, useful to preventing brown outs. But it measures down to the apartment level, which could be useful to the police trying to establish your personal patterns. There’s no legislation or judicial review of the use of this data. We can’t turn back the clock. We can try to draw boundaries, and then have sufficient openness so that we can tell if they’ve crossed those boundaries.

LS: Bart, how do you manage the flow of info from Snowden?

BG: Snowden does not manage the release of the data. He gave it to three journalists and asked us to use our best judgment (he asked us to correct for his bias about what the most important stories are) and to avoid direct damage to security. The documents are difficult. They’re often incomplete and can be hard to interpret.


Q: What would be a first step in forming a popular movement?

NC: Same as always. E.g., the women’s movement began in the 1960s (at least in the modern movement) with consciousness-raising groups.

Q: Where do we draw the line between transparency and privacy, given that we have real enemies?

BG: First you have to acknowledge that there is a line. There are dangerous people who want to do dangerous things, and some of these tools are helpful in preventing that. I’ve been looking for stories that elucidate big policy decisions without giving away specifics that would harm legitimate action.

Q: Have you changed the tools you use?

BG: Yes. I keep notes encrypted. I’ve learned to use the tools for anonymous communication. But I can’t go off the grid and be a journalist, so I’ve accepted certain trade-offs. I’m working much less efficiently than I used to. E.g., I sometimes use computers that have never touched the Net.

Q: In the women’s movement, at least 50% of the population stood to benefit. But probably a large majority of today’s population would exchange their freedom for convenience.

NC: The trade-off is presented as being for security. But if you read the documents, the security issue is how to keep the govt secure from its citizens. E.g., Ellsberg kept a volume of the Pentagon Papers secret to avoid affecting the Vietnam negotiations, although I thought the volume really only would have embarrassed the govt. Security is in fact not a high priority for govts. The US govt is now involved in the greatest global terrorist campaign that has ever been carried out: the drone campaign. Large regions of the world are now being terrorized. If you don’t know if the guy across the street is about to be blown away, along with everyone around, you’re terrorized. Every time you kill an Al Qaeda terrorist, you create 40 more. It’s just not a concern to the govt. In 1950, the US had incomparable security; there was only one potential threat: the creation of ICBM’s with nuclear warheads. We could have entered into a treaty with Russia to ban them. See McGeorge Bundy’s history. It says that he was unable to find a single paper, even a draft, suggesting that we do something to try to ban this threat of total instantaneous destruction. E.g., Reagan tested Russian nuclear defenses that could have led to horrible consequences. Those are the real security threats. And it’s true not just of the United States.


[2b2k] Big Data and the Commons

I’m at the Engaging Big Data 2013 conference put on by Senseable City Lab at MIT. After the morning’s opener by Noam Chomsky (!), I’m leading one of 12 concurrent sessions. I’m supposed to talk for 15-20 mins and then lead a discussion. Here’s a summary of what I’m planning on saying:

Overall point: To look at the end state of the knowledge network/Commons we want to get to

Big Data started as an Info Age concept: magnify the storage and put it on a network. But you can see how the Net is affecting it:

First, there is a set of values that are being transformed:
– From accuracy to scale
– From control to innovation
– From ownership to collaboration
– From order to meaning

Second, the Net is transforming knowledge, which is changing the role of Big Data:
– From filtered to scaled
– From settled to unsettled and under discussion
– From orderly to messy
– From done in private to done in public
– From a set of stopping points to endless links

If that’s roughly the case, then we can see a larger Net effect. The old Info Age hope (naive, yes, but it still shows up at times) was that we’d be able to create models that ultimately interoperate and provide an ever-increasing and ever-more detailed integrated model of the world. But in the new Commons, we recognize that not only won’t we ever derive a single model, there is tremendous strength in the diversity of models. This Commons then is enabled if:

  • All have access to all
  • There can be social engagement to further enrich our understanding
  • The conversations default to public

So, what can we do to get there? Maybe:

  • Build platforms and services
  • Support Open Access (and, as Lewis Hyde says, “beat the bounds” of the Commons regularly)
  • Support Linked Open Data

Questions if the discussion needs kickstarting:

  • What Big Data policies would help the Commons to flourish?
  • How can we improve the diversity of those who access and contribute to the Commons?
  • What are the personal and institutional hesitations that are hindering the further development of the Commons?
  • What role can and should Big Data play in knowledge-focused discussions? With participants who are not mathematically or statistically inclined?
  • Does anyone have experience with Linked Data? Tell us about it?


I just checked the agenda, which of course I should have done earlier, and discovered that of the 12 sessions today, 11 are being led by men. Had I done that homework, I would not have accepted their invitation.


May 26, 2013

[2b2k] Is big data degrading the integrity of science?

Amanda Alvarez has a provocative post at GigaOm:

There’s an epidemic going on in science: experiments that no one can reproduce, studies that have to be retracted, and the emergence of a lurking data reliability iceberg. The hunger for ever more novel and high-impact results that could lead to that coveted paper in a top-tier journal like Nature or Science is not dissimilar to the clickbait headlines and obsession with pageviews we see in modern journalism.

The article’s title points especially to “dodgy data,” and the item in this list that’s by far the most interesting to me is the “data reliability iceberg,” and its tie to the rise of Big Data. Amanda writes:

…unlike in science…, in big data accuracy is not as much of an issue. As my colleague Derrick Harris points out, for big data scientists the ability to churn through huge amounts of data very quickly is actually more important than complete accuracy. One reason for this is that they’re not dealing with, say, life-saving drug treatments, but with things like targeted advertising, where you don’t have to be 100 percent accurate. Big data scientists would rather be pointed in the right general direction faster – and course-correct as they go – than have to wait to be pointed in the exact right direction. This kind of error-tolerance has insidiously crept into science, too.

But, the rest of the article contains no evidence that the last sentence’s claim is true because of the rise of Big Data. In fact, even if we accept that science is facing a crisis of reliability, the article doesn’t pin this on an “iceberg” of bad data. Rather, it seems to be a melange of bad data, faulty software, unreliable equipment, poor methodology, undue haste, and o’erweening ambition.

The last part of the article draws some of the heat out of the initial paragraphs. For example: “Some see the phenomenon not as an epidemic but as a rash, a sign that the research ecosystem is getting healthier and more transparent.” It makes the headline and the first part seem a bit overstated — not unusual for a blog post (not that I would ever do such a thing!) but at best ironic given this post’s topic.

I remain interested in Amanda’s hypothesis. Is science getting sloppier with data?


October 1, 2012

[sogeti] Andrew Keen on Vertigo and Big Data

Andrew Keen is speaking. (I liveblogged him this spring when he talked at a Sogeti conference.) His talk’s title: “How today’s online social revolution is dividing, diminishing, and disorienting us.” [Note: Posted without rereading because I’m about to talk. I may go back and do some cleanup.]

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Andrew opens with an anecdote. He grew up as a Jew in Britain. His siblings were split between becoming lawyers or doctors. But his mother asked him if he’d like to be the anti-Christ. So, now he’s grown up to become the anti-Christ of Silicon Valley.

“I’m not usually into intimacy,” but look at each other. How much do we know about each other? Not much. One of the great joys is getting to know one another. By 2017 there will be 15x more data flowing over the network. Billions of intelligent devices. “The world we are going into is one in which 20-25 years…you strangers will show up in a big city in London and you’ll know everything about each other.” You’ll know one another’s histories, interests…

“My argument is that we’re all stuck in Digital Vertigo. We’re all participants in a digital noir.” He shows a clip from Vertigo. “In the future these kinds of scenes won’t be possible. There won’t be private detectives…So this movie about the unfolding of understanding between strangers won’t happen.” What happens to policing? “Will we be guilty if we don’t carry our devices?” [SPOILERS] The blonde in this movie doesn’t exist. She’s a brunette shopgirl from Kansas. “The movie is about a deception…A classic Hitchcock narrative of falling in love with something that doesn’t exist. A good Catholic narrative…It’s a warning about falling in love with something that is too good to be true.” That’s what we’re doing with social media and big data. We’re told big data brings us together. They tell us the Net gives us the opportunity for human beings to come together, to realize themselves as social beings. Big data allows us to become human.

This is about more than the Net. The revolution that Carlota is talking about is one in which the Net becomes central in the way we live our lives. Fifteen years ago, Doc Searls, David W., and I would be marginal computer nerds, and now our books can be found in any book store. [Doc is in the audience also.]

He shows a clip from The Social Network: “We lived on farms. Now we’re going to live on the Internet.” It’s the platform of 21st century life. This is not a marginal or media issue. It is about the future of society. Many people think this network will solve the core problems of life. We now have an ecosystem of apps in the business of eliminating loneliness. E.g., Highlight, “the darling of the recent SxSW show.” They say it’s “a fun way to learn more about people nearby.” Then he shows a clip from The Truman Show. His point: We’re all in our own Truman Shows. The destruction of privacy. No difference between public and private. We’re being authentic. We’re knowingly involving ourselves in this.

A quote: “Vertigo is the ultimate critics’ film because it is a dreamlike film about people who are not sure who they are but who are busy reconstructing themselves and each other to fit a kind of cinema ideal of the ideal soul mate.” Substitute social media for film. We’re losing what it means to be human. We’re destroying the complexity of our inner lives. We’re only able to live externally. [This is what happens when your conceptual two poles are public and private. It changes when we introduce the term “social.”]

Narcissism isn’t new. Digital narcissism has reached a climax. As we’re given personal broadcasting platforms, we’re increasingly deluded into thinking we’re interesting and important. Mostly it reveals our banality, our superficiality. [This is what you get when your conceptual poles are taken from broadcast media.]

It’s not just digital narcissism. “Visibility is a trap,” said Foucault. Hypervisibility is a hypertrap. Our data is central to Facebook and others becoming viable businesses. The issue is the business model. Data is oil, and it’s owned by the rich. Zuckerberg, Reid Hoffman, et al., are data barons. Read Susan Cain’s “Quiet”: introverts drive innovation. E.g., Steve Wozniak. Sharing is not good for innovation. Discourage your employees from talking with one another all the time. It makes them less thoughtful. It creates groupthink. If you want them to think for themselves, “take away their devices and put them in dark rooms.”

It’s also a trap when it comes to govt. Many govts are using the new tech to spy on their citizens. Cf. Bentham’s panopticon, which was corrupted into 1984 and industrial totalitarianism. We need to go back to the Industrial Age and JS Mill — Mill’s On Liberty is the best antidote to Bentham’s utilitarianism. [? I see more continuity than antidote.]

To build a civilized golden age: 1. There is a role for govt. The market needs regulation. 2. “I’m happy that the EU is working on this…and came out against FB facial recognition software. … We have a right to forget.” “It’s the most unhuman of things to remember everything.” “We shouldn’t idolize the never-forgetting nature of Big Data.” “To forget and forgive is the core essence of being human.” 3. We need better business models. We don’t want data to be the new oil. I want businesses that charge. “The free economy has been a catastrophe.”

He shows the end of The Truman Show. [SPOILER] As Truman enters reality, it’s a metaphor for our hope. We can only protect our humanness by retreating into dark, quiet places.

He finishes with a Vermeer that shows us a woman about whom we know nothing. “In our Age of Facebook, we need to build a world in which the woman in blue can read that letter, not reveal herself, not reveal her mystery…”

Q: You’re surprisingly optimistic today. In the movie Vertigo, there’s an inevitability. How about the inevitability of this social movement? Are you tilting at windmills?

Idealists tilt at windmills. People are coming around to understanding that the world we’re collectively creating is not quite right. It’s making people uneasy. More and more books, articles, etc., say that FB is deeply exploitative. We’re all like Jimmy Stewart in Vertigo. The majority of people in the world don’t want to give away their data. As more of the traditional world comes onto the Net, there will be more resistance to collapsing the private and the public. Our current path is not inevitable. Tech is religion. Tech is not autonomous, not a first mover. We created Big Data and need to reestablish our domination over it. I’m cautiously optimistic. But it could go wrong, especially in authoritarian regimes. In Silicon Valley people say privacy is dead, get over it. But privacy is essential. Once we live this public ideal, then who are we?


[2b2k][sogeti] Big Data conference session

I’m at Sogeti‘s annual executive conference, which brings together about 80 CEOs. I’m here with Doc Searls, Andrew Keen, and others. I’ve spoken at other Sogeti events, and I am impressed with their commitment to providing contrary points of view, including views at odds with their own corporate interests. (My one complaint: They expect all attendees to have an iPad or iPhone so that they can participate in the realtime survey. Bad symbolism.) (Disclosure: They’re paying me to speak. They are not paying me to say something nice about them.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Menno van Doorn begins by talking about the quantified self movement, claiming that they sometimes refer to themselves as “datasexuals” :) All part of Big Data, he says. To give us an idea of bigness, he relates the Legend of Sessa: “Give me grain, doubling the amount for each square on a chessboard.” Exponential growth meant that by the time you hit the second half of the chessboard, you’re in impossible numbers. Experts say that’s where we were in 2006 when it comes to data. But “there’s no such thing as too much data.” “Big Data is powering the next industrial revolution. Data is the new oil.”
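To make the chessboard arithmetic concrete, here is a minimal sketch. It assumes the usual telling of the legend, with one grain on the first square (the talk only said "doubling"), and the function names are made up for illustration.

```python
# Legend of Sessa: grains double on each successive chessboard square.
# Assumption: the first square holds exactly one grain.

def grains_on_square(n: int) -> int:
    """Grains on square n (1-indexed): 1, 2, 4, 8, ..."""
    return 2 ** (n - 1)

def total_through(n: int) -> int:
    """Total grains on squares 1..n (geometric series sum: 2^n - 1)."""
    return 2 ** n - 1

first_half = total_through(32)    # squares 1-32
whole_board = total_through(64)   # squares 1-64

print(first_half)                          # 4294967295
print(whole_board)                         # 18446744073709551615
print(grains_on_square(33) > first_half)   # True
```

The last line is the point of "the second half of the chessboard": square 33 alone holds more grain than the entire first half combined, and every later square repeats that trick.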

Big Data is about (1) lots of data, (2) at high velocity, (3) used in a variety of ways. (“volume, velocity, variety.”) Michael Chui says that there’s billions in revenues to gain, including from efficiencies. But, Chui says, there are no best practices. The value comes from “human exhaust.” I.e., your digital footprint, what you leave behind in your movement through the Net. Menno thinks of this as “your recorded future.”

Three examples:

1. Menno points to Target, a company that can predict life-changing events among its customers. E.g., based on purchases of 25 products, they can predict which customers are pregnant and roughly when they are due. But, this led to Target sending promotional materials for pregnancy to young girls whose parents learned this way that their daughters were pregnant.

2. In SF, they send out police cars to neighborhoods based on 14-day predictions of where crime will occur, based on data about prior crime patterns.

3. Schufa, a German credit agency, announced they’d use social media to assess your credit worthiness. Immediately a German Minister said, “Schufa cannot become the Big Brother of the business world.”

Two forces are in contention and will determine how much Big Data changes us. Today, the conference will look at the dawn of the age of big data, and then how disruptive it will be for society (the session Keen and I are in). Day 2: Bridging the gap to the new paradigm, Big Data’s fascinating future, and Decision Time: Taming Big Brother.


Carlota Perez, Prof. of Tech and Socio-Economic Development, from Venezuela speaks now. She is a “neo-Schumpeterian.” She says her role in the conference is to “locate the current crisis.” What is the real effect on innovation, and why are we only midway along in feeling the impact?

There have been 5 tech revolutions in the past 240 years: 1. 1771 Industrial rev. 2. 1829 Age of steam, coal and railways. 3. 1875 Steel and heavy engineering (the first globalization). 4. Age of the automobile, oil, petrochem and mass production. 5. 1971 Age of info tech and telecom. We’re only halfway through that last one. The next revolution queued up: age of biotech, bioelectronics, nanotech, and new materials. [I’m surprised she doesn’t count telegraph + radio + telephone, etc., as a comms rev. And I’d separate the Net as its own rev. But that’s me.]

Lifecycle of a tech rev: gestation, induction, deployment, exhaustion. The “big bang” tends to happen when the prior rev is reaching exhaustion. The structure of revs: new cheap inputs, new products, new processes. A new infrastructure arises. And a constellation of new dynamic industries that grow the world economy.

Why call these “revolutions”, she asks? Because they transform the whole economy. They bring new organizational principles and new best practice models. I.e., a new “techno-economic paradigm.” E.g., we’ve gone from mass production to flexible production. Closed pyramids to open networks. Stable routines to continuous improvement. “Information technology finds change natural.” From human resources to human capital (from raw materials to value). Suppliers and clients to value network partners. Fixed plans to flexible strategies. Three-tier markets (big, medium, small) to hyper-segmented markets. Internationalization to globalization. Information as costly burden to info as asset. Together, these constitute a radical change in managerial common sense.

The diffusion process is broken in two: Bubble, followed by a crash, and then the Golden Age. During the bubble, financial capital forces diffusion. There is income and demand polarization. Then the crash. Then there is an institutional recomposition, leading to a golden age in which everyone benefits. Production capital takes over from financial capital (driven by the govt), and there is better distribution of income and demand.

She looks at the 5 revs, and finds the same historic pattern that she just sketched.

Two major differences between installation and deployment: 1. Bubbles vs. patient (= long-term) capital. 2. Concentrated innovation to modernize industries vs. innovation in all industries that use the new technologies. “Understanding this sequence is essential for strategic thinking.”

The structure of innovation in deployment: a new coherent fabric of the economy emerges, leading to a golden age. Also, oligopolies emerge which means there’s less unhelpful competition. (?)

Example of prior rev: home electrical appliances: In the installation period, we had a bunch of electric utilities going into homes in the 1910s and 1930s. During the revision, we get a few more. But then in the 1950s-70s, we get a surge of new appliances, including tape recorder, microwave, even the electric toothbrush. It’s enabled by universal electricity and driven by suburbanization. It’s the same pattern if you look at textile fibers, from rayon and acetate during installation, to a huge number during deployment. E.g., structural and packaging plastics: installation brought bakelite, polystyrene and polyethylene, and then a flood of innovation during deployment. “The various systems of the ICT revolution will follow a similar sequence.” [Unless it follows the Tim Wu pattern of consolidation, e.g., everyone being required to use an iPad at a conference] During installation period, ICT was in constant supply push mode. Now must respond to demand pull. “The paradigm and its potential are now understood by all. Demand (in vol and nature) becomes the driving force.”

This shifts the role of the CIO. To modernize a mature company, during installation you brought in an expert in modernization, articulating the hw and sw being pushed by the suppliers. During the deployment phase, a modern company that is innovating for strategic expansion, the CIO is an expert in strategy, specifying needs and working with suppliers. “The CIO is no longer staff. S/he must be directly involved in strategy.”

There are 3 main forces for innovation in the next 2-3 decades, as is true for all the revs. 1. Deepening and widening of the ICT tech rev, responding to user needs. 2. The users of ICT across all industries and activities. 3. The gestation of the next rev (probably biotech, nanotech, and new materials).

Big Data is likely to have a big role in each of those directions.

Q: Why are we only 50% of the way through?

A: Because the change after the recession is like opening a dam. Once you get to the point where you can have a comfortable innovation prospective, imagine the market possibilities.

Q: What can go wrong?

A: Governments. Unfettered free markets are indispensable for the installation process. Lightly guided markets are needed in the golden age. Free markets work when you need to force everyone to change. But now no longer: The state has to come in. But govts are drunk with free markets. Now finance is incompetent. “They don’t dare invest in real things.” “Ideology is so strong and the understanding of history is so shallow that we’re not doing the right thing.”


Christopher Ahlberg speaks now. He’s the founder of Recorded Future. His topic: “Turning the Web into Predictive Signals.”

We see events like Arab Spring and wonder if we could have predicted them. Three things are going on: 1. Moving from smaller to larger datasets. 2. From structured to unstructured data (from numbers to text). 3. From corporate data to Internet/Web.

There’s a “seismic shift in intelligence.” “Temporal indexing of the Web enables Web intelligence.” “The Web is not organized for finding date; it’s about finding documents.” Can we create structure for the Web we can use for analysis? A lot of work has been done on this. Why is this possible now? Fast math, large, fast storage, web harvesting, and linguistic analysis progress.
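To illustrate what "temporal indexing" might mean at its very simplest, here is a toy sketch. It is not Recorded Future's method: the corpus, the function names, and the decision to recognize only ISO-style dates (YYYY-MM-DD) are all assumptions made for illustration. Real systems would need actual linguistic analysis to handle phrases like "next quarter."

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical toy corpus; real input would be harvested web text.
DOCS = {
    "doc-a": "Quarterly earnings call scheduled for 2012-10-26.",
    "doc-b": "Regulator decision expected 2012-11-15; earnings call on 2012-10-26.",
    "doc-c": "No dates mentioned here.",
}

# Assumption: only explicit ISO dates are recognized.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def build_temporal_index(docs):
    """Map each date mentioned in a document to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for year, month, day in ISO_DATE.findall(text):
            try:
                index[date(int(year), int(month), int(day))].add(doc_id)
            except ValueError:
                continue  # skip impossible dates such as 2012-13-45
    return index

def docs_between(index, start, end):
    """Return sorted ids of documents that mention a date in [start, end]."""
    hits = set()
    for day, doc_ids in index.items():
        if start <= day <= end:
            hits |= doc_ids
    return sorted(hits)

index = build_temporal_index(DOCS)
print(docs_between(index, date(2012, 10, 1), date(2012, 10, 31)))
# ['doc-a', 'doc-b']
```

The design choice worth noticing is that the index is keyed by the date a document talks about, not the date it was published. That is what lets you ask "what does the Web say will happen next month?" instead of "which documents contain this word?"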

His company looks for signals in human language. E.g., temporal signals. That can turn up competitive info. But human language is tough to deal with. But also when something happens, e.g., the Haitian earthquake, there are patterns in when people show up: helpers, doctors, military, do-gooder actors, etc. There tends to be a flood of notifications immediately afterwards. The Recorded Future platform does the linguistic analysis.

He gives an example: What’s going to happen to Merck over the next 90 days? Some is predictable: There will be a quarterly financial conference call. A key drug is up for approval. Can we look into the public conversations about these events, and might this guide our stock purchases? And beyond Merck, we could look at everything from cyber attacks to sales opportunities.

Some examples. 1. Monitoring unrest. Last week there were protests against Foxconn in China. Analysis of Chinese media shows that most of those protests were inland, while corporate expansion is coming in coastal areas. Or look at protests against pharmaceuticals for animal testing.

Example 2: Analyzing cyber threats. Hackers often try out an approach on a small scale and then go larger. This can give us warning.

Example 3: Competitive intelligence. When is there a free, announcement-free space when you can get some attention? Example 4: Lead generation. E.g., look for changes in management. (New marketing person might need a new PR agency.) Example 5: Trading patterns. E.g., if there’s bad news but insiders are buying.

Conclusion: As we move from small to large datasets, structured to unstructured, and from inside to outside the company, we go from surprise to foresight.

Q: What is the question you cannot answer?

A: The situations that have low frequency. It’s important that there be an opportunity for follow-up questions.

Q: What if you don’t know what the right question is?

A: When it’s unknown unknowns, you can’t ask the right question. But the great thing about visualization is that it helps people ask questions.

Q: How to distinguish fact from opinion on Twitter, etc.?

A: Or NYT vs. Financial Post. There isn’t a simple answer. We’re working toward being able to judge sources based on known outcomes.

Q: Do your predictions get more accurate the more data you have?

A: Generally yes, but it’s not always that simple.

