Joho the Blog » big data

May 26, 2013

[2b2k] Is big data degrading the integrity of science?

Amanda Alvarez has a provocative post at GigaOm:

There’s an epidemic going on in science: experiments that no one can reproduce, studies that have to be retracted, and the emergence of a lurking data reliability iceberg. The hunger for ever more novel and high-impact results that could lead to that coveted paper in a top-tier journal like Nature or Science is not dissimilar to the clickbait headlines and obsession with pageviews we see in modern journalism.

The article’s title points especially to “dodgy data,” and the item in that list by far the most interesting to me is the “data reliability iceberg” and its tie to the rise of Big Data. Amanda writes:

…unlike in science…, in big data accuracy is not as much of an issue. As my colleague Derrick Harris points out, for big data scientists the ability to churn through huge amounts of data very quickly is actually more important than complete accuracy. One reason for this is that they’re not dealing with, say, life-saving drug treatments, but with things like targeted advertising, where you don’t have to be 100 percent accurate. Big data scientists would rather be pointed in the right general direction faster — and course-correct as they go — than have to wait to be pointed in the exact right direction. This kind of error-tolerance has insidiously crept into science, too.

But, the rest of the article contains no evidence that the last sentence’s claim is true because of the rise of Big Data. In fact, even if we accept that science is facing a crisis of reliability, the article doesn’t pin this on an “iceberg” of bad data. Rather, it seems to be a melange of bad data, faulty software, unreliable equipment, poor methodology, undue haste, and o’erweening ambition.

The last part of the article draws some of the heat out of the initial paragraphs. For example: “Some see the phenomenon not as an epidemic but as a rash, a sign that the research ecosystem is getting healthier and more transparent.” It makes the headline and the first part seem a bit overstated — not unusual for a blog post (not that I would ever do such a thing!) but at best ironic given this post’s topic.

I remain interested in Amanda’s hypothesis. Is science getting sloppier with data?

Be the first to comment »

October 1, 2012

[sogeti] Andrew Keen on Vertigo and Big Data

Andrew Keen is speaking. (I liveblogged him this spring when he talked at a Sogeti conference.) His talk’s title: “How today’s online social revolution is dividing, diminishing, and disorienting us.” [Note: Posted without rereading because I'm about to talk. I may go back and do some cleanup.]

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Andrew opens with an anecdote. He grew up as a Jew in Britain. His siblings were split between becoming lawyers or doctors. But his mother asked him if he’d like to be the anti-Christ. So, now he’s grown up to become the anti-Christ of Silicon Valley.

“I’m not usually into intimacy,” but look at each other. How much do we know about each other? Not much. One of the great joys is getting to know one another. By 2017 there will be 15x more data flowing over the network. Billions of intelligent devices. “The world we are going into is one in which, in 20-25 years…you strangers will show up in a big city like London and you’ll know everything about each other.” You’ll know one another’s histories, interests…

“My argument is that we’re all stuck in Digital Vertigo. We’re all participants in a digital noir.” He shows a clip from Vertigo. “In the future these kinds of scenes won’t be possible. There won’t be private detectives…So this movie about the unfolding of understanding between strangers won’t happen.” What happens to policing? “Will we be guilty if we don’t carry our devices?” [SPOILERS] The blonde in this movie doesn’t exist. She’s a brunette shopgirl from Kansas. “The movie is about a deception…A classic Hitchcock narrative of falling in love with something that doesn’t exist. A good Catholic narrative…It’s a warning about falling in love with something that is too good to be true.” That’s what we’re doing with social media and big data. We’re told big data brings us together. They tell us the Net gives us the opportunity for human beings to come together, to realize themselves as social beings. Big data allows us to become human.

This is about more than the Net. The revolution that Carlota is talking about is one in which the Net becomes central in the way we live our lives. Fifteen years ago, Doc Searls, David W., and I would be marginal computer nerds, and now our books can be found in any book store. [Doc is in the audience also.]

He shows a clip from The Social Network: “We lived on farms. Now we’re going to live on the Internet.” It’s the platform of 21st century life. This is not a marginal or media issue. It is about the future of society. Many people think this network will solve the core problems of life. We now have an ecosystem of apps in the business of eliminating loneliness. E.g., Highlight, “the darling of the recent SxSW show.” They say it’s “a fun way to learn more about people nearby.” Then he shows a clip from The Truman Show. His point: We’re all in our own Truman Shows. The destruction of privacy. No difference between public and private. We’re being authentic. We’re knowingly involving ourselves in this.

A quote: “Vertigo is the ultimate critics’ film because it is a dreamlike film about people who are not sure who they are but who are busy reconstructing themselves and each other to a kind of cinema ideal of the ideal soul mate.” Substitute social media for film. We’re losing what it means to be human. We’re destroying the complexity of our inner lives. We’re only able to live externally. [This is what happens when your conceptual two poles are public and private. It changes when we introduce the term "social."]

Narcissism isn’t new. Digital narcissism has reached a climax. As we’re given personal broadcasting platforms, we’re increasingly deluded into thinking we’re interesting and important. Mostly it reveals our banality, our superficiality. [This is what you get when your conceptual poles are taken from broadcast media.]

It’s not just digital narcissism. “Visibility is a trap,” said Foucault. Hypervisibility is a hypertrap. Our data is central to Facebook and others becoming viable businesses. The issue is the business model. Data is oil, and it’s owned by the rich. Zuckerberg, Reid Hoffman, et al., are data barons. Read Susan Cain’s “Quiet”: introverts drive innovation. E.g., Steve Wozniak. Sharing is not good for innovation. Discourage your employees from talking with one another all the time. It makes them less thoughtful. It creates groupthink. If you want them to think for themselves, “take away their devices and put them in dark rooms.”

It’s also a trap when it comes to govt. Many govts are using the new tech to spy on their citizens. Cf. Bentham’s panopticon, which was corrupted into 1984 and industrial totalitarianism. We need to go back to the Industrial Age and JS Mill — Mill’s On Liberty is the best antidote to Bentham’s utilitarianism. [? I see more continuity than antidote.]

To build a civilized golden age: 1. There is a role for govt. The market needs regulation. 2. “I’m happy that the EU is working on this…and came out against FB facial recognition software. … We have a right to forget.” “It’s the most unhuman of things to remember everything.” “We shouldn’t idolize the never-forgetting nature of Big Data.” “To forget and forgive is the core essence of being human.” 3. We need better business models. We don’t want data to be the new oil. I want businesses that charge. “The free economy has been a catastrophe.”

He shows the end of The Truman Show. [SPOILER] As Truman enters reality, it’s a metaphor for our hope. We can only protect our humanness by retreating into dark, quiet places.

He finishes with a Vermeer that shows us a woman about whom we know nothing. In our Age of Facebook, we need to build a world in which the woman in blue can read that letter, not reveal herself, not reveal her mystery…

Q: You’re surprisingly optimistic today. In the movie Vertigo, there’s an inevitability. How about the inevitability of this social movement? Are you tilting at windmills?

Idealists tilt at windmills. People are coming around to understanding that the world we’re collectively creating is not quite right. It’s making people uneasy. More and more books, articles, etc., argue that FB is deeply exploitative. We’re all like Jimmy Stewart in Vertigo. The majority of people in the world don’t want to give away their data. As more of the traditional world comes onto the Net, there will be more resistance to collapsing the private and the public. Our current path is not inevitable. Tech is religion. Tech is not autonomous, not a first mover. We created Big Data and need to reestablish our domination over it. I’m cautiously optimistic. But it could go wrong, especially in authoritarian regimes. In Silicon Valley people say privacy is dead, get over it. But privacy is essential. Once we live this public ideal, then who are we?

Be the first to comment »

[2b2k][sogeti] Big Data conference session

I’m at Sogeti’s annual executive conference, which brings together about 80 CEOs. I’m here with Doc Searls, Andrew Keen, and others. I’ve spoken at other Sogeti events, and I am impressed with their commitment to providing contrary points of view — including views at odds with their own corporate interests. (My one complaint: They expect all attendees to have an iPad or iPhone so that they can participate in the realtime survey. Bad symbolism.) (Disclosure: They’re paying me to speak. They are not paying me to say something nice about them.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Menno van Doorn begins by talking about the quantified self movement, claiming that they sometimes refer to themselves as “datasexuals” :) All part of Big Data, he says. To give us an idea of bigness, he relates the Legend of Sessa: “Give me grain, doubling the amount for each square on a chessboard.” Exponential growth meant that by the time you hit the second half of the chessboard, you’re in impossible numbers. Experts say that’s where we were in 2006 when it comes to data. But “there’s no such thing as too much data.” “Big Data is powering the next industrial revolution. Data is the new oil.”
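The chessboard arithmetic is easy to check. Here is a quick Python sketch (mine, not from the talk) of how lopsided the doubling gets:

    # Legend of Sessa: one grain on the first square, doubling on each of the
    # chessboard's 64 squares.
    first_half = sum(2**i for i in range(32))       # squares 1-32
    second_half = sum(2**i for i in range(32, 64))  # squares 33-64

    print(f"first half:  {first_half:,} grains")    # ~4.3 billion
    print(f"second half: {second_half:,} grains")   # ~18 quintillion
    print(f"second half holds ~{second_half // first_half:,}x as much")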

Big Data is about (1) lots of data, (2) at high velocity, (3) used in a variety of ways. (“volume, velocity, variety.”) Michael Chui says that there are billions in revenue to gain, including from efficiencies. But, Chui says, there are no best practices. The value comes from “human exhaust.” I.e., your digital footprint, what you leave behind in your movement through the Net. Menno thinks of this as “your recorded future.”

Three examples:

1. Menno points to Target, a company that can predict life-changing events among its customers. E.g., based on purchases of 25 products, they can predict which customers are pregnant and roughly when they are due. But, this led to Target sending promotional materials for pregnancy to young girls whose parents learned this way that their daughters were pregnant.

2. In SF, they send out police cars to neighborhoods based on 14-day predictions of where crime will occur, based on data about prior crime patterns.

3. Schufa, a German credit agency, announced they’d use social media to assess your credit worthiness. Immediately a German Minister said, “Schufa cannot become the Big Brother of the business world.”

Two forces are in contention and will determine how much Big Data changes us. Today, the conference will look at the dawn of the age of big data, and then how disruptive it will be for society (the session Keen and I are in). Day 2: Bridging the gap to the new paradigm, Big Data’s fascinating future, and Decision Time: Taming Big Brother.

Carlota Perez, Prof. of Tech and Socio-Economic Development, from Venezuela speaks now. She is a “neo-Schumpeterian.” She says her role in the conference is to “locate the current crisis.” What is the real effect on innovation, and why are we only midway along in feeling the impact?

There have been 5 tech revolutions in the past 240 years: 1. 1771 Industrial rev. 2. 1829 Age of steam, coal and railways. 3. 1875 Steel and heavy engineering (the first globalization). 4. Age of the automobile, oil, petrochem and mass production. 5. 1971 Age of info tech and telecom. We’re only halfway through that last one. The next revolution queued up: age of biotech, bioelectronics, nanotech, and new materials. [I'm surprised she doesn't count telegraph + radio + telephone, etc., as a comms rev. And I'd separate the Net as its own rev. But that's me.]

Lifecycle of a tech rev: gestation, installation, deployment, exhaustion. The “big bang” tends to happen when the prior rev is reaching exhaustion. The structure of revs: new cheap inputs, new products, new processes. A new infrastructure arises. And a constellation of new dynamic industries that grow the world economy.

Why call these “revolutions”, she asks? Because they transform the whole economy. They bring new organizational principles and new best practice models. I.e., a new “techno-economic paradigm.” E.g., we’ve gone from mass production to flexible production. Closed pyramids to open networks. Stable routines to continuous improvement. “Information technology finds change natural.” From human resources to human capital (from raw materials to value). Suppliers and clients to value network partners. Fixed plans to flexible strategies. Three-tier markets (big, medium, small) to hyper-segmented markets. Internationalization to globalization. Information as costly burden to info as asset. Together, these constitute a radical change in managerial common sense.

The diffusion process is broken in two: Bubble, followed by a crash, and then the Golden Age. During the bubble, financial capital forces diffusion. There is income and demand polarization. Then the crash. Then there is an institutional recomposition, leading to a golden age in which everyone benefits. Production capital takes over from financial capital (driven by the govt), and there is better distribution of income and demand.

She looks at the 5 revs, and finds the same historic pattern that she just sketched.

Two major differences between installation and deployment: 1. Bubbles vs. patient (= long-term) capital. 2. Concentrated innovation to modernize industries vs. innovation in all industries that use the new technologies. “Understanding this sequence is essential for strategic thinking.”

The structure of innovation in deployment: a new coherent fabric of the economy emerges, leading to a golden age. Also, oligopolies emerge, which means there’s less unhelpful competition. (?)

Example of prior rev: home electrical appliances: In the installation period, we had a bunch of electric utilities going into homes in the 1910s and 1930s. During the recession, we get a few more. But then in the 1950s-70s we get a surge of new appliances, including the tape recorder, the microwave, even the electric toothbrush. It’s enabled by universal electricity and driven by suburbanization. It’s the same pattern if you look at textile fibers, from rayon and acetate during installation, to a huge number during deployment. E.g., structural and packaging plastics: installation brought bakelite, polystyrene and polyethylene, and then a flood of innovation during deployment. “The various systems of the ICT revolution will follow a similar sequence.” [Unless it follows the Tim Wu pattern of consolidation — e.g., everyone being required to use an iPad at a conference] During the installation period, ICT was in constant supply push mode. Now it must respond to demand pull. “The paradigm and its potential are now understood by all. Demand (in vol and nature) becomes the driving force.”

This shifts the role of the CIO. To modernize a mature company during installation, you brought in an expert in modernization, articulating the hw and sw being pushed by the suppliers. During the deployment phase, in a modern company innovating for strategic expansion, the CIO is an expert in strategy, specifying needs and working with suppliers. “The CIO is no longer staff. S/he must be directly involved in strategy.”

There are 3 main forces for innovation in the next 2-3 decades, as is true for all the revs. 1. Deepening and widening of the ICT tech rev, responding to user needs. 2. The users of ICT across all industries and activities. 3. The gestation of the next rev (probably biotech, nanotech, and new materials).

Big Data is likely to have a big role in each of those directions.

Q: Why are we only 50% of the way through?

A: Because the change after the recession is like opening a dam. Once you get to the point where you can have a comfortable innovation prospective, imagine the market possibilities.

Q: What can go wrong?

A: Governments. Unfettered free markets are indispensable for the installation process. Lightly guided markets are needed in the golden age. Free markets work when you need to force everyone to change. But now no longer: the state has to come in. But govts are drunk with free markets. Now finance is incompetent. “They don’t dare invest in real things.” Ideology is so strong and the understanding of history is so shallow that we’re not doing the right thing.

Christopher Ahlberg speaks now. He’s the founder of Recorded Future. His topic: “Turning the Web into Predictive Signals.”

We see events like Arab Spring and wonder if we could have predicted them. Three things are going on: 1. Moving from smaller to larger datasets. 2. From structured to unstructured data (from numbers to text). 3. From corporate data to Internet/Web.

There’s a “seismic shift in intelligence”: “Temporal indexing of the Web enables Web intelligence.” “The Web is not organized for finding data; it’s about finding documents.” Can we create structure for the Web that we can use for analysis? A lot of work has been done on this. Why is this possible now? Fast math; large, fast storage; web harvesting; and progress in linguistic analysis.

His company looks for signals in human language. E.g., temporal signals. That can turn up competitive info. Human language is tough to deal with. But also when something happens — e.g., the Haitian earthquake — there are patterns in when people show up: helpers, doctors, military, do-gooder actors, etc. There tends to be a flood of notifications immediately afterwards. The Recorded Future platform does the linguistic analysis.
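As a rough illustration of what mining temporal signals out of text might involve, here is a toy Python sketch (mine, not Recorded Future’s platform): pull explicit time expressions out of sentences so that documents can be indexed by when events are expected. The sample sentences are invented.

    import re

    docs = [
        "Acme says its new chip will ship in Q3 2013.",
        "The drug is scheduled for a regulatory decision on 2013-06-15.",
        "Protests continued in the capital yesterday.",
    ]

    # Toy pattern for two kinds of explicit dates; a real system uses full
    # linguistic analysis, not a regex.
    time_pattern = re.compile(r"\b(Q[1-4] \d{4}|\d{4}-\d{2}-\d{2})\b")

    for doc in docs:
        hits = time_pattern.findall(doc)
        if hits:
            print(hits, "<-", doc)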

He gives an example: What’s going to happen to Merck over the next 90 days? Some of it is predictable: There will be a quarterly financial conference call. A key drug is up for approval. Can we look into the public conversations about these events, and might this guide our stock purchases? And beyond Merck, we could look at everything from cyber attacks to sales opportunities.

Some examples. 1. Monitoring unrest. Last week there were protests against Foxconn in China. Analysis of Chinese media shows that most of those protests were inland, while corporate expansion is coming in coastal areas. Or look at protests against pharmaceuticals for animal testing.

Example 2: Analyzing cyber threats. Hackers often try out an approach on a small scale and then go larger. This can give us warning.

Example 3: Competitive intelligence. When is there a free space — announcement-free — when you can get some attention? Example 4: Lead generation. E.g., look for changes in management. (New marketing person might need a new PR agency.) Example 5: Trading patterns. E.g., if there’s bad news but insiders are buying.

Conclusion: As we move from small to large datasets, structured to unstructured, and from inside to outside the company, we go from surprise to foresight.

Q: What is the question you cannot answer?

A: The situations that have low frequency. It’s important that there be an opportunity for follow-up questions.

Q: What if you don’t know what the right question is?

A: When it’s unknown unknowns, you can’t ask the right question. But the great thing about visualization is that it helps people ask questions.

Q: How to distinguish fact from opinion on Twitter, etc.?

A: Or NYT vs. Financial Post. There isn’t a simple answer. We’re working toward being able to judge sources based on known outcomes.

Q: Do your predictions get more accurate the more data you have?

A: Generally yes, but it’s not always that simple.

Be the first to comment »

August 27, 2012

Big Data on broadband

Google commissioned the compiling of

an international dataset of retail broadband Internet connectivity prices. The result was an international dataset of 3,655 fixed and mobile broadband retail price observations, with fixed broadband pricing data for 93 countries and mobile broadband pricing data for 106 countries. The dataset can be used to make international comparisons and evaluate the efficacy of particular public policies—e.g., direct regulation and oversight of Internet peering and termination charges—on consumer prices.

The links are here. WARNING: a knowledgeable friend of mine says that he has already found numerous errors in the data, so use them with caution.
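If you do dig into it, here is a minimal pandas sketch (mine; the file name and column names such as "country", "service", and "monthly_price_usd" are assumptions, not the dataset’s actual schema) of the kind of international comparison the description has in mind:

    import pandas as pd

    # Assumed local copy and assumed column names; check the published
    # files' real schema (and the errors noted above) before relying on it.
    df = pd.read_csv("broadband_prices.csv")

    fixed = df[df["service"] == "fixed"]
    by_country = (fixed.groupby("country")["monthly_price_usd"]
                       .median()
                       .sort_values())

    print(by_country.head(10))  # ten cheapest countries by median fixed price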

Be the first to comment »

July 7, 2012

[2b2k] Big Data needs Big Pipes

A post by Stacy Higginbotham at GigaOm talks about the problems moving Big Data across the Net so that it can be processed. She draws on an article by Mari Silbey at SmartPlanet. Mari’s example is a telescope being built on Cerro Pachon, a mountain in Chile, that will ship many high-resolution sky photos every day to processing centers in the US.

Stacy discusses several high-speed networks, and the possibility of compressing the data in clever ways. But a person on a mailing list I’m on (who wishes to remain anonymous) pointed to GLIF, the Global Lambda Integrated Facility, which rather surprisingly is not a cover name for a nefarious organization out to slice James Bond in two with a high-energy laser pointer.

The title of its “informational brochure” [pdf] is “Connecting research worldwide with lightpaths,” which helps some. It explains:

GLIF makes use of the cost and capacity advantages offered by optical multiplexing, in order to build an infrastructure that can take advantage of various processing, storage and instrumentation facilities around the world. The aim is to encourage the shared use of resources by eliminating the traditional performance bottlenecks caused by a lack of network capacity.

Multiplexing is the carrying of multiple signals at different wavelengths on a single optical fiber. And these wavelengths are known as … wait for it … lambdas. Boom!

My mailing list buddy says that GLIF provides “100 gigabit optical waves”, which compares favorably to your pathetic earthling (um, American) 3-20 megabit broadband connection (maybe 50Mb if you have FiOS), and he notes that GLIF is available in Chile.
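To put those numbers in perspective, here is a quick back-of-the-envelope Python sketch (mine; the one-terabyte figure is an invented illustration, not the telescope’s actual nightly output):

    TERABYTE_BITS = 8 * 10**12  # one terabyte, in bits

    links_bps = {
        "US cable (20 Mbit/s)": 20e6,
        "FiOS (50 Mbit/s)": 50e6,
        "GLIF lightpath (100 Gbit/s)": 100e9,
    }

    for name, bps in links_bps.items():
        hours = TERABYTE_BITS / bps / 3600
        print(f"{name:28s} {hours:8.2f} hours per terabyte")

At 20 Mbit/s a single terabyte ties up the link for more than 100 hours; over a 100 Gbit/s lightpath it takes a bit over a minute.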

To sum up: 1. Moving Big Data is an issue. 2. We are not at the end of innovating. 3. The bandwidth we think of as “high” in the US is a miserable joke.


By the way, you can hear an uncut interview about Big Data I did a few days ago for Breitband, a German radio program that edited, translated, and broadcast it.

2 Comments »

March 31, 2012

[2b2k] The commoditizing and networking of facts

Ars Technica has a post about Wikidata, a proposed new project from the folks that brought you Wikipedia. From the project’s introductory page:

Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today.

Because I had some questions not addressed in the Wikidata pages that I saw, I went onto the Wikidata IRC chat (http://webchat.freenode.net/?channels=#wikimedia-wikidata) where Denny_WMDE answered some questions for me.

[11:29] hi. I’m very interested in wikidata and am trying to write a brief blog post, and have a n00b question.

[11:29] go ahead!

[11:30] When there’s disagreement about a fact, will there be a discussion page where the differences can be worked through in public?

[11:30] two-fold answer

[11:30] 1. there will be a discussion page, yes

[11:31] 2. every fact can always have references accompanying it. so it is not about “does berlin really have 3.5 mio people” but about “does source X say that berlin has 3.5 mio people”

[11:31] wikidata is not about truth

[11:31] but about referenceable facts

When I asked which fact would make it into an article’s info box when the facts are contested, Denny_WMDE replied that they’re working on this, and will post a proposal for discussion.

So, on the one hand, Wikidata is further commoditizing facts: making them easier and thus less expensive to find and “consume.” Historically, this is a good thing. Literacy did this. Tables of logarithms did it. Almanacs did it. Wikipedia has commoditized a level of knowledge one up from facts. Now Wikidata is doing it for facts in a way that not only will make them easy to look up, but will enable them to serve as data in computational quests, such as finding every city with a population of at least 100,000 that has an average temperature below 60F.
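To make the “computational quests” point concrete, here is a toy Python sketch (mine; the records are invented, and this is not Wikidata’s actual data model or API) of that kind of query, with each fact carrying its own reference:

    # Invented records standing in for structured, referenced facts.
    cities = [
        {"name": "Exampleville", "population": 250_000, "avg_temp_f": 55,
         "refs": {"population": "http://example.org/census"}},
        {"name": "Warmton", "population": 180_000, "avg_temp_f": 68,
         "refs": {"population": "http://example.org/almanac"}},
        {"name": "Smallburg", "population": 40_000, "avg_temp_f": 52,
         "refs": {"population": "http://example.org/census"}},
    ]

    matches = [c for c in cities
               if c["population"] >= 100_000 and c["avg_temp_f"] < 60]

    for c in matches:
        # Every answer can point back at the source that asserted the fact.
        print(c["name"], "- population per", c["refs"]["population"])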

On the other hand, because Wikidata is doing this commoditizing in a networked space, its facts are themselves links — “referenceable facts” are both facts that can be referenced, and simultaneously facts that come with links to their own references. This is what Too Big to Know calls “networked facts.” Those references serve at least three purposes: 1. They let us judge the reliability of the fact. 2. They give us a pointer out into the endless web of facts and references. 3. They remind us that facts are not where the human responsibility for truth ends.

4 Comments »

February 3, 2012

[tech@state][2b2k] Real-time awareness

At the Tech@State conf, a panel is starting up. Participants: Linton Wells (National Defense U), Robert Bectel (CTO, Office of Energy Efficiency), Robert Kirkpatrick (Dir., UN Global Pulse), Ahmed Al Omran (NPR and Saudi blogger), and Clark Freifeld (HealthMap.org).

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Robert Bectel brought in Netvibes.com [I use NetVibes as my morning newspaper.] to bring real-time info to his group’s desktop. It’s customized to who they are and what they do. They use Netvibes as their portal. They bring in streaming content, including YouTube and Twitter. What happens when people get too much info? So, they’re building analytics so people get info summarized in bar charts, etc. Even video analytics, analyzing video content. They asked what people wanted and built a food cart tracker. Or the shuttle bus. Widgets bring functionality within the window. They’re working on single sign-on. There’s some gamification. They plan on adding doc mgt, SharePoint access, links to Federal Social Network.

Even better, he says, is that the public now can get access to the “wicked science” the DOE does. Make the data available. Go to IMBY, put in your zip code, and it will tell you what your solar resource potential is and the tax breaks you’ll get. “We’re going to put that in your phone.” “We’re creating leads for solar installers.” And geothermal heat pumps.

Robert Kirkpatrick works in the UN Secretary General’s office, in a unit called Global Pulse, which is an R&D lab trying to learn to take advantage of Big Data to improve human welfare. Now “We’re swimming in an ocean of real time data.” This data is generated passively and actively. If you look at what people say to one another and what people actually do, “we have the opportunity to look at these as sensor networks.” Businesses have been doing this for a long time. Can we begin to look at the patterns of data when people lose their job, get sick, pull their kids out of school to make ends meet? What patterns appear when our programs are working? Global Pulse is working with the private sector as well. Robert hopes that big data and real-time awareness will enable them to move from waterfall development (staged, slow) to agile (iterative, fast).

Ahmed Al Omran says last year was a moment he and so many in the Middle East had been hoping for. He started blogging (SaudiJeans) seven years ago, even though the gov’t tried to silence him. “I wasn’t afraid because I knew I wasn’t alone.” He was part of a network of activists. Arab Spring did not happen overnight. “Activists and bloggers had been working together for ten years to make it happen.” “There’s no question in my mind that the Internet and social media played a huge role in what happened.” But there is much debate. E.g., Malcolm Gladwell argued that these revolutions would have happened anyway. But no one debates whether the Net changed how journalists covered the story. E.g., Andy Carvin live-tweeted the revolutions (aggregating and disseminating). Others, too. On Feb. 2, 2011, Andy tweeted 1,400 times over 20 hours.

So, do we call this journalism? Probably. It’s a real-time news gathering operation happening in an open source newsroom. “The people who follow us are not our audience. They are part of an open newsroom. They are potential sources and fact-checkers.” E.g., the media carried a story during the war in Libya that the Libyan forces were using Israeli weapons. Andy and his followers debunked that in real time.

There is still a lot of work to do, he says.

Clark Freifeld is a cofounder of HealthMap, doing real time infectious disease tracking. He shows a chart of the stock price of a Chinese pharma that makes a product that’s believed to have antiviral properties. In Jan 2003, there was an uptick because of the beginning of SARS, which was not identified until Feb 2003. In traditional public health reporting, there’s a hierarchy. In the new model, the connections are much flatter. And there are many more sources of info, from tweets that are fast but tend to have more noise, to slower but more validated sources.

To organize the info better, in 2006 they created a real-time mapping dashboard (free and open to the public). They collect 2000 reports a day, geotagged to 10,000 locations. They use named entity extraction to find diseases and locations. A Bayesian filtering system categorizes reports with 91% accuracy. They assign significance to each event. The ones that make it through this filter make it to the map. Humans help to train the system.
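For a sense of what that kind of Bayesian filtering looks like, here is a minimal sketch (mine, not HealthMap’s actual pipeline), assuming scikit-learn is available:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny invented training set; the real system trains on thousands of
    # labeled reports pulled from its sources.
    reports = [
        "cluster of avian influenza cases reported among poultry workers",
        "hospital confirms three new cholera infections after flooding",
        "city council debates new parking regulations downtown",
        "local team wins regional football championship",
    ]
    labels = ["outbreak", "outbreak", "other", "other"]

    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(reports, labels)

    print(classifier.predict(["suspected dengue outbreak near the river district"]))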

During the H1N1 outbreak, they decided to create participatory epidemiology. They launched an iPhone app called “Outbreaks Near Me” which let people submit reports as well as get alerts; it became the #1 health and fitness app. They found that the rate of submissions tracked well with the CDC’s info. Also FluNearYou.org.

Linton Wells now moderates a discussion:

Robert Bectel: DOE is getting a serious fire hose of info from the grid, and they don’t yet know what to do with it. So they’re thinking about releasing the 89B data points and asking the public what they want to do with it.

Robert Kirkpatrick: You need the wisdom of crowds, the instinct of experts, and the power of algorithms [quoting someone I missed]. And this flood of info is no longer a one-way stream; it’s interactive.

Ahmed: It helps to have people who speak the language and know the culture. But tech counts too: How about a Twitter client that can detect tweets coming from a particular location? It’s a combo of both.

Clark: We use this combined approach. One initiative we’re working on builds on our smartphone app by letting us push questions out to people in a location where we have a suspicion that something is happening.

Linton: Security and verification?

Robert K: Info can be exploited, so this year we’re bringing together advisers on privacy and security.

Ahmed: People ask how you can trust random people to tell the truth, but many of them are well known to us. We use standard tools of trust, and we’ll also see who they’re following on Twitter, who’s following them, etc. It’s real-time verification.

Clark: In public health, the ability to get info is much better with an open Net than the old hierarchical flow of info.

Q: Are people trying to game the system?
A: Ahmed: Sure. GayGirlInDamascus turned out to be a guy in Moscow. But using the very same tools we managed to figure out who he was. But gov’ts will always try to push back. The gov’ts in Syria and Bahrain hired people to go online to change the narrative and discredit people. It’s always a challenge to figure out what’s the truth. But if you’ve worked in the field for a while, you can identify trusted sources. We call this “news sense.”
A: Clark: Not so much in public health. When there have been imposters and liars, there’s been a rapid debunking using the same tools.

Q:What incentives can we give for opening up corporate data?
A: Robert K: We call this data philanthropy but the private sector doesn’t see it that way. They don’t want their markets to fall into poverty; it’s business risk mitigation insurance. So there are some incentives there already.
A: Robert B: We need to make it possible for people to create apps that use the data.

Q: How about that Twitter censorship policy?
A: Ahmed: It’s censorship, but the way Twitter approached this was transparent, and some people say it is good for activists because Twitter could have gone for a broader censorship policy; instead, Twitter will only block in the country that demands it. In fact, Twitter lets you get around it by changing your location.

Q: How do we get Netvibes past the security concerns?
A: Robert B.: I’m a security geek. But employees need tools to be smarter. But we can define what tools you have access to.

Q: Clark, do you run into privacy issues?
A: Clark: Most of the data in HealthMap comes from publicly available sources.
A: Robert K: There are situations arising for which we do not have a framework. A child protection expert had just returned from a crisis where young kids on a street were tweeting about being abused at home. “We’re not even allowed to ask that question,” she said, “but if they’re telling the entire world, can we use that to begin to advocate for their rescue?” Our frameworks have not yet adapted to this new reality.

Linton: After the Arab Spring, how do we use data to help build enduring value?
A: Ahmed: It’s not the tech but how we use it.
A: Robert K: Real time analytics and visualizations provide many-to-many communications. Groups can see their beliefs, enabling a type of self-awareness not possible before. These tools have the possibility of creating new types of identity.
A: Robert B: To get twitter or Facebook smarter, you have to find different ways to use it. “Break it!” Don’t get stuck using today’s tech.

Linton: A 26-year-old Al Jazeera reporter at a conf was asked, “What’s the next big thing?” She replied, “I’m too old. Ask a high school student.”

Be the first to comment »

January 29, 2012

[2b2k] Big data, big apps

From Gigaom, five apps that could change Big Data.

Be the first to comment »

April 21, 2011

Big Data Models: Help me crowdsource sources

I’m thrilled that I’m going to be writing an article for Scientific American on big data models — models that cover some huge swath of life, such as the economy, the climate, sociopolitical change, etc. What’s the promise and what are the challenges? How far can such models scale?

So, who do you think I should interview? What projects strike you as particularly illuminating? Let me know in the comments, or at selfevident.com.


Thanks!

3 Comments »

March 4, 2011

[2b2k] Tagging big data

According to an article in Science Insider by Dennis Normile, a group formed at a symposium sponsored by the Board on Global Science and Technology, of the National Research Council, an arm of the U.S. National Academies [that's all they've got??] is proposing making it easier to find big scientific data sets by using a standard tag, along with a standard way of conveying the basic info about the nature of the set, and its terms of use. “The group hopes to come up with a protocol within a year that researchers creating large data sets will voluntarily adopt. The group may also seek the endorsement of the Internet Engineering Task Force…”

2 Comments »
