
August 27, 2012

Big Data on broadband

Google commissioned the compiling of

an international dataset of retail broadband Internet connectivity prices. The result was an international dataset of 3,655 fixed and mobile broadband retail price observations, with fixed broadband pricing data for 93 countries and mobile broadband pricing data for 106 countries. The dataset can be used to make international comparisons and evaluate the efficacy of particular public policies—e.g., direct regulation and oversight of Internet peering and termination charges—on consumer prices.

The links are here. WARNING: a knowledgeable friend of mine says that he has already found numerous errors in the data, so use them with caution.
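If you do want to poke at the data anyway, here is a minimal sketch of the kind of comparison the dataset is meant to support. The file name and column names below are hypothetical stand-ins (I have not checked the actual schema), so adjust them to whatever the published spreadsheets use:

```python
import pandas as pd

# Hypothetical file and column names -- substitute the dataset's real ones.
prices = pd.read_csv("broadband_prices.csv")

# Median monthly price of fixed-broadband plans, by country.
fixed = prices[prices["service_type"] == "fixed"]
print(
    fixed.groupby("country")["monthly_price_usd"]
    .median()
    .sort_values()
    .head(10)
)
```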


Categories: broadband, too big to know Tagged with: 2b2k • big data • broadband • google Date: August 27th, 2012 dw


July 7, 2012

[2b2k] Big Data needs Big Pipes

A post by Stacey Higginbotham at GigaOm talks about the problems of moving Big Data across the Net so that it can be processed. She draws on an article by Mari Silbey at SmartPlanet. Mari’s example is a telescope being built on Cerro Pachón, a mountain in Chile, that will ship many high-resolution sky photos every day to processing centers in the US.

Stacey discusses several high-speed networks, and the possibility of compressing the data in clever ways. But a person on a mailing list I’m on (who wishes to remain anonymous) pointed to GLIF, the Global Lambda Integrated Facility, which rather surprisingly is not a cover name for a nefarious organization out to slice James Bond in two with a high-energy laser pointer.

The title of its “informational brochure” [pdf] is “Connecting research worldwide with lightpaths,” which helps some. It explains:

GLIF makes use of the cost and capacity advantages offered by optical multiplexing, in order to build an infrastructure that can take advantage of various processing, storage and instrumentation facilities around the world. The aim is to encourage the shared use of resources by eliminating the traditional performance bottlenecks caused by a lack of network capacity.

Multiplexing is the carrying of multiple signals at different wavelengths on a single optical fiber. And these wavelengths are known as … wait for it … lambdas. Boom!

My mailing list buddy says that GLIF provides “100 gigabit optical waves”, which compares favorably to your pathetic earthling (um, American) 3-20 megabit broadband connection (maybe 50 Mbps if you have FiOS), and he notes that GLIF is available in Chile.
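To make the gap concrete, here’s some back-of-the-envelope arithmetic. The 10 TB/night figure is just an illustrative assumption about the telescope’s output, not a spec:

```python
# How long would one night's worth of images take to ship at various speeds?
# 10 TB/night is an illustrative assumption, not the telescope's actual output.
dataset_bits = 10e12 * 8  # 10 terabytes expressed in bits

links = [
    ("5 Mbps (typical US broadband)", 5e6),
    ("50 Mbps (FiOS-ish)", 50e6),
    ("100 Gbps (a GLIF lightpath)", 100e9),
]

for label, bits_per_second in links:
    hours = dataset_bits / bits_per_second / 3600
    print(f"{label}: {hours:,.1f} hours")
```

At 5 Mbps that hypothetical night of data takes roughly half a year of continuous transfer; at 100 Gbps it takes about 13 minutes.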

To sum up: 1. Moving Big Data is an issue. 2. We are not at the end of innovating. 3. The bandwidth we think of as “high” in the US is a miserable joke.


By the way, you can hear an uncut interview about Big Data I did a few days ago for Breitband, a German radio program that edited, translated, and broadcast it.


Categories: broadband, science, too big to know Tagged with: 2b2k • big data • broadband Date: July 7th, 2012 dw


March 31, 2012

[2b2k] The commoditizing and networking of facts

Ars Technica has a post about Wikidata, a proposed new project from the folks that brought you Wikipedia. From the project’s introductory page:

Many Wikipedia articles contain facts and connections to other articles that are not easily understood by a computer, like the population of a country or the place of birth of an actor. In Wikidata you will be able to enter that information in a way that makes it processable by the computer. This means that the machine can provide it in different languages, use it to create overviews of such data, like lists or charts, or answer questions that can hardly be answered automatically today.

Because I had some questions not addressed in the Wikidata pages that I saw, I went onto the Wikidata IRC chat (http://webchat.freenode.net/?channels=#wikimedia-wikidata) where Denny_WMDE answered some questions for me.

[11:29] Me: hi. I’m very interested in wikidata and am trying to write a brief blog post, and have a n00b question.

[11:29] Denny_WMDE: go ahead!

[11:30] Me: When there’s disagreement about a fact, will there be a discussion page where the differences can be worked through in public?

[11:30] Denny_WMDE: two-fold answer

[11:30] Denny_WMDE: 1. there will be a discussion page, yes

[11:31] Denny_WMDE: 2. every fact can always have references accompanying it. so it is not about "does berlin really have 3.5 mio people" but about "does source X say that berlin has 3.5 mio people"

[11:31] Denny_WMDE: wikidata is not about truth

[11:31] Denny_WMDE: but about referenceable facts
When I asked which fact would make it into an article’s info box when the facts are contested, Denny_WMDE replied that they’re working on this, and will post a proposal for discussion.

So, on the one hand, Wikidata is further commoditizing facts: making them easier and thus less expensive to find and “consume.” Historically, this is a good thing. Literacy did this. Tables of logarithms did it. Almanacs did it. Wikipedia has commoditized a level of knowledge one up from facts. Now Wikidata is doing it for facts in a way that not only will make them easy to look up, but will enable them to serve as data in computational quests, such as finding every city with a population of at least 100,000 that has an average temperature below 60F.
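For a sense of what such a quest looks like against Wikidata’s public SPARQL endpoint (query.wikidata.org), here is a minimal sketch in Python. The property and item IDs (P31 instance-of, P279 subclass-of, P1082 population, Q515 city) are Wikidata’s, but treat the query shape as illustrative; average temperature isn’t reliably stored, so this handles only the population half of the example:

```python
import requests

# Illustrative query: cities (Q515, via P31/P279*) with a population (P1082)
# of at least 100,000. The average-temperature half of the example is omitted.
QUERY = """
SELECT ?city ?cityLabel ?population WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 ;
        wdt:P1082 ?population .
  FILTER(?population >= 100000)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "2b2k-example/0.1 (illustrative sketch)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["cityLabel"]["value"], row["population"]["value"])
```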

On the other hand, because Wikidata is doing this commoditizing in a networked space, its facts are themselves links — “referenceable facts” are both facts that can be referenced, and simultaneously facts that come with links to their own references. This is what Too Big to Know calls “networked facts.” Those references serve at least three purposes: 1. They let us judge the reliability of the fact. 2. They give us a pointer out into the endless web of facts and references. 3. They remind us that facts are not where the human responsibility for truth ends.


Categories: experts, too big to know Tagged with: 2b2k • big data • facts • wikidata • wikipedia Date: March 31st, 2012 dw


February 3, 2012

[tech@state][2b2k] Real-time awareness

At the Tech@State conf, a panel is starting up. Participants: Linton Wells (National Defense U), Robert Bectel (CTO, Office of Energy Efficiency), Robert Kirkpatrick (Dir., UN Global Pulse), Ahmed Al Omran (NPR and Saudi blogger), and Clark Freifeld (HealthMap.org).

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellchecker. Mangling other people’s ideas and words. You are warned, people.

Robert Bectel brought in Netvibes.com [I use NetVibes as my morning newspaper.] to bring real-time info to his group’s desktop. It’s customized to who they are and what they do. They use Netvibes as their portal. They bring in streaming content, including YouTube and Twitter. What happens when people get too much info? So, they’re building analytics so people get info summarized in bar charts, etc. Even video analytics, analyzing video content. They asked what people wanted and built a food cart tracker, and a shuttle bus tracker. Widgets bring functionality within the window. They’re working on single sign-on. There’s some gamification. They plan on adding doc mgt, SharePoint access, and links to the Federal Social Network.

Even better, he says, is that the public now can get access to the “wicked science” the DOE does. Make the data available. Go to IMBY, put in your zip code, and it will tell you what your solar resource potential is and the tax breaks you’ll get. “We’re going to put that in your phone.” “We’re creating leads for solar installers.” And geothermal heat pumps.

Robert Kirkpatrick works in the UN Sect’y Gen’l’s office, in an R&D lab called Global Pulse that is trying to learn to take advantage of Big Data to improve human welfare. Now “We’re swimming in an ocean of real time data.” This data is generated passively and actively. If you look at what people say to one another and what people actually do, “we have the opportunity to look at these as sensor networks.” Businesses have been doing this for a long time. Can we begin to look at the patterns of data when people lose their jobs, get sick, or pull their kids out of school to make ends meet? What patterns appear when our programs are working? Global Pulse is working with the private sector as well. Robert hopes that big data and real-time awareness will enable them to move from waterfall development (staged, slow) to agile (iterative, fast).

Ahmed Al Omran says last year was a moment he and so many in the Middle East had been hoping for. He started blogging (SaudiJeans) seven years ago, even though the gov’t tried to silence him. “I wasn’t afraid because I knew I wasn’t alone.” He was part of a network of activists. Arab Spring did not happen overnight. “Activists and bloggers had been working together for ten years to make it happen.” “There’s no question in my mind that the Internet and social media played a huge role in what happened.” But there is much debate. E.g., Malcolm Gladwell argued that these revolutions would have happened anyway. But no one debates whether the Net changed how journalists covered the story. E.g., Andy Carvin live-tweeted the revolutions (aggregating and disseminating). Others, too. On Feb. 2, 2011, Andy tweeted 1,400 times over 20 hours.

So, do we call this journalism? Probably. It’s a real-time news gathering operation happening in an open source newsroom. “The people who follow us are not our audience. They are part of an open newsroom. They are potential sources and fact-checkers.” E.g., the media carried a story during the war in Libya that the Libyan forces were using Israeli weapons. Andy and his followers debunked that in real time.

There is still a lot of work to do, he says.

Clark Freifeld is a cofounder of HealthMap, doing real-time infectious disease tracking. He shows a chart of the stock price of a Chinese pharma company that makes a product believed to have antiviral properties. In Jan 2003, there was an uptick because of the beginning of SARS, which was not identified until Feb 2003. In traditional public health reporting, there’s a hierarchy. In the new model, the connections are much flatter. And there are many more sources of info, ranging from tweets, which are fast but tend to have more noise, to slower but more validated sources.

To organize the info better, in 2006 they created a real-time mapping dashboard (free and open to the public). They collect 2,000 reports a day, geotagged to 10,000 locations. They use named entity extraction to find diseases and locations. A Bayesian filtering system categorizes reports with 91% accuracy. They assign significance to each event. The ones that make it through this filter make it to the map. Humans help to train the system.
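HealthMap’s actual pipeline isn’t described in any detail here, but to give a flavor of that Bayesian filtering step, here is a minimal naive Bayes sketch in Python (scikit-learn) that sorts report snippets into categories. The training examples and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; a real system would train on thousands of
# curated reports and far richer features.
reports = [
    "officials confirm cholera outbreak in coastal district",
    "hospital reports spike in influenza-like illness",
    "city council debates new parking regulations",
    "local team wins regional football championship",
]
labels = ["outbreak", "outbreak", "not_outbreak", "not_outbreak"]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(reports, labels)

new_report = "suspected measles cases reported at rural clinic"
print(classifier.predict([new_report])[0])            # predicted category
print(classifier.predict_proba([new_report]).max())   # confidence, for thresholding
```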

During the H1N1 outbreak, they decided to create participatory epidemiology. They launched an iPhone app called “Outbreaks Near Me” which let people submit reports as well as get alerts, and which became the #1 health and fitness app. They found that the rate of submissions tracked well with the CDC’s info. See also FluNearYou.org.

Linton Wells now moderates a discussion:

Robert Bectel: DOE is getting a serious fire hose of info from the grid, and they don’t yet know what to do with it. So they’re thinking about releasing the 89B data points and asking the public what they want to do with it.

Robert Kirkpatrick: You need the wisdom of crowds, the instinct of experts, and the power of algorithms [quoting someone I missed]. And this flood of info is no longer a one-way stream; it’s interactive.

Ahmed: It helps to have people who speak the language and know the culture. But tech counts too: how about a Twitter client that can detect tweets coming from a particular location? It’s a combo of both.

Clark: We use this combined approach. One initiative we’re working on builds on our smartphone app by letting us push questions out to people in a location where we have a suspicion that something is happening.

Linton: Security and verification?

Robert K: Info can be exploited, so this year we’re bringing together advisers on privacy and security.

Ahmed: People ask how you can trust random people to tell the truth, but many of them are well known to us. We use standard tools of trust, and we’ll also see who they’re following on Twitter, who’s following them, etc. It’s real-time verification.

Clark: In public health, the ability to get info is much better with an open Net than the old hierarchical flow of info.

Q: Are people trying to game the system?
A: Ahmed: Sure. GayGirlInDamascus turned out to be a guy in Moscow. But using the very same tools we managed to figure out who he was. But gov’ts will always try to push back. The gov’ts in Syria and Bahrain hired people to go online to change the narrative and discredit people. It’s always a challenge to figure out what’s the truth. But if you’ve worked in the field for a while, you can identify trusted sources. We call this “news sense.”
A: Clark: Not so much in public health. When there have been imposters and liars, there’s been a rapid debunking using the same tools.

Q: What incentives can we give for opening up corporate data?
A: Robert K: We call this data philanthropy but the private sector doesn’t see it that way. They don’t want their markets to fall into poverty; it’s business risk mitigation insurance. So there are some incentives there already.
A: Robert B: We need to make it possible for people to create apps that use the data.

Q: How about that Twitter censorship policy?
A: Ahmed: It’s censorship, but the way Twitter approached this was transparent, and some say it’s actually good for activists because Twitter could have gone for a broader censorship policy; instead, Twitter will only block in the country that demands it. In fact, Twitter lets you get around it by changing your location.

Q: How do we get Netvibes past the security concerns?
A: Robert B.: I’m a security geek, but employees need tools to be smarter. We can define which tools you have access to.

Q: Clark, do you run into privacy issues?
A: Clark: Most of the data in HealthMap comes from publicly available sources.
A: Robert K: There are situations arising for which we do not have a framework. A child protection expert had just returned from a crisis where young kids on a street were tweeting about being abused at home. “We’re not even allowed to ask that question,” she said, “but if they’re telling the entire world, can we use that to begin to advocate for their rescue?” Our frameworks have not yet adapted to this new reality.

Linton: After the Arab Spring, how do we use data to help build enduring value?
A: Ahmed: It’s not the tech but how we use it.
A: Robert K: Real time analytics and visualizations provide many-to-many communications. Groups can see their beliefs, enabling a type of self-awareness not possible before. These tools have the possibility of creating new types of identity.
A: Robert B: To make Twitter or Facebook smarter, you have to find different ways to use them. “Break it!” Don’t get stuck using today’s tech.

Linton: A 26-year-old Al Jazeera reporter was asked at a conf, “What’s the next big thing?” She replied, “I’m too old. Ask a high school student.”


Categories: liveblog, too big to know Tagged with: 2b2k • big data • liveblog • techatstate • techstate Date: February 3rd, 2012 dw


January 29, 2012

[2b2k] Big data, big apps

From Gigaom, five apps that could change Big Data.


Categories: too big to know Tagged with: 2b2k • big data Date: January 29th, 2012 dw


April 21, 2011

Big Data Models: Help me crowdsource sources

I’m thrilled that I’m going to be writing an article for Scientific American on big data models — models that cover some huge swath of life, such as the economy, the climate, sociopolitical change, etc. What’s the promise and what are the challenges? How far can such models scale?

So, who do you think I should interview? What projects strike you as particularly illuminating? Let me know in the comments, or at selfevident.com.


Thanks!


Categories: science, too big to know Tagged with: 2b2k • big data • sciam Date: April 21st, 2011 dw


March 4, 2011

[2b2k] Tagging big data

According to an article in Science Insider by Dennis Normile, a group formed at a symposium sponsored by the Board on Global Science and Technology of the National Research Council, an arm of the U.S. National Academies [that’s all they’ve got??], is proposing to make it easier to find big scientific data sets by using a standard tag, along with a standard way of conveying basic info about the nature of the set and its terms of use. “The group hopes to come up with a protocol within a year that researchers creating large data sets will voluntarily adopt. The group may also seek the endorsement of the Internet Engineering Task Force…”


Categories: everythingIsMiscellaneous, science, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • science • standards Date: March 4th, 2011 dw


February 20, 2011

[2b2k] Public data and metadata, Google style

I’m finding Google Labs’ Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

  • You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]

  • For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.

  • If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.

  • You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.

  • Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.
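As a concrete (if simplified) sketch of that workflow, here is how you might generate one slice table and one concept definition in Python. The element names follow the general shape of Google’s tutorial as I understand it, and the population figures are made up, so check the DSPL spec before treating any of it as canonical:

```python
import csv
import xml.etree.ElementTree as ET

# 1. A slice table: one row per (country, year, population) observation.
#    Column IDs are meant to match the concept IDs; figures are made up.
rows = [
    ("US", "2009", "307006550"),
    ("US", "2010", "309349689"),
    ("FR", "2010", "64612939"),
]
with open("countries_slice.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["country", "year", "population"])
    writer.writerows(rows)

# 2. A "population" concept: unique ID, data type, and explanatory metadata.
#    Element names are based on the DSPL tutorial; verify against the spec.
concept = ET.Element("concept", id="pop")
info = ET.SubElement(concept, "info")
name = ET.SubElement(info, "name")
ET.SubElement(name, "value").text = "Population"
ET.SubElement(concept, "type", ref="integer")
print(ET.tostring(concept, encoding="unicode"))
```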

Overall, I think this is a good thing. But:

  • While Google is making its formats public, and even its canonical definitions are downloadable, DSPL is “fully open” for use but fully Google’s to define. Having the 800-lb gorilla define the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.

  • Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.

  • Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element?

  • Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)

Overall, I think it’s a good step forward.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • google • metadata • standards • xml Date: February 20th, 2011 dw


January 27, 2011

[2b2k] Guardian aggregates all its data

The Guardian has been publishing its data for the past couple of years. Now it is making all of it available in one spreadsheet:

Want to see all of the data we have reported? Here’s all the data we’ve covered over the last two years, that’s almost 600 spreadsheets linked from one spreadsheet

Not just transparency, but convenience! Well done, Guardian!


Categories: journalism, media, too big to know Tagged with: 2b2k • big data • journallism Date: January 27th, 2011 dw




This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.
