Joho the Blog » big data

October 27, 2014

[liveblog] Christine Borgman

Christine Borgman, chair of Info Studies at UCLA, and author of the essential Scholarship in the Digital Age, is giving a talk on The Knowledge Infrastructure of Astronomy. Her new book is Big Data, Little Data, No Data: Scholarship in the Networked World, but you’ll have to wait until January. (And please note that precisely because this is a well-organized talk with clearly marked sections, it comes across as choppy in these notes.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Her new book draws on 15 yrs of studying various disciplines and 7-8 years focusing on astronomy as a discipline. It’s framed around the change to more data-intensive research across the sciences and humanities, plus the policy push for open access to content and to data. (The team site.)

They’ve been looking at four groups:

The world thinks that astronomy and genomics have figured out how to do data-intensive science, she says. But scientists in these groups know that it’s not that straightforward. Christine’s group is trying to learn from these groups and help them learn from one another.

Knowledge Infrastructures are “string and baling wire.” Pieces pulled together. The new layered on top of the old.

The first English scientific journal began almost 350 yrs ago. (Philosophical Transactions of the Royal Society.) We no longer think of the research object as a journal but as a set of articles, objects, and data. People don’t have a simple answer to what their data are. The raw files? The tables of data? When they’re told to share their data, they’re not sure what data is meant. “Even in astronomy we don’t have a single, crisp idea of what are our data.”

It’s very hard to find and organize all the archives of data. Even establishing a chronology is difficult. E.g., “Yes, that project has that date stamp but it’s really a transfer from a prior project twenty years older than that.” It’s hard to map the pieces.

Seamless Astronomy: ADS All Sky Survey, mapping data onto the sky. Also, they’re trying to integrate various link mappings, e.g., Chandra, NED, Simbad, WorldWide Telescope, arXiv.org, VizieR, Aladin. But mapping these collections doesn’t tell you why they’re being linked, what they have in common, or what their differences are. What kind of science is being accomplished by making those relationships? Christine hopes her project will help explain this, although not everyone will agree with the explanations.

Her group wants to draw some maps and models: “A Christmas Tree of Links!” She shows a variety of maps, possible ways of organizing the field. E.g., one from 5 yrs ago clusters services, repositories, archives and publishers. Another scheme: Publications, Objects, Observations; the connection between pubs (citations) and observations is the most loosely coupled. “The trend we’re seeing is that astronomy is making considerable progress in tying together the observations, publications, and data.” “Within astronomy, you’ve built many more pieces of your infrastructure than any other field we’ve looked at.”

She calls out Chris Erdmann [sitting immediately in front of me] as a leader in trying to get data curation and custodianship taken up by libraries. Others are worrying about bit-rot and other issues.

Astronomy is committed to open access, but the resource commitments are uneven.

Strengths of astronomy:

  • Collaboration and openness.

  • International coordination.

  • Long term value of data.

  • Agreed standards.

  • Shared resources.

Gaps of astronomy:

  • Investment in data stewardship: varies by mission and by type of research. E.g., space-based missions get more investment than the ground-based ones. (An audience member says that that’s because the space research was so expensive that there was more insistence on making the data public and usable. A lively discussion ensues…)

  • The access to data varies.

  • Curation of tools and technologies.

  • International coordination. Should we curate existing data? But you don’t get funding for using existing data. So, invest in getting new data from new instruments?
Christine ends with some provocative questions about openness. What does it mean exactly? What does it get us?

Q&A

Q: As soon as you move out of the Solar System to celestial astronomy, all the standards change.

A: When it takes ten years to build an instrument, it forces you to make early decisions about standards. But when you’re deploying sensors in lakes, you don’t always note that this is #127, the one Eric put the tinfoil on top of because it wasn’t working well. Or people use Google Docs and don’t even label the rows and columns because all the readers know what they mean. That makes going back to it much harder. “Making it useful for yourself is hard enough.” It’s harder still to make it useful for someone in 5 yrs, and harder still to make it useful for an unknown scientist in another country speaking another language and maybe from another discipline.

Q: You have to put a data management plan into every proposal, but you can’t make it a budget item… [There is a lively discussion of which funders reasonably fund this.]

Q: Why does Europe fund ground-based data better than the US does?

A: [audience] Because of Riccardo Giacconi.

A: [Christine] We need to better fund the invisible workforce that makes science work. We’re trying to cast a light on this invisible infrastructure.


June 18, 2014

[2b2k] The Despair of Knowledge

Jill Lepore has an excellent take-down in The New Yorker of Clay Christensen’s The Innovator’s Dilemma. Yet I am unconvinced.

I thought I was convinced when I read it. It’s a brilliantly done piece, examining Christensen’s evidence, questioning his methods, and drawing appropriate lessons, including wondering why we accepted the Innovator’s Dilemma for decades without critically examining it. (Christensen became so famous for it that his last name isn’t even flagged as a spelling error on my Mac.)

I got de-convinced by a discussion on a mailing list I’m on that points to some weaknesses in Lepore’s own argument, including her use of “cherry-picked” examples — a criticism she levels at Christensen — and her assumption that the continuity of companies, as opposed to their return on assets, is the right measure. As a person on the mailing list points out, John Hagel, John Seely Brown and Lang Davison take return on assets as a key metric in their book The Big Shift. And then someone else maintained that ROA is a poor measure of networked phenomena. That morphed into a discussion about the pragmatic value of truth: Does disruption provide a helpful framing for the New York Times as it considers its future?

The problem is that brains are truthy. They are designed to pay attention to things that seem to matter to us, bending our world around our concerns and interests. And brains are associative, so they make sense of the world — maybe even at the level of perception — by finding the relationships that seem to matter to us. In Heidegger’s terms, we are not indifferent knowing machines, but are creatures that care about what happens to us and to others. The brain is an unreliable narrator.

We now have access to an unfathomable sea of information that can contradict anything we settle on. That sea has been assembled by caring creatures and their minions, but it is so vast and global that it contains information beyond the caring and linking of any one of us. Every understanding can be subverted with a wink and a hand wave because all understanding simplifies a world that is resolutely and even necessarily complex. The universe outruns us.

Now we have machines that can look at masses of data and escape from our temptation to turn everything into a narrative. But those machines are limited by our decision about which data is worth gathering and connecting. There is hope in this direction, but it’s not clear whether we are capable of accepting the findings of machines that correlate without stories.

TL;DR: Our brains are truthy and the world is too big to make sense of. Not that that will stop us from trying.

[June 20:] Clay Christensen has cried foul in an interview.


January 16, 2014

CityCodesAndOrdinances.xml

A friend is looking into the best way for a city to publish its codes and ordinances to make them searchable and reusable. What are the best schemas or ontologies to use?

I work in a law school library so you might think I’d know. Nope. So I asked a well-informed mailing list. Here’s what they have suggested, more or less in their own words:


Any other suggestions?


December 24, 2013

Schema.org…now for datasets!

I had a chance to talk with Dan Brickley today, a semanticizer of the Web whom I greatly admire. He’s often referred to as a co-creator of FOAF, but these days he’s at Google working on Schema.org. He pointed me to the work Schema has been doing with online datasets, which I hadn’t been aware of. Very interesting.

Schema.org, as you probably know, provides a set of terms you can hide inside the HTML of your page that annotate what the visible contents are about. The major search engines — Google, Bing, Yahoo, Yandex — notice this markup and use it to provide more precise search results, and also to display results in ways that present the information more usefully. For example, if a recipe on a page is marked up with Schema.org terms, the search engine can identify the list of ingredients and let you search on them (“Please find all recipes that use butter but not garlic”) and display them in a more readable way. And of course it’s not just the search engines that can do this; any app that is looking at the HTML of a page can also read the Schema markup. There are Schema.org schemas for an ever-expanding list of types of information…and now datasets.
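
To make that concrete, here’s a small sketch of my own (not taken from Schema.org’s documentation or from Dan) of what such markup can look like: a couple of real Schema.org Recipe terms (name, recipeIngredient) wrapped in a JSON-LD script block and generated with a few lines of Python. The page and the values are invented.

    import json

    # Purely illustrative: describe a recipe with Schema.org terms and wrap it
    # in the kind of script block that search engines (and any other app that
    # reads a page's HTML) can pick out. The values are made up.
    recipe = {
        "@context": "http://schema.org",
        "@type": "Recipe",
        "name": "Butter shortbread (no garlic)",
        "recipeIngredient": ["butter", "flour", "sugar"],
    }

    markup = '<script type="application/ld+json">\n{}\n</script>'.format(
        json.dumps(recipe, indent=2)
    )
    print(markup)

A search engine that understands the Recipe vocabulary can then answer the butter-but-not-garlic query above.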

If you go to Schema.org/Dataset and scroll to the bottom where it says “Properties from Dataset,” you’ll see the terms you can insert into a page that talk specifically about the dataset referenced. It’s quite simple at this point, which is an advantage of Schema.org overall. But you can see some of the power of even this minimal set of terms over at Google’s experimental Schema Labs page where there are two examples.
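
Going the other direction, from markup back to data, here’s another sketch of mine, again just an illustration rather than anything Schema.org or Google prescribes: it scans a page’s HTML for JSON-LD blocks and keeps whatever is typed as a Dataset, pulling out a few generic properties (name, description, url) that Dataset inherits from Thing.

    import json
    from html.parser import HTMLParser

    class JSONLDExtractor(HTMLParser):
        """Collect the contents of <script type="application/ld+json"> blocks."""
        def __init__(self):
            super().__init__()
            self._in_jsonld = False
            self._buffer = []
            self.blocks = []

        def handle_starttag(self, tag, attrs):
            if tag == "script" and ("type", "application/ld+json") in attrs:
                self._in_jsonld = True
                self._buffer = []

        def handle_endtag(self, tag):
            if tag == "script" and self._in_jsonld:
                self._in_jsonld = False
                self.blocks.append("".join(self._buffer))

        def handle_data(self, data):
            if self._in_jsonld:
                self._buffer.append(data)

    def datasets_in(html):
        """Return the Schema.org Dataset descriptions found in a page's HTML."""
        parser = JSONLDExtractor()
        parser.feed(html)
        found = []
        for block in parser.blocks:
            try:
                item = json.loads(block)
            except ValueError:
                continue
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                found.append({key: item.get(key) for key in ("name", "description", "url")})
        return found

    # A made-up page with one marked-up dataset, to show the round trip:
    sample = """<html><body>
    <script type="application/ld+json">
    {"@context": "http://schema.org", "@type": "Dataset",
     "name": "Parking meter locations", "url": "http://example.org/parking.csv"}
    </script>
    </body></html>"""
    print(datasets_in(sample))

Nothing here is specific to Google’s demos; any app that reads HTML could do the same sort of harvesting.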

The first example (click on the “view” button) does a specialized Google search looking for pages that have been marked up with Schema’s Dataset terms. In the search box, try “parking,” or perhaps “military.” Clicking on a return takes you to the original page that provides access to the dataset.

The second demo lets you search for databases related to education via the work done by LRMI (Learning Resource Metadata Initiative); the LRMI work has been accepted (except for the term useRightsUrl) as part of Schema.org. Click on the “view” button and you’ll be taken to a page with a search box, and a menu that lets you search the entire Web or a curated list. Choose “entire Web” and type in a search term such as “calculus.”

This is such a nice extension of Schema.org. Schema was designed initially to let computers parse information on human-readable pages (“Aha! ‘Butter’ on this page is being used as a recipe ingredient and on that page as a movie title”), but now it can be used to enable computers to pull together human-readable lists of available datasets.

I continue to be a fan of Schema because of its simplicity and pragmatism, and, because the major search engines look for Schema markup, people have a compelling reason to add markup to their pages. Obviously Schema is far from the only metadata scheme we need, nor does it pretend to be. But for fans of loose, messy, imperfect projects that actually get stuff done, Schema is a real step forward that keeps taking more steps forward.


November 15, 2013

[liveblog] Noam Chomsky and Bart Gellman at Engaging Data

I’m at the Engaging Data 2013 conference where Noam Chomsky and Pulitzer Prize winner (twice!) Barton Gellman are going to talk about Big Data in the Snowden Age, moderated by Ludwig Siegele of The Economist. (Gellman is one of the three people to whom Snowden vouchsafed his documents.) The conference aims at having us rethink how we use Big Data and how it’s used.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

LS: Prof. Chomsky, what’s your next book about?

NC: Philosophy of mind and language. I’ve been writing articles that are pretty skeptical about Big Data. [Please read the orange disclaimer: I’m paraphrasing and making errors of every sort.]

LS: You’ve said that Big Data is for people who want to do the easy stuff. But shouldn’t you be thrilled as a linguist?

NC: When I got to MIT in 1955, I was hired to work on a machine translation program. But I refused to work on it. “The only way to deal with machine translation at the current stage of understanding was by brute force, which after 30-40 years is how it’s being done.” A principled understanding based on human cognition is far off. Machine translation is useful but you learn precisely nothing about human thought, cognition, language, anything else from it. I use the Internet. Glad to have it. It’s easier to push some buttons on your desk than to walk across the street to use the library. But the transition from no libraries to libraries was vastly greater than the transition from libraries to Internet. [Cool idea and great phrase! But I think I disagree. It depends.] We can find lots of data; the problem is understanding it. And a lot of the data around us goes through a filter so it doesn’t reach us. E.g., the foreign press reports that Wikileaks released a chapter about the secret TPP (Trans-Pacific Partnership). It was front page news in Australia and Europe. You can learn about it on the Net but it’s not news. The chapter was on Intellectual Property rights, which means higher prices for less access to pharmaceuticals, and rams through what SOPA tried to do, restricting use of the Net and access to data.

LS: For you Big Data is useless?

NC: Big data is very useful. If you want to find out about biology, e.g. But why no news about TPP? As Sam Huntington said, power remains strongest in the dark. [approximate] We should be aware of the long history of surveillance.

LS: Bart, as a journalist what do you make of Big Data?

BG: It’s extraordinarily valuable, especially in combination with shoe-leather, person-to-person reporting. E.g., a colleague used traditional reporting skills to get the entire data set of applicants for presidential pardons. Took a sample. More reporting. Used standard analytics techniques to find that white people are 4x more likely to get pardons, and that campaign contributors are also more likely. It would likewise be useful in urban planning [which is the Senseable City Lab’s remit]. But all this leads to more surveillance. E.g., I could make the case that if I had full data about everyone’s calls, I could do some significant reporting, but that wouldn’t justify it. We’ve failed to have the debate we need because of the claim of secrecy by the institutions in power. We become more transparent to the gov’t and to commercial entities while they become more opaque to us.

LS: Does the availability of Big Data and the Internet automatically mean we’ll get surveillance? Were you surprised by the Snowden revelations?

NC: I was surprised at the scale, but it’s been going on for 100 years. We need to read history. E.g., the counter-insurgency “pacification” of the Philippines by the US. See the book by McCoy [maybe this]. The operation used the most sophisticated tech at the time to get info about the population to control and undermine them. That tech was immediately used by the US and Britain to control their own populations, e.g., Woodrow Wilson’s Red Scare. Any system of power — the state, Google, Amazon — will use the best available tech to control, dominate, and maximize their power. And they’ll want to do it in secret. Assange, Snowden and Manning, and Ellsberg before them, are doing the duty of citizens.

BG: I’m surprised how far you can get into this discussion without assuming bad faith on the part of the government. For the most part what’s happening is that these security institutions genuinely believe most of the time that what they’re doing is protecting us from big threats that we don’t understand. The opposition comes when they don’t want you to know what they’re doing because they’re afraid you’d call it off if you knew. Keith Alexander said that he wishes that he could bring all Americans into this huddle, but then all the bad guys would know. True, but he’s also worried that we won’t like the plays he’s calling.

LS: Bruce Schneier says that the NSA is copying what Google and Yahoo, etc. are doing. If the tech leads to snooping, what can we do about it?

NC: Govts have been doing this for a century, using the best tech they had. I’m sure Gen. Alexander believes what he’s saying, but if you interviewed the Stasi, they would have said the same thing. Russian archives show that these monstrous thugs were talking very passionately to one another about defending democracy in Eastern Europe from the fascist threat coming from the West. Forty years ago, RAND released Japanese docs about the invasion of China, showing that the Japanese had heavenly intentions. They believed everything they were saying. I believe these are universals. We’d probably find it for Genghis Khan as well. I have yet to find any system of power that thought it was doing the wrong thing. They justify what they’re doing for the noblest of objectives, and they believe it. The CEOs of corporations as well. People find ways of justifying things. That’s why you should be extremely cautious when you hear an appeal to security. It literally carries no information, even in the technical sense: it’s completely predictable and thus carries no info. I don’t doubt that the US security folks believe it, but it is without meaning. The Nazis had their own internal justifications.

BG: The capacity to rationalize may be universal, but you’ll take the conversation off track if you compare what’s happening here to the Stasi. The Stasi were blackmailing people, jailing them, preventing dissent. As a journalist I’d be very happy to find that our govt is spying on NGOs or using this power for corrupt self-enriching purposes.

NC: I completely agree with that, but that’s not the point: The same appeal is made in the most monstrous of circumstances. The freedom we’ve won sharply restricts state power to control and dominate, but they’ll do whatever they can, and they’ll use the same appeals that monstrous systems do.

LS: Aren’t we all complicit? We use the same tech. E.g., Prof. Chomsky, you’re the father of natural language processing, which is used by the NSA.

NC: We’re more complicit because we let them do it. In this country we’re very free, so we have more responsibility to try to control our govt. If we do not expose the plea of security and separate out the parts that might be valid from the vast amount that’s not valid, then we’re complicit because we have the oppty and the freedom.

LS: Does it bug you that the NSA uses your research?

NC: To some extent, but you can’t control that. Systems of power will use whatever is available to them. E.g., they use the Internet, much of which was developed right here at MIT by scientists who wanted to communicate freely. You can’t prevent the powers from using it for bad goals.

BG: Yes, if you use a free online service, you’re the product. But if you use a for-pay service, you’re still the product. My phone tracks me and my social network. I’m paying Verizon about $1,000/year for the service, and VZ is now collecting and selling my info. The NSA couldn’t do its job as well if the commercial entities weren’t collecting and selling personal data. The NSA has been tapping into the links between their data centers. Google is racing to fix this, but a cynical way of putting this is that Google is saying “No one gets to spy on our customers except us.”

LS: Is there a way to solve this?

BG: I have great faith that transparency will enable the development of good policy. The more we know, the more we can design policies to keep power in check. Before this, you couldn’t shop for privacy. Now a free market for privacy is developing as providers tell us more about what they’re doing. Transparency allows legislation and regulation to be debated. The House Repubs came within 8 votes of prohibiting call data collection, which would have been unthinkable before Snowden. And there’s hope in the judiciary.

NC: We can do much more than transparency. We can make use of the available info to prevent surveillance. E.g., we can demand the defeat of TPP. And now hardware in computers is being designed to detect your every keystroke, leading some Americans to be wary of Chinese-made computers, but the US manufacturers are probably doing it better. And manufacturers for years have been trying to design fly-sized drones to collect info; that’ll be around soon. Drones are a perfect device for terrorists. We can learn about this and do something about it. We don’t have to wait until it’s exposed by Wikileaks. It’s right there in mainstream journals.

LS: Are you calling for a political movement?

NC: Yes. We’re going to need mass action.

BG: A few months ago I noticed a small gray box with an EPA logo on it outside my apartment in NYC. It monitors energy usage, which is useful for preventing brownouts. But it measures down to the apartment level, which could be useful to the police trying to establish your personal patterns. There’s no legislation or judicial review of the use of this data. We can’t turn back the clock. We can try to draw boundaries, and then have sufficient openness so that we can tell if they’ve crossed those boundaries.

LS: Bart, how do you manage the flow of info from Snowden?

BG: Snowden does not manage the release of the data. He gave it to three journalists and asked us to use our best judgment — he asked us to correct for his bias about what the most important stories are — and to avoid direct damage to security. The documents are difficult. They’re often incomplete and can be hard to interpret.

Q&A

Q: What would be a first step in forming a popular movement?

NC: Same as always. E.g., the women’s movement began in the 1960s (at least in the modern movement) with consciousness-raising groups.

Q: Where do we draw the line between transparency and privacy, given that we have real enemies?

BG: First you have to acknowledge that there is a line. There are dangerous people who want to do dangerous things, and some of these tools are helpful in preventing that. I’ve been looking for stories that elucidate big policy decisions without giving away specifics that would harm legitimate action.

Q: Have you changed the tools you use?

BG: Yes. I keep notes encrypted. I’ve learned to use the tools for anonymous communication. But I can’t go off the grid and be a journalist, so I’ve accepted certain trade-offs. I’m working much less efficiently than I used to. E.g., I sometimes use computers that have never touched the Net.

Q: In the women’s movement, at least 50% of the population stood to benefit. But probably a large majority of today’s population would exchange their freedom for convenience.

NC: The trade-off is presented as being for security. But if you read the documents, the security issue is how to keep the govt secure from its citizens. E.g., Ellsberg kept a volume of the Pentagon Papers secret to avoid affecting the Vietnam negotiations, although I thought the volume really only would have embarrassed the govt. Security is in fact not a high priority for govts. The US govt is now involved in the greatest global terrorist campaign that has ever been carried out: the drone campaign. Large regions of the world are now being terrorized. If you don’t know if the guy across the street is about to be blown away, along with everyone around, you’re terrorized. Every time you kill an Al Qaeda terrorist, you create 40 more. It’s just not a concern to the govt. In 1950, the US had incomparable security; there was only one potential threat: the creation of ICBMs with nuclear warheads. We could have entered into a treaty with Russia to ban them. See McGeorge Bundy’s history. It says that he was unable to find a single paper, even a draft, suggesting that we do something to try to ban this threat of total instantaneous destruction. E.g., Reagan tested Russian nuclear defenses in ways that could have led to horrible consequences. Those are the real security threats. And it’s true not just of the United States.
