Joho the Blog: big data Archives

October 12, 2016

[liveblog] Perception of Moral Judgment Made by Machines

I’m at the PAPIs conference where Edmond Awad [twitter] of the MIT Media Lab is giving a talk about “Moral Machine: Perception of Moral Judgment Made by Machines.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

He begins with a hypothetical in which you can swerve a car to kill one person instead of staying on its course and killing five. The audience chooses to swerve, and Edmond points out that we’re utilitarians. Second hypothetical: swerve into a barrier that will kill you but save the pedestrians. Most of us say we’d want the car to swerve. Edmond points out that this is a variation of the trolley problem, except now it’s a machine that’s making the decision for us.

Autonomous cars are predicted to reduce fatalities from accidents by 90%. He says his advisor’s research found that most people think a car should swerve and sacrifice its passenger, but they don’t want to buy such a car themselves. They want everyone else to.

He connects this to the Tragedy of the Commons in which if everyone acts to maximize their good, the commons fails. In such cases, governments sometimes issue regulations. Research shows that people don’t want the government to regulate the behavior of autonomous cars, although the US Dept of Transportation is requiring manufacturers to address this question.

Edmond’s group has created the Moral Machine, a website that poses moral dilemmas involving autonomous cars. There have been about two million users and 14 million responses.

Some national trends are emerging. E.g., Eastern countries tend to prefer to save passengers more than Western countries do. Now the MIT group is looking for correlations with other factors, e.g., religiousness, economics, etc. Also, what are the factors most crucial in making decisions?

They are also looking at the effect of automation levels on the assignment of blame. Toyota’s “Guardian Angel” model results in humans being judged less harshly: that mode has a human driver but lets the car override human decisions.


In response to a question, Edmond says that Mercedes has said that its cars will always save the passenger. He raises the possibility of the owner of such a car being held responsible for plowing into a bus full of children.

Q: The solutions in the Moral Machine seem contrived. The cars should just drive slower.

A: Yes, the point is to stimulate discussion. E.g., it doesn’t raise the possibility of swerving to avoid hitting someone who is in some way considered to be more worthy of life. [I’m rephrasing his response badly. My fault!]

Q: Have you analyzed chains of events? Does the responsibility decay the further you are from the event?

This very quickly gets game theoretical.


October 11, 2016

[liveblog] Bas Nieland, Datatrics, on predicting customer behavior

At the PAPIs conference Bas Nieland, CEO and Co-Founder of Datatrics, is talking about how to predict the color of shoes your customer is going to buy. The company tries to “make data science marketeer-proof for marketing teams of all sizes.” It tries to create 360-degree customer profiles by bringing together info from all the data silos.


They use some machine learning to create these profiles. The profile includes the buying phase, the best time to present choices to a user, and the type of persuasion that will get them to take the desired action. [Yes, this makes me uncomfortable.]

It is structured around a core API that talks to MongoDB and MySQL. They provide “workbenches” that work with the customer’s data systems. They use BigML to operate on this data.

The outcome is a set of models that can be used to make recommendations. They used visualizations so that marketeers could understand them, but the marketeers couldn’t figure out how to use even simplified visualizations. So they created visual decision trees. Still the marketeers couldn’t figure those out. So now they turn the data into simple declarative phrases: which audience to contact, in which channel, with what content, and when. E.g.:

“To increase sales, contact your customers in the buying phase with high engagement through FB with content about jeans on sale on Thursday, around 10 o’clock.”
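That last translation step, from model output to a declarative phrase, might be sketched like this; the field names and template are my assumptions, not Datatrics’ actual schema.

```python
# Hypothetical sketch: rendering a model's structured "next best action"
# output as one plain sentence a marketer can act on.

def action_sentence(rec: dict) -> str:
    """Render a recommendation record as one declarative instruction."""
    return (
        f"To increase {rec['goal']}, contact your customers in the "
        f"{rec['phase']} phase with {rec['engagement']} engagement through "
        f"{rec['channel']} with content about {rec['content']} on "
        f"{rec['day']}, around {rec['time']}."
    )

rec = {
    "goal": "sales",
    "phase": "buying",
    "engagement": "high",
    "channel": "FB",
    "content": "jeans on sale",
    "day": "Thursday",
    "time": "10 o'clock",
}
print(action_sentence(rec))
```

The point of the design, per the talk, is that the sentence, not the model, is the product the marketer sees.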

They predict the increase in sales for each action, and quantify in dollars the size of the opportunity. They also classify responses by customer type and phase.

For a hotel chain, they connected 16,000 variables and 21M data points, which BigML reduced to 75 variables, creating a predictive model that ended up getting the chain more customer conversions. E.g., if the model says someone is in the orientation phase, the Web site shows photos of recommended hotels. If in the decision phase, the user sees persuasive messages, e.g., “18 people have looked at this room today.” The messages themselves are chosen based on the customer’s profile.

Coming up: chatbot integration. It’s a “real conversation” [with a bot with a photo of an attractive white woman who is supposedly doing the chatting].

Take-aways: Start simple. Make ML very easy to understand. Make it actionable.


Me: Is there a way built in for a customer to let your model know that it’s gotten her wrong? E.g., stop sending me pregnancy ads because I lost the baby.

Bas: No.

Me: Is that on the roadmap?

Bas: Yes. But not on a schedule. [I’m not proud of myself for this hostile question. I have been turning into an asshole over the past few years.]


[liveblog] Vinny Senguttuvan on Predicting Customers

Vinny Senguttuvan is Senior Data Scientist at METIS. Before that, he was at the Facebook-based gaming company High 5 Games, which had 10M users. His talk at PAPIs: “Predicting Customers.”


The main challenge: Most of the players play for free. Only 2% ever spend money on the site, buying extra money to play. (It’s not gambling because you never cash out). 2% of those 2% contribute the majority of the revenue.

All proposed changes go through A/B testing. E.g., should we change the “Buy credits” button from blue to red? This is classic hypothesis testing: you put up both options and see which gets the better results. It’s important to remember that there’s a cost to making the change, so the A/B preference needs to be substantial enough. But often the differences are marginal, so you increase the sample size, which complicates the process. “A long list of changes means not enough time per change.” And you want to be sure that the change affects the paying customers positively, which means taking even longer.

When they don’t have enough samples, they can lower the confidence level required to make the change. Or they can bias one side of the hypothesis. Or they can assume the variables are independent and run simultaneous A/B tests on various variables. High 5 does all three. It’s not perfect, but it works.
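The tradeoff he describes (marginal differences, sample size, and a lowered confidence bar) can be sketched with a standard two-proportion z-test; the click counts below are invented.

```python
# A minimal A/B decision sketch: two-proportion z-test on clickthrough
# counts, with an adjustable confidence level. Numbers are made up.
from math import sqrt, erf

def z_test_two_proportions(clicks_a, n_a, clicks_b, n_b):
    """Return the z statistic and two-sided p-value for p_a != p_b."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Blue vs. red "Buy credits" button: a marginal difference, large samples.
z, p = z_test_two_proportions(clicks_a=310, n_a=10_000, clicks_b=352, n_b=10_000)
significant = p < 0.05          # the standard confidence bar
significant_relaxed = p < 0.20  # a lowered bar when samples are scarce
print(f"z={z:.2f} p={p:.3f} standard={significant} relaxed={significant_relaxed}")
```

With these invented counts the difference fails the standard 95% bar but passes a relaxed one, which is exactly the kind of judgment call the talk describes.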

Second, there is a popularity metric by which they rank or classify their 100 games. They constantly add games; the count went from 15 to 100 in two years. This continuously changes the ranking of the games. Plus, some are launched locked. This complicates things. Vinny’s boss came up with a model of an n-dimensional casino, but it was too complex. Instead, they take two simple approaches: 1. A weighted average of spins. 2. Bayesian. Both predicted well but had flaws, so they used a type of average of both.
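A minimal sketch of the Bayesian half of that blend: shrink each game’s observed score toward the global mean, more strongly the fewer plays it has. The prior weight and scores are made up, not High 5’s formula.

```python
# Bayesian-average ranking sketch: new games with few plays get pulled
# toward the global mean; established games keep their observed score.

def bayesian_average(game_mean, game_n, global_mean, prior_weight=100):
    """Shrink a game's observed score toward the global mean; the fewer
    observations, the stronger the shrinkage."""
    return (game_mean * game_n + global_mean * prior_weight) / (game_n + prior_weight)

global_mean = 3.0
new_game = bayesian_average(4.8, game_n=10, global_mean=global_mean)
old_game = bayesian_average(4.8, game_n=10_000, global_mean=global_mean)
print(new_game, old_game)  # the new game ranks much closer to the global mean
```

This is one common way to rank items whose sample sizes vary wildly, which is the problem created by constantly launching new games.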

Third: survival analysis. They wanted to know how many users are still active a given time after they created their account, and when a user is at risk of discontinuing use. First, they grouped users into cohorts (people who joined within a couple of weeks of each other) and plotted survival rates over time. They also observed return rates of users after each additional day of absence. And they implemented a Cox survival model. They found that newer users were more likely to decline in their use of the product; early users are more committed. This pattern is widespread. That means they have to continuously acquire new players. They also alert users when they reach the elbow of disuse.
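A toy version of the cohort analysis described: group users by signup window and compute the fraction still active after N days. The data and tuple format are invented for illustration.

```python
# Cohort retention sketch: what fraction of each signup cohort is still
# active after a given number of days?
from collections import defaultdict

def cohort_retention(users, horizon_days):
    """users: list of (signup_week, days_active) tuples. Returns
    {signup_week: fraction of that cohort still active at horizon_days}."""
    alive = defaultdict(int)
    total = defaultdict(int)
    for week, days_active in users:
        total[week] += 1
        if days_active >= horizon_days:
            alive[week] += 1
    return {week: alive[week] / total[week] for week in total}

# Invented data; the earlier cohort retains better, matching the talk's pattern.
users = [(0, 40), (0, 5), (0, 90), (1, 3), (1, 10), (1, 2)]
print(cohort_retention(users, horizon_days=30))
```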

Fourth: predictive lifetime value. Lifetime value = total revenue from a user over the entire time they use the product. This is significant because of costs: 10-15% of the revenue goes into ads to acquire customers. Their 365-day prediction model should be a time series, but they needed results faster, so they flipped it into a regression problem, predicting the 365-day revenue based on the user’s first-month data: how much they spent, purchase count, days of play, player level achieved, and the date they joined. [He talks about regression problems, but I can’t keep up.] At that point it cost $2 to acquire a customer from an FB ad, and $6 from mobile apps. But when they tested, the mobile acquisitions were more profitable than those that came through FB. It turned out that FB was counting as new users any player who hadn’t played in 30 days, and was re-charging them for it. [I hope I got that right.]
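The reframing he describes (a regression from first-month behavior to 365-day revenue, instead of a time series) might be sketched like this. The feature columns mirror the ones he names; the numbers and the least-squares approach are my own illustration.

```python
# LTV regression sketch: fit 365-day revenue from first-month features.
# All data is invented.
import numpy as np

# Columns: first-month spend, purchase count, days of play, level reached.
X = np.array([
    [0.0,  0,  2,  3],
    [1.0,  1,  4,  5],
    [2.0,  1,  6,  4],
    [5.0,  2, 10,  8],
    [20.0, 6, 25, 15],
    [8.0,  3, 12,  9],
])
y = np.array([0.0, 8.0, 12.0, 30.0, 180.0, 55.0])  # 365-day revenue

A = np.c_[np.ones(len(X)), X]                # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares

def predict_ltv(features):
    return coef[0] + coef[1:] @ np.asarray(features, dtype=float)

ltv = predict_ltv([10.0, 3, 15, 10])
# Compare predicted LTV against the per-channel acquisition costs mentioned:
worth_acquiring_via_fb = ltv > 2.0      # $2 per FB-acquired user
worth_acquiring_via_mobile = ltv > 6.0  # $6 per mobile-acquired user
```

The comparison at the end is the business logic that made the mis-counted FB acquisitions visible in the talk.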

Fifth: recommendation systems. Pandora notes the features of songs and uses them to recommend similar ones. YouTube makes recommendations based on relations among users. Non-negative matrix factorization [I may have the term slightly wrong] gives you the ability to predict the score for a video that you know nothing about in terms of content. But what if the ratings are not clearly defined? At High 5, there are no explicit ratings. They calculated a rating based on how often a player plays a game, how long the sessions are, etc. And what do you do about missing values? Use averages. But there are too many zeroes in the system, so they use sparse matrix solvers. Plus, there is a semi-order to the games, so they used some human input. [Useful for library Stackscores.]
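A toy sketch of factorizing implicit ratings: scores derived from play behavior, with unobserved pairs left out of the loss so the model can fill them in afterwards. This is plain SGD matrix factorization, not High 5’s production system, and the ratings matrix is invented.

```python
# Matrix factorization sketch for implicit ratings: learn low-rank user and
# game factors from observed entries only, then predict unobserved pairs.
import numpy as np

rng = np.random.default_rng(0)
# Implicit "ratings" derived from play counts/session length; 0.0 = unobserved.
R = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [0.0, 2.0, 4.0],
])
k = 2  # number of latent factors
U = rng.normal(scale=0.5, size=(R.shape[0], k))  # user factors
V = rng.normal(scale=0.5, size=(R.shape[1], k))  # game factors
observed = [(i, j) for i in range(R.shape[0])
            for j in range(R.shape[1]) if R[i, j] > 0]

for _ in range(3000):  # SGD over observed entries only
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]
        U[i] += 0.01 * (err * V[j] - 0.01 * U[i])
        V[j] += 0.01 * (err * U[i] - 0.01 * V[j])

# Predict a score for a (user, game) pair that was never observed.
print(round(float(U[0] @ V[2]), 2))
```

In production the “too many zeroes” problem pushes you toward sparse solvers rather than a dense loop like this.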


[liveblog] First panel: Building intelligent applications with machine learning

I’m at the PAPIs conference. The opening panel is about building intelligent apps with machine learning. The panelists are all representing companies. It’s Q&A with the audience; I will not be able to keep up well.


The moderator asks one of the panelists (Snejina Zacharia from Insurify) how AI can change a heavily regulated industry such as insurance. She replies that the insurance industry gets low marks for customer satisfaction, which is an opportunity. Also, they can leverage the existing platforms and build modern APIs on top of them. Also, they can explore how to use AI in existing functions, e.g., chatbots, and systems that let users just confirm their identification rather than enter all the data. They also let users pick from an AI-filtered list of carriers that are right for them. Also, personalization: predicting risk and adjusting the questionnaire based on the user’s responses.

Another panelist is working on mapping for a company that is not Google and that is owned by three car companies. So, when an Audi goes over a bump, and then a Mercedes goes over it, both will record the same data. On personalization: it’s ripe for change. People are talking about 100B devices being connected by 2020. People think that RFID tags didn’t live up to their early hype, but 10 billion RFID tags are going to be sold this year. These can provide highly personalized, highly relevant data. This will be the base for the next wave of apps. We need a standards body effort, and governments addressing privacy and security. Some standards bodies are working on it, e.g., GS1, which manages the barcode standard.

Another panelist: Why is marketing such a good opportunity for AI and ML? Marketers used to have a specific skill set. It’s an art: writing, presenting, etc. Now they’re being challenged by tech and have to understand data. In fact, now they have to think like scientists: hypothesize, experiment, redo the hypothesis… And now marketers are responsible for revenue. Being a scientist responsible for predictable revenue is driving interest in AI and ML. This panelist’s company uses data about companies and people to segment follow-ups on leads, etc. [Wrong place for a product pitch, IMO, which is a tad ironic, isn’t it?]

Another panelist: The question is: how can we use predictive intelligence to make our applications better? Layer input intelligence on top of input-programming-output. For this we need a platform that provides services and is easy to attach to existing processes.

Q: Should we develop cutting edge tech or use what Google, IBM, etc. offer?

A: It depends on whether you’re an early adopter or straggler. Regulated industries have to wait for more mature tech. But if your bread and butter is based on providing the latest and greatest, then you should use the latest tech.

A: It also depends on whether you’re doing a vertically integrated solution or something broader.

Q: What makes an app “smart”? Is it: Dynamic, with rapidly changing data?

A: Marketers use personas, e.g., a handful of types. They used to be just about written in stone. Smart apps update the personas after every campaign, every time you get new info about what’s going on in the market, etc.

Q: In B-to-C marketing, many companies have built the AI piece for advertising. Are you seeing any standardization or platforms on top of the advertising channels to manage the ads going out on them?

A: Yes, some companies focus on omni-channel marketing.

A: Companies are becoming service companies, not product companies. They no longer hand off to retailers.

A: It’s generally harder to automate non-digital channels. It’s harder to put a revenue number on, say, TV ads.


[liveblog] PAPIs: Cynthia Rudin on Regulating Greed

I’m at the PAPIs (Predictive Applications and APIs) [twitter: papisdotio] conference at the NERD Center in Cambridge.


The first speaker is Cynthia Rudin, Director of the Prediction Analysis Lab at MIT. Her topic is “Regulating Greed over Time: An Important Lesson for Practical Recommender Systems.” It’s about her lab’s entry in a data mining competition. (The entry did not win.) The competition was to design a better algorithm for Yahoo’s recommendation of articles. To create an unbiased data set, Yahoo showed people random articles for two weeks. Your algorithm had to choose which of the pool of articles to show a user. To evaluate a recommender system, they’d check whether your algorithm recommended the same thing that was actually shown to the user; if the user clicked on it, that counted as an evaluation. [I don’t think I got this right.] You sent your algorithm to Yahoo, and they evaluated its clickthrough rate; you never got access to Yahoo’s data.

This is, she says, a form of the multi-armed bandit problem: one arm is better (more likely to lead to a payout) but you don’t know which one. So you spend your time figuring out which arm is the best, and then you only pull that one. Yahoo and Microsoft are among the companies using multi-armed bandit systems for recommendation systems. “They’re a great alternative to massive A/B testing.” [No, I don’t understand this. Not Cynthia’s fault!]

Because the team didn’t have access to Yahoo’s data, they couldn’t tune their algorithms to it. Nevertheless, they achieved a 9% clickthrough rate … and still lost (albeit by a tiny margin). Cynthia explains how they increased the efficiency of their algorithms, but it’s math so I can only here play the sound of a muted trumpet. It involves “decayed exploration on the old articles” and a “peak grabber”: if an article gets more than 9 clicks out of its last 100 displays, keep displaying it. If you have a good article, grab it. The dynamic version of the Peak Grabber keeps showing a peak article as long as its clickthrough rate is 14% above the global clickthrough rate.
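Here is a rough sketch of what a decaying-exploration bandit with a peak grabber might look like. The 9-out-of-the-last-100 threshold comes from the talk; everything else (arm count, clickthrough rates, the decay schedule) is invented for illustration and is not the team’s actual algorithm.

```python
# Epsilon-greedy bandit with decaying exploration plus a "peak grabber":
# any article with >9 clicks in its last 100 displays keeps being shown.
import random
from collections import deque

def run(true_ctrs, steps=20_000, seed=1):
    random.seed(seed)
    n = len(true_ctrs)
    shows, clicks = [0] * n, [0] * n
    recent = [deque(maxlen=100) for _ in range(n)]  # last 100 outcomes per article
    total_clicks = 0
    for t in range(1, steps + 1):
        # Peak grabber: lock onto an article that is clearly performing.
        peaks = [a for a in range(n) if sum(recent[a]) > 9]
        if peaks:
            arm = max(peaks, key=lambda a: sum(recent[a]))
        elif random.random() < 1.0 / (t ** 0.5):  # decaying exploration
            arm = random.randrange(n)
        else:  # otherwise greedy on the empirical clickthrough rate
            arm = max(range(n),
                      key=lambda a: clicks[a] / shows[a] if shows[a] else 0.0)
        reward = 1 if random.random() < true_ctrs[arm] else 0
        shows[arm] += 1
        clicks[arm] += reward
        recent[arm].append(reward)
        total_clicks += reward
    return total_clicks / steps

print(f"overall CTR: {run([0.02, 0.05, 0.12]):.3f}")
```

With these invented rates, the simulation settles near the best arm’s clickthrough rate, which is the “if you have a good article, grab it” behavior described.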

“We were adjusting the exploration-exploitation tradeoff based on trends.” Is this a phenomenon worth exploring? The phenomenon: you shouldn’t always explore. There are times when you should just stop and exploit the flowers.

Some data supports this. E.g., in England, on Boxing Day you should be done exploring and just put your best prices on things — not too high, not too low. When the clicks on your site are low, you should be exploring. When high, maybe not. “Life has patterns.” The Multiarm Bandit techniques don’t know about these patterns.

Her group came up with a formal way of putting this. At each time there is a known reward multiplier, G(t). G is like the number of people in the store. When G is high, you want to exploit, not explore. In the lower zones you want to balance exploration and exploitation.
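The G(t) idea can be rendered as a tiny function: exploration shrinks both as time passes and as the known reward multiplier grows. The functional form and constants here are my own illustration, not the paper’s algorithm.

```python
# Exploration rate modulated by the known reward multiplier G(t):
# explore in the quiet zones, exploit when the store is full.

def exploration_rate(t, g, base=0.2, g_high=100):
    """Epsilon for time t given reward multiplier g (e.g. store traffic)."""
    decay = base / (1 + 0.001 * t)                # usual decay over time
    return decay * min(1.0, g_high / max(g, 1))   # damped further when G is high

quiet = exploration_rate(t=1000, g=10)            # low traffic: explore more
boxing_day = exploration_rate(t=1000, g=10_000)   # peak traffic: epsilon near 0
print(quiet, boxing_day)
```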

So they created two theorems, each leading to an algorithm. [She shows the algorithms. I can’t type math notation that fast.]


September 21, 2016

[iab] Frances Donegan-Ryan

At the IAB conference, Frances Donegan-Ryan from Bing begins by reminding us of the history of online search.


We all leave digital footprints, she says. Every time we search, data is recorded. The sequence of our searches gives especially useful information to help the engine figure out what you’re trying to find out. Now the engines can refer to social graphs.

“But what do we do with data?”

Bing Predicts looks at all the data it can in order to make predictions. It began by predicting the winners and losers on American Idol, and got it 100% right. For this election year, it tried to predict who would win each state primary or caucus in the US. Then it took in sentiment data to figure out which issues matter in each state, broken down by demographic group.

Now, for example, it can track a new diabetes drug through the places people visit when logged into their browser. This might show that there are problems with the drug; consider, for example, people searching for unexpected side effects of it. Bing shares the results of this analysis with the CDC. [The acoustics where I was sitting were poor. I’m not sure I got this right.]

They’re doing the same for retail products, and are able to tell which will be the big sellers.

Frances talks about Cortana, “the only digital system that works across all platforms.” Microsoft is working on many more digital assistants (“Bots”) that live within other services. She shows a temporary tattoo made from gold leaf that you can use as a track pad, among other uses; this came out of MIT.

She says that the Microsoft version of a Fitbit can tell if you’re dehydrated or tired, and then can point you to the nearest place with water and a place to sit. Those shops could send you a coupon.

She goes quickly over the Hololens since Robert Scoble covered it so well this morning.

She closes with a story about using sensor data to know when a cow is in heat, which, it turns out, correlates with the cow walking faster. Then the data showed at what point in the fertility window a male or female calf is likely to be conceived. Then they started using genetic data to predict genetically disabled calves.

It takes enormous computing power to do this sort of data analysis.


[iab] Privacy discussion

I’m at the IAB conference in Toronto. Canada has a privacy law, PIPEDA (the Personal Information Protection and Electronic Documents Act), passed in 2001 and based on OECD principles.


Barbara Bucknell is the director of policy and research at the Office of the Privacy Commissioner, where she worries about how to protect privacy while still being able to take advantage of all the good stuff data can do.

A recent large survey found that more than half of Canadians are more concerned about privacy than they were last year. Only 34% think the govt is doing enough to keep their privacy safe. Globally, 8 out of 10 are worried about their info being bought, sold, or monitored. “Control is the key concern here.” “They’re worried about surprises: ‘Oh, I didn’t know you were using my information that way!'”

Adam Kardash [this link?] says that all the traditional approaches to privacy have to be carefully reconsidered. E.g., data minimization says you only collect what you need. “It’s a basic principle that’s been around forever.” But data scientists, when asked how much data they need for innovation, will say “We need it all.” Also, it’s incredibly difficult to explain how your data is going to be used, especially at the grade 6-7 literacy level that is required. And for data retention, we should keep medical info forever. Marketers will tell you the same thing so they can give you information about what you really need.

Adam raises the difficulties with getting consent, which the OPC opened a discussion about. Often asking for consent is a negligible part of the privacy process. “The notion of consent is having an increasingly smaller role” while the question of control is growing.

He asks Barbara: “How does PIPEDA facilitate trust?”

Barbara: It puts guardrails into the process. They may be hard to implement but they’re there for a reason. The original guidelines from the OECD were prescient. “It’s good to remember there were reasons these guardrails were put in place.”

Consent remains important, she says, but there are also other components, including accountability. The organization has to protect data and be accountable for how it’s used. Privacy needs to be built into services and into how your company is organized. Are the people creating the cool tech talking to the privacy folks and to the legal folks? “Is this conversation happening at the front end?” You’d be surprised how many organizations don’t have those kind of programs in place.

Barbara: Can you talk to the ethical side of this?

Adam: Companies want to know how to be respectful as part of their trust framework, not just meeting the letter of the law. “We believe that the vast majority of Big Data processing can be done within the legal framework. And then we’re creating a set of questions” in order for organisations to feel comfortable that what they’re doing is ethical. This is very practical, because it forestalls lawsuits. PIPEDA says that organizations can only process data for purposes a reasonable person would consider appropriate. We think that includes the ethical concerns.

Adam: How can companies facilitate trust?

Barbara: It’s vital to get these privacy management programs into place that will help facilitate discussions of what’s not just legal but respectful. And companies have to do a better job of explaining to individuals how they’re using their data.


March 1, 2016

[berkman] Dries Buytaert

I’m at a Berkman [twitter: BerkmanCenter] lunchtime talk (I’m moderating, actually) where Dries Buytaert is giving a talk about some important changes in the Web.


He begins by recounting his early days as the inventor of Drupal, in 2001. He’s also the founder of Acquia, one of the fastest growing tech companies in the US. It currently has 750 people working on products and services for Drupal. Drupal is used by about 3% of the billion web sites in the world.

When Drupal started, he felt he “could wrap his arms” around everything going on on the Web. Now that’s impossible, he says. E.g., Google AdWords was just starting; now AdWords is a $65B business. The mobile Web didn’t exist. Social media didn’t yet exist. Drupal was (and is) open source, a concept that most people didn’t understand. “Drupal survived all of these changes in the market because we thought ahead” and then worked with the community.

“The Internet has changed dramatically” in the past decade. Big platforms have emerged. They’re starting to squeeze smaller sites out of the picture. There’s research that shows that many people think that Facebook is the Internet. “How can we save the open Web?” Dries asks.

What do we mean by the open or closed Web? The closed Web consists of walled gardens. But these walled gardens also do some important good things: bringing millions of people online, helping human rights and liberties, and democratizing the sharing of information. But their scale is scary. FB has 1.6B active users every month; Apple has over a billion iOS devices. Such behemoths can shape the news. They record data about our behavior, and they won’t stop until they know everything about us.

Dries shows a table of what the different big platforms know about us. “Google probably knows the most about us” because of gMail.

The closed web is winning “because it’s easier to use.” E.g., after Dries moved from Belgium to the US, Facebook etc. made it much easier to stay in touch with his friends and family.

The open web is characterized by:

  1. Creative freedom — you could create any site you wanted and style it any way you pleased

  2. Serendipity. That’s still there, but it’s less used. “We just scroll our FB feed and that’s it.”

  3. Control — you owned your own data.

  4. Decentralized — open standards connected the pieces

Closed Web:

  1. Templates dictate your creative license

  2. Algorithms determine what you see

  3. Privacy is in question

  4. Information is siloed

The big platforms are exerting control. E.g., Twitter closed down its open API so it could control the clients that access it. FB launched “Free Basics” that controls which sites you can access. Google lets people purchase results.

There are three major trends we can’t ignore, he says.

First, there’s the “Big Reverse of the Web,” which Dries has been blogging about. “We’re in a transformational stage of the Web,” flipping it on its head. We used to go to sites and get the information we want. Now information is coming to us. “Info, products, and services will come to us at the right time on the right device.”

Second, “Data is eating the world.”

Third, “Rise of the machines.”

For example, “content will find us,” AKA “mobile or contextual information.” If your flight is cancelled, the info available to you at the airport will provide the relevant info, not offer you car rentals for when you arrive. This creates a better user experience, and “user experience always wins.”

Will the Web be open or closed? “It could go either way.” So we should be thinking about how we can build data-driven, user-centric algorithms. “How can we take back control over our data?” “How can we break the silos” and decentralize them while still offering the best user experience. “How do we compete with Google in a decentralized way. Not exactly easy.”

For this, we need more transparency about how data is captured and used, but also how the algorithms work. “We need an FDA for data and algorithms.” (He says he’s not sure about this.) “It would be good if someone could audit these algorithms,” because, for example, Google’s can affect an election. But how to do this? Maybe we need algorithms to audit the algorithms?

Second, we need to protect our data. Perhaps we should “build personal information brokers.” You unbundle FB and Google, put the data in one place, and through APIs give apps access to them. “Some organizations are experimenting with this.”
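As a thought experiment, the broker Dries describes might look something like this: one store for your data, with per-app grants you control and can revoke, and apps reaching the data only through an API. All names and the API shape here are invented, not any existing product.

```python
# Toy "personal information broker": your data lives in one store you
# control; apps get only the scopes you grant, revocable at any time.

class PersonalDataBroker:
    def __init__(self, data):
        self._data = data    # e.g. {"contacts": [...], "photos": [...]}
        self._grants = {}    # app name -> set of permitted scopes

    def grant(self, app, scopes):
        self._grants.setdefault(app, set()).update(scopes)

    def revoke(self, app):
        self._grants.pop(app, None)  # the user withdraws all access

    def fetch(self, app, scope):
        if scope not in self._grants.get(app, set()):
            raise PermissionError(f"{app} has no grant for {scope}")
        return self._data[scope]

broker = PersonalDataBroker({"contacts": ["alice", "bob"], "photos": []})
broker.grant("chat-app", {"contacts"})
print(broker.fetch("chat-app", "contacts"))
# broker.fetch("chat-app", "photos") would raise PermissionError
```

The point of the design is the inversion Dries calls for: the app asks your store, rather than the platform owning the data.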

Third, decentralization and a better user experience. “For the open web to win, it needs to be much easier to use.” This is where Open Source and open standards come in, for they allow us to build a “layer of tech that enables different apps to communicate, and that makes them very easy to use.” This is very tricky. E.g., how do you make it easy to leave a comment on many different sites without requiring people to log in to each?

It may look almost impossible, but global projects like Drupal can have an impact, Dries says. “We have to try. Today the Web is used by billions of people. Tomorrow by more people.” The Internet of Things will accelerate the Net’s effect. “The Net will change everything, every country, every business, every life.” So, “we have a huge responsibility to build the web that is a great foundation for all these people for decades to come.”

[Because I was moderating the discussion, I couldn’t capture it here. Sorry.]


October 27, 2014

[liveblog] Christine Borgman

Christine Borgman, chair of Information Studies at UCLA and author of the essential Scholarship in the Digital Age, is giving a talk on the knowledge infrastructure of astronomy. Her new book is Big Data, Little Data, No Data: Scholarship in the Networked World, but you’ll have to wait until January. (And please note that precisely because this is a well-organized talk with clearly marked sections, it comes across as choppy in these notes.)


Her new book draws on 15 years of studying various disciplines and 7-8 years focusing on astronomy as a discipline. It’s framed around the change to more data-intensive research across the sciences and humanities, plus the policy push for open access to content and to data. (The team site.)

They’ve been looking at four groups:

The world thinks that astronomy and genomics have figured out how to do data intensive science, she says. But scientists in these groups know that it’s not that straightforward. Christine’s group is trying to learn from these groups and help them learn from one another

Knowledge Infrastructures are “string and baling wire.” Pieces pulled together. The new layered on top of the old.

The first English scientific journal began almost 350 years ago (the Philosophical Transactions of the Royal Society). We no longer think of the research object as a journal but as a set of articles, objects, and data. People don’t have a simple answer to what their data is. The raw files? The tables of data? When they’re told to share their data, they’re not sure what data is meant. “Even in astronomy we don’t have a single, crisp idea of what are our data.”

It’s very hard to find and organize all the archives of data. Even establishing a chronology is difficult. E.g., “Yes, that project has that date stamp but it’s really a transfer from a prior project twenty years older than that.” It’s hard to map the pieces.

Seamless Astronomy: ADS All Sky Survey, mapping data onto the sky. Also, they’re trying to integrate various link mappings, e.g., Chandra, NED, Simbad, WorldWide Telescope, VizieR, Aladin. But mapping these collections doesn’t tell you why they’re being linked, what they have in common, or what are their differences. What kind of science is being accomplished by making those relationships? Christine hopes her project will help explain this, although not everyone will agree with the explanations.

Her group wants to draw some maps and models: “A Christmas Tree of Links!” She shows a variety of maps, possible ways of organizing the field. E.g., one from 5 yrs ago clusters services, repositories, archives and publishers. Another scheme: Publications, Objects, Observations; the connection between pubs (citations) and observations is the most loosely coupled. “The trend we’re seeing is that astronomy is making considerable progress in tying together the observations, publications, and data.” “Within astronomy, you’ve built many more pieces of your infrastructure than any other field we’ve looked at.”

She calls out Chris Erdmann [sitting immediately in front of me] as a leader in trying to get data curation and custodianship taken up by libraries. Others are worrying about bit-rot and other issues.

Astronomy is committed to open access, but the resource commitments are uneven.

Strengths of astronomy:

  • Collaboration and openness.

  • International coordination.

  • Long term value of data.

  • Agreed standards.

  • Shared resources.

Gaps of astronomy:

  • Investment in data stewardship: varies by mission and by type of research. E.g., space-based missions get more investment than the ground-based ones. (An audience member says that that’s because the space research was so expensive that there was more insistence on making the data public and usable. A lively discussion ensues…)

  • The access to data varies.

  • Curation of tools and technologies.

  • International coordination. Should we curate existing data? But you don’t get funding for using existing data. So, invest in getting new data from new instruments?

Christine ends with some provocative questions about openness. What does it mean exactly? What does it get us?


Q: As soon as you move out of the Solar System to celestial astronomy, all the standards change.

A: When it takes ten years to build an instrument, it forces you to make early decisions about standards. But when you’re deploying sensors in lakes, you don’t always note that this is #127 that Eric put the tinfoil on top of because it wasn’t working well. Or people use Google Docs and don’t even label the rows and columns because all the readers know what they mean. That makes going back to it much harder. “Making it useful for yourself is hard enough.” It’s harder still to make it useful for someone in 5 yrs, and harder still to make it useful for an unknown scientist in another country speaking another language and maybe from another discipline.

Q: You have to put a data management plan into every proposal, but you can’t make it a budget item… [There is a lively discussion of which funders reasonably fund this]

Q: Why does Europe fund ground-based data better than the US does?

A: [audience] Because of Riccardo Giacconi.

A: [Christine] We need to better fund the invisible workforce that makes science work. We’re trying to cast a light on this invisible infrastructure.


June 18, 2014

[2b2k] The Despair of Knowledge

Jill Lepore has an excellent take-down in The New Yorker of Clay Christensen’s The Innovator’s Dilemma. Yet I am unconvinced.

I thought I was convinced when I read it. It’s a brilliantly done piece, examining Christensen’s evidence, questioning his methods, and drawing appropriate lessons, including wondering why we accepted the Innovator’s Dilemma for decades without critically examining it. (Christensen became so famous for it that his last name isn’t even flagged as a spelling error on my Mac.)

I got de-convinced by a discussion on a mailing list I’m on that points to some weaknesses in Lepore’s own argument, including her use of “cherry-picked” examples — a criticism she levels at Christensen — and her assumption that the continuity of companies, as opposed to their return on assets, is the right measure. As a person on the mailing list points out, John Hagel, John Seely Brown and Lang Davison take return on assets as a key metric in their book The Big Shift. And then someone else maintained that ROA is a poor measure of networked phenomena. That morphed into a discussion about the pragmatic value of truth: Does disruption provide a helpful framing for the New York Times as it considers its future?

The problem is that brains are truthy. They are designed to pay attention to things that seem to matter to us, bending our world around our concerns and interests. And brains are associative, so they make sense of the world — maybe even at the level of perception — by finding the relationships that seem to matter to us. In Heidegger’s terms, we are not indifferent knowing machines, but are creatures that care about what happens to us and to others. The brain is an unreliable narrator.

We now have access to an unfathomable sea of information that can contradict anything we settle on. That sea has been assembled by caring creatures and their minions, but it is so vast and global that it contains information beyond the caring and linking of any one of us. Every understanding can be subverted with a wink and a hand wave because all understanding simplifies a world that is resolutely and even necessarily complex. The universe outruns us.

Now we have machines that can look at masses of data and escape from our temptation to turn everything into a narrative. But those machines are limited by our decision about which data is worth gathering and connecting. There is hope in this direction, but it’s not clear whether we are capable of accepting the findings of machines that correlate without stories.

TL;DR: Our brains are truthy and the world is too big to make sense of. Not that that will stop us from trying.


[June 20:] Clay Christensen has cried foul in an interview.

