Joho the Blog: big data Archives

March 19, 2018

[liveblog] Kate Zwaard, on the Library of Congress Labs

Kate Zwaard (twitter: @kzwa), Chief of National Digital Strategies at the Library of Congress and leader of the LC Lab, is opening MIT Libraries’ Grand Challenge Summit. The next 1.5 days will be about the grand challenges in enabling scholarly discovery.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

For context she tells us that the LC is the largest library in the world, with 164M items. It has the world’s largest collection of film, maps, comic books, telephone directories, and more. [Too many for me to keep this post up with.]

  • You can walk for two football fields just in the maps section. The world’s largest collection of recorded sound. The largest collection

  • Personal papers from Ben Franklin, Rosa Parks, Groucho Marx, Claude Shannon, and so many more.

  • Last year they circulated almost a million physical items.

  • Every week 11,000 tangible items come in through the Copyright office.

  • Last year, they digitized 4.7M items, as well as 730M documents crawled from the Web, plus much more. File count: 243M and growing every day.

These serve just one of the LC’s goals: “Acquire, preserve, and provide access to a universal collection of knowledge and the record of America’s creativity.” Not to mention serving Congress, and much more. [I can only keep up with a little of this. Kate’s a fantastic presenter and is not speaking too quickly. The LC is just too big!]

Kate thinks of the LC’s work as an exothermic reaction that needs an activation energy or catalyst. She leads the LC Labs, which started a year ago as a place of experimentation. The LC is a delicate machine, which makes it hard for it to change. The Labs enable experimentation. “Trying things that are easy and cheap is the only way forward.”

When thinking about what to do next, she thinks about what’s feasible and the impact. One way of having impact: demonstrating that the collection has unexplored potential for research. She’s especially interested in how the Labs can help deal with the problem of scale at the LC.

She talks about some of the Lab’s projects.

If you wanted to make stuff with LC data, there was no way of doing that. Now there’s LC for Robots, added documentation, and Jupyter Notebooks: an open source Web app that lets you create open docs that contain code, running text, etc. It lets people play with the API without doing all the work from scratch.

But it’s not enough to throw some resources onto a Web page. The NEH data challenge asked people to create new things using the info about 12M newspapers in the collection. Now the Lab has the Congressional Data Challenge: do something with Congressional data.

Labs has an Innovator in Residence project. The initial applicants came from LC to give it a try. One of them created a “Beyond Words” crowdsourcing project that asks people to add data to resources.

Kate likes helping people find collections they otherwise would have missed. For ten years LC has collaborated with the Flickr Commons. But they wanted to crowdsource a transcription project for any image of text. A repo will be going up on GitHub shortly for this.

In the second year of the Innovator in Residence, they got the artist Jer Thorp [Twitter: @blprnt] to come for 6 months. Kate talks about his work with the papers of Edward Lorenz, who coined the phrase “The Butterfly Effect.” Jer animated Lorenz’s attractor, which, he points out, looks a bit like a butterfly. Jer’s used the attractor on a collection of 3M words. It results in “something like a poem.” (Here’s Jer’s Artist in the Archive podcast about his residency.)

Jer wonders how we can put serendipity back into the LC and into the Web. “How do we enable our users to be carried off by curiosity, not by a particular destination?” The LC is a closed stack library, but it can help guide digital wanderers.

Last year the LC released 25M catalog records. Jer did a project that randomly pulls the first names of 20 authors in any particular year. It demonstrates, among other things, the changing demographics of authors. Another project: “Birthy Deathy” that displays birthplace info. Another looks for polymaths.

In 2018 the Lab will have their first open call for an Innovator in Residence. They’ll be looking for data journalists.

Kate talks about Laura Wrubel’s work with the Lab. “Library of Congress Colors” displays a graphic of the dominant colors in a collection.

Or Laura’s Photo Roulette: you guess the date of a photo.

Kate says she likes to think that libraries are not just “book holes.” One project: find links among items in the archives. But the WARC format is not amenable to that.

The Lab is partnering with lots of great groups, including JSTOR and Wikidata.

They’re working on using machine learning to identify place names in their photos.

What does this have to do with scale, she asks, noting that the LC has done pretty well with scale. E.g., for the past seven years, the size of their digital collection has doubled every 32 months.
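The doubling figure is easy to turn into arithmetic. A minimal sketch, assuming steady exponential growth (an assumption; the talk gave only the doubling period), with `growth_factor` as a made-up helper name:

```python
def growth_factor(months: float, doubling_months: float = 32.0) -> float:
    """Factor by which a collection grows over `months`
    if it doubles once every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


# Seven years is 84 months, i.e. a bit more than two and a half doublings.
seven_years = growth_factor(7 * 12)
print(f"Growth over seven years: roughly {seven_years:.1f}x")
```

Under that assumption, seven years of doubling every 32 months works out to a bit more than a sixfold increase.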

The Library also thinks about how to become a place of warmth and welcome. (She gives a shout out to MIT Libraries’ Future of Libraries report). Right now, visitors and scholars go to different parts of the building. Visitors to the building see a monument to knowledge, but not a living, breathing place. “The Library is for you. It is a place you own. It is a home.”

She reads from a story by Anne Lamott.

How friendship relates to scale. “Everything good that has happened in my life has happened because of friendship.” The average length of employment of a current employee is thirty years (that’s not the average retirement year). “It’s not just for the LC but for our field.” Good advice she got: “Pick your career by the kind of people you like to be around.” Librarians!

“We’ve got a tough road ahead of us. We’re still in the early days of the disruption that computation is going to bring to our profession.” “Friendship is what will get us through these hard times. We need to invite people into the tent.” “Everything we’ve accomplished has been through the generosity of our friends and colleagues.” This is 100% true of the Labs. It’s just 4 people, but everything they do is done in collaboration.

She concludes (paraphrasing badly): I don’t believe in geniuses, and I don’t believe in paradigm shifts. I believe in friendship and working together over the long term. [She put this far better.]

Q&A

Q: How does the Lab decide on projects?

A: Collaboratively.

Q: I’m an archivist at MIT. The works are in closed stack, which can mislead people about the scale. How do we explain the scale in an interesting way?

A: Funding is difficult because so much of the money that comes is to maintain and grow the collection and services. It can be a challenge to carve out funding for experimentation and innovation. We’ve been working hard on finding ways to help people wrap their heads around the place.

Q: Data science students are eager to engage, e.g., as interns. How can academic institutions help to make that happen?

A: We’re very interested in what sorts of partnerships we can create to bring students in. The data is so rich, and the place is so interesting.

Q: Moving from models that think about data as packages as opposed to unpacking and integrating. What do you think about the FAIR principles: making things Findable, Accessible, Interoperable, and Reusable? Also, we need to bring in professionals thinking about knowledge much more broadly.

A: I’m very interested in HathiTrust’s data capsules. Are there ways we can allow people to search through audio files that are not going to age into the commons until we’re gone? You’re right: the model of chunks coming in and out is not going to work for us.

Q: In academia, our focus has been to provide resources efficiently. How can we weave in serendipity without hurting the efficiency?

A: That’s hard. Maybe we should just serve the person who has a specific purpose. You could give ancillary answers. And crowdsourcing could make a lot more available.

[Great talk.]


November 1, 2016

[liveblog][bkc] Paola Villarreal on Public Interest in Data Science

I’m at a Berkman Klein Center lunch time talk by Paola Villarreal [twitter: paw], a BKC fellow, on “Public Interest in Data Science.” (Paola points to a github page for her project info.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Public interest, she says, is the effecting of changes in social policies in the interest of the public, especially for the underdog. Data science extracts knowledge and insight from data in various forms, using math, statistics, research, info science and computer science. What happens if you put data and tech in the hands of civil liberties orgs, human rights activists, media outlets, and governments? How might this affect liberty, justice, equality, and transparency and accountability?


She is going to talk about the Data for Justice project, which is supported by the Ford Foundation, the ACLU, and the Mozilla Foundation. The aim is to empower lawyers and advocates to make data-supported cases for improving justice in their communities.


The process: get the data, normalize it, process it, analyze it, visualize it … and then socialize it, inform change, and make it last! She cautions that it is crucial to make sure that you’ve identified the affected communities and that they’re involved in generating a solution. All the stakeholders should be involved in co-designing the solution.


Paola talks about the Annie Dookhan case. Dookhan was a chemist at a Massachusetts crime lab, who falsified evidence, possibly affecting 24,000 cases. Paola shows a table of data: the percentage of adults and juveniles convicted in drug cases and those whose evidence went through Dookhan. It’s a very high number: in some counties, over 25% of the drug convictions used possibly falsified data from Dookhan.


She shows a map of Boston that shows that marijuana-related police interactions occur mainly where people of color live. She plays a clip from marijuana.justiceos.org.


She lists her toolkit, which includes R, Stata, PostGIS, Ant (Augmented Narrative Toolkit), and Tableau.


But what counts is having an impact, she says. That means reaching out to journalists, community organizers, authorities, and lawmakers.


She concludes that data and tech do not do anything by themselves, and data scientists are only one part of a team with a common goal. The intersection of law and data is important. She concludes: Data and tech in the hands of people working with and for the public interest can have an impact on people’s lives.


Q&A

Q: Why are communities not more often involved?


A: It’s hard. It’s expensive. And data scientists are often pretty far removed from community organizing.


Q: Much of the data you’re referring to are private. How do you manage privacy when sharing the data?


A: In the Dookhan case, the data was impounded, and I used security measures. The Boston maps showing where incidents occurred smudged the info across a grid of about half a mile.
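One common way to "smudge" locations like this is to snap each point to the center of a coarse grid cell. A minimal sketch, not Paola's actual method: the function name, the rough miles-per-degree constant, and the snap-to-center choice are all assumptions made for illustration.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough figure; an assumption for this sketch


def smudge(lat, lon, cell_miles=0.5):
    """Replace a point with the center of the grid cell that contains it."""
    lat_step = cell_miles / MILES_PER_DEGREE_LAT
    cell_lat = (math.floor(lat / lat_step) + 0.5) * lat_step
    # A degree of longitude covers less ground away from the equator,
    # so the longitude step is computed from the cell's own latitude.
    lon_step = cell_miles / (
        MILES_PER_DEGREE_LAT * math.cos(math.radians(cell_lat))
    )
    cell_lon = (math.floor(lon / lon_step) + 0.5) * lon_step
    return cell_lat, cell_lon


# An arbitrary point in downtown Boston, used only as sample input.
print(smudge(42.3601, -71.0589))
```

Every incident inside the same half-mile cell ends up at the same published coordinates, so a map can show density without exposing an address.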


A: Kate Crawford talks about how important Paola’s research was in the Dookhan case. “It’s really valuable for the ACLU to have a scientist working on data like this.”


Q: What happened to the people who were tried with Dookhan evidence?


A: [ann] Special magistrates and special hearings were set up…


Q: [charlie nesson] A MOOC is considering Yes on 4 (marijuana legalization ballot question) and someone asked if there is a relationship between cannabis reform and Black Lives Matter. And you’ve answered that question. It’s remarkable that BLM hasn’t cottoned on to cannabis reform as a sister issue.


Q: I’ve been reading Cathy O’Neil’s Weapons of Math Destruction [me too!] and I’m wondering if you could talk about your passion for social justice as a data scientist.


A: I’m Mexican. I learned to code when I was 12 because I had access to the Internet. I started working as a web developer at 15, and a few years later I was director of IT for the president’s office. I reflected on how I got that opportunity, and the answer was that it was thanks to open source. That inspired me.


Q: We are not looking at what happens to black women. They get criminalized even more often than black men. Also, has anyone looked at questions of environmental justice?


Q: How can we tell if a visualization is valid or is propaganda? Are there organizations doing this?


A: Great question, and I don’t know how to answer it. We publish the code, but of course not everyone can understand it. I’m not using AI or Deep Learning; I’m keeping it simple.


Q: What’s the next big data set you’re going to work on?


A: (She shows a visualization tool she developed that explores police budgets.)


Q: How do you work with journalists? Do you bring them in early?


A: We haven’t had that much interaction with them yet.


October 12, 2016

[liveblog] Perception of Moral Judgment Made by Machines

I’m at the PAPIs conference where Edmond Awad [twitter] at the MIT Media Lab is giving a talk about “Moral Machine: Perception of Moral Judgement Made by Machines.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

He begins with a hypothetical in which you can swerve a car to kill one person instead of staying on its course and killing five. The audience chooses to swerve, and Edmond points out that we’re utilitarians. Second hypothetical: swerve into a barrier that will kill you but save the pedestrians. Most of us say we’d like it to swerve. Edmond points out that this is a variation of the trolley problem, except now it’s a machine that’s making the decision for us.

Autonomous cars are predicted to reduce fatalities from accidents by 90%. He says his advisor’s research found that most people think a car should swerve and sacrifice the passenger, but they don’t want to buy such a car. They want everyone else to.

He connects this to the Tragedy of the Commons in which if everyone acts to maximize their good, the commons fails. In such cases, governments sometimes issue regulations. Research shows that people don’t want the government to regulate the behavior of autonomous cars, although the US Dept of Transportation is requiring manufacturers to address this question.

Edmond’s group has created the Moral Machine, a website that creates moral dilemmas for autonomous cars. There have been about two million users and 14 million responses.

Some national trends are emerging. E.g., Eastern countries tend to prefer to save passengers more than Western countries do. Now the MIT group is looking for correlations with other factors, e.g., religiousness, economics, etc. Also, what are the factors most crucial in making decisions?

They are also looking at the effect of automation levels on the assignment of blame. Toyota’s “Guardian Angel” mode results in humans being judged less harshly: that mode has a human driver but lets the car override human decisions.

Q&A

In response to a question, Edmond says that Mercedes has said that its cars will always save the passenger. He raises the possibility of the owner of such a car being held responsible for plowing into a bus full of children.

Q: The solutions in the Moral Machine seem contrived. The cars should just drive slower.

A: Yes, the point is to stimulate discussion. E.g., it doesn’t raise the possibility of swerving to avoid hitting someone who is in some way considered to be more worthy of life. [I’m rephrasing his response badly. My fault!]

Q: Have you analyzed chains of events? Does the responsibility decay the further you are from the event?

A: This very quickly gets game theoretical.


October 11, 2016

[liveblog] Bas Nieland, Datatrics, on predicting customer behavior

At the PAPIs conference Bas Nieland, CEO and Co-Founder of Datatrics, is talking about how to predict the color of shoes your customer is going to buy. The company tries to “make data science marketeer-proof for marketing teams of all sizes.” It tries to create 360-degree customer profiles by bringing together info from all the data silos.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

They use some machine learning to create these profiles. The profile includes the buying phase, the best time to present choices to a user, and the type of persuasion that will get them to take the desired action. [Yes, this makes me uncomfortable.]

It is structured around a core API that talks to mongoDB and MySQL. They provide “workbenches” that work with the customer’s data systems. They use BigML to operate on this data.

The outcomes are models that can be used to make recommendations. They use visualizations so that marketeers can understand it. But the marketeers couldn’t figure out how to use even simplified visualizations. So they created visual decision trees. But still the marketeers couldn’t figure it out. So they turn the data into simple declarative phrases: which audience they should contact, in which channel, what content, and when. E.g.:

“To increase sales, contact your customers in the buying phase with high engagement through FB with content about jeans on sale on Thursday, around 10 o’clock.”

They predict the increase in sales for each action, and quantify in dollars the size of the opportunity. They also classify responses by customer type and phase.

For a hotel chain, they connected 16,000 variables and 21M data points, which BigML reduced to 75 variables, creating a predictive model that ended up getting the chain more customer conversions. E.g., if the model says someone is in the orientation phase, the Web site shows photos of recommended hotels. If in the decision phase, the user sees persuasive messages, e.g., “18 people have looked at this room today.” The messages themselves are chosen based on the customer’s profile.

Coming up: Chatbot integration. It’s a “real conversation” [with a bot with a photo of an attractive white woman who is supposedly doing the chatting].

Take-aways: Start simple. Make ML very easy to understand. Make it actionable.

Q&A

Me: Is there a way built in for a customer to let your model know that it’s gotten her wrong? E.g., stop sending me pregnancy ads because I lost the baby.

Bas: No.

Me: Is that on the roadmap?

Bas: Yes. But not on a schedule. [I’m not proud of myself for this hostile question. I have been turning into an asshole over the past few years.]


[liveblog] Vinny Senguttuvan on Predicting Customers

Vinny Senguttuvan is Senior Data Scientist at METIS. Before that, he was at the Facebook-based gaming company High 5 Games, which had 10M users. His talk at PAPIs: “Predicting Customers.”

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

The main challenge: Most of the players play for free. Only 2% ever spend money on the site, buying extra money to play. (It’s not gambling because you never cash out.) 2% of those 2% contribute the majority of the revenue.

All proposed changes go through A/B testing. E.g., should we change the “Buy credits” button from blue to red. This is classic hypothesis testing. So you put up both options and see which gets the best results. It’s important to remember that there’s a cost to the change, so the A-B preference needs to be substantial enough. But often the differences are marginal. So you can increase the sample size. This complicates the process. “A long list of changes means not enough time per change.” And you want to be sure that the change affects the paying customers positively, which means taking even longer.
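A button-color test like this is commonly analyzed with a two-proportion z-test. A minimal sketch of that test, with invented click counts; the talk did not say which test High 5 actually uses, so treat the choice of test and the numbers as assumptions.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference between two click rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pool both groups to estimate the click rate under "no difference".
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical data: blue button (A) vs. red button (B).
z, p = two_proportion_z_test(clicks_a=200, views_a=10_000,
                             clicks_b=260, views_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A smaller difference between the two rates needs more views per variant before the p-value drops below the chosen threshold, which is the sample-size problem described above.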

When they don’t have enough samples, they can bring down the confidence level required to make the change. Or they could bias one side of the hypothesis. And you can assume the variables are independent and run simultaneous A-B tests on various variables. High 5 does all three. It’s not perfect but it works.

Second, there is a popularity metric by which they rank or classify their 100 games. They constantly add games (it went from 15 to 100 in two years). This continuously changes the ranking of the games. Plus, some are launched locked. This complicates things. Vinny’s boss came up with a model of an n-dimensional casino, but it was too complex. Instead, they take 2 simple approaches: 1. An average-weighted spin. 2. Bayesian. Both predicted well but had flaws, so they used a type of average of both.

Third: Survival analysis. They wanted to know how many users are still active a given time after they created their account, and when a user is at risk of discontinuing use. First, they grouped users into cohorts (people who joined within a couple of weeks of each other) and plotted survival rates over time. They also observed return rates of users after each additional day of absence. They also implemented a Cox survival model. They found that newer users were more likely to decline in their use of the product; early users are more committed. This pattern is widespread. That means they have to continuously acquire new players. They also alert users when they reach the elbow of disuse.
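A survival curve for one cohort can be sketched with the Kaplan-Meier estimator, which is simpler than the Cox model mentioned in the talk and is swapped in here only for illustration. The data is invented, and "censored" means the user was still active when observation stopped.

```python
def kaplan_meier(durations, churned):
    """Return [(day, survival_probability)] for each day on which someone quit.

    durations[i]: number of days user i was observed.
    churned[i]:   True if the user quit on that day,
                  False if still active when observation ended (censored).
    """
    event_days = sorted({d for d, c in zip(durations, churned) if c})
    survival = 1.0
    curve = []
    for day in event_days:
        # Users still being observed at the start of this day.
        at_risk = sum(1 for d in durations if d >= day)
        quits = sum(1 for d, c in zip(durations, churned) if c and d == day)
        survival *= 1 - quits / at_risk
        curve.append((day, survival))
    return curve


# Hypothetical cohort of six players.
days = [3, 5, 5, 8, 10, 10]
quit_flags = [True, True, False, True, False, False]
for day, prob in kaplan_meier(days, quit_flags):
    print(f"day {day}: {prob:.2f} still active")
```

Plotting one such curve per cohort is one way to see the pattern described above, where later cohorts drop off faster than earlier ones.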

Fourth: Predictive lifetime value. Lifetime value = total revenue from a user over the entire time they use the product. This is significant because of costs: 10-15% of the rev goes into ads to acquire customers. Their 365 day prediction model should be a time series, but they needed results faster, so they flipped it into a regression problem, predicting the 365 day revenue based on the user’s first month data: how they spent, purchase count, days of play, player level achievement, and the date joined. [He talks about regression problems, but I can’t keep up.] At that point it cost $2 to acquire a customer from an FB ad, and $6 from mobile apps. But when they tested, the mobile acquisitions were more profitable than those that came through FB. It turned out that FB was counting as new users any player who hadn’t played in 30 days, and was re-charging them for it. [I hope I got that right.]
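The "flip it into a regression" idea can be sketched in its simplest form: fit a straight line from one first-month feature to 365-day revenue. This is a toy under stated assumptions. The talk listed several features and did not name a model, so the single feature (first-month spend), the least-squares fit, and all the numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, x):
    """Predicted 365-day revenue for a first-month spend of x."""
    return slope * x + intercept


# Hypothetical training data, in dollars per user.
first_month_spend = [0, 5, 10, 20, 40]
year_revenue = [0, 12, 25, 48, 100]

slope, intercept = fit_line(first_month_spend, year_revenue)
print(f"Predicted 365-day revenue for $15 in month one: "
      f"${predict(slope, intercept, 15):.2f}")
```

Comparing a prediction like this against the acquisition cost per channel is what makes the $2 versus $6 comparison meaningful.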

Fifth: Recommendation systems. Pandora notes the features of songs and uses this to recommend similarities. YouTube makes recommendations based on relations among users. Non-matrix factorization [I’m pretty sure he just made this up] gives you the ability to predict the score for a video that you know nothing about in terms of content. But what if the ratings are not clearly defined? At High 5, there are no explicit ratings. They calculated a rating based on how often a player plays it, how long the session, etc. And what do you do about missing values: use averages. But there are too many zeroes in the system, so they use sparse matrix solvers. Plus, there is a semi-order to the games, so they used some human input. [Useful for library Stackscores?]
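Deriving a rating from behavior and filling gaps with averages can be sketched as follows. The scoring formula is hypothetical (the talk did not give High 5's formula), and the nested dictionary is just a simple stand-in for sparse storage, not the sparse matrix solvers mentioned.

```python
import math
from collections import defaultdict


def implicit_rating(play_count, avg_session_minutes):
    """Hypothetical score: rises with plays and session length,
    with diminishing returns for very heavy use."""
    return math.log1p(play_count) * math.log1p(avg_session_minutes)


def game_averages(ratings):
    """ratings maps player -> {game: rating}; only played games are stored."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for player_ratings in ratings.values():
        for game, rating in player_ratings.items():
            totals[game] += rating
            counts[game] += 1
    return {game: totals[game] / counts[game] for game in totals}


def predicted_rating(ratings, averages, player, game):
    """Use the player's own rating if there is one, else the game's average."""
    return ratings.get(player, {}).get(game, averages.get(game, 0.0))


# Invented players and games.
ratings = {
    "ann": {"slots": implicit_rating(40, 12)},
    "bob": {"slots": implicit_rating(5, 3), "poker": implicit_rating(20, 30)},
}
averages = game_averages(ratings)
print(predicted_rating(ratings, averages, "ann", "poker"))
```

Storing only the games a player has actually played avoids filling a huge table with zeroes, which is the sparsity problem noted above.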


[liveblog] First panel: Building intelligent applications with machine learning

I’m at the PAPIs conference. The opening panel is about building intelligent apps with machine learning. The panelists are all representing companies. It’s Q&A with the audience; I will not be able to keep up well.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

The moderator asks one of the panelists (Snejina Zacharia from Insurify) how AI can change a heavily regulated industry such as insurance. She replies that the insurance industry gets low marks for customer satisfaction, which is an opportunity. Also, they can leverage the existing platforms and build modern APIs on top of them. Also, they can explore how to use AI in existing functions, e.g., chatbots, systems that let users just confirm their identification rather than enter all the data. They also let users pick from an AI-filtered list of carriers that are right for them. Also, personalization: predicting risk and adjusting the questionnaire based on the user’s responses.

Another panelist is working on mapping for a company that is not Google and that is owned by three car companies. So, when an Audi goes over a bump, and then a Mercedes goes over it, it will record the same data. On personalization: it’s ripe for change. People are talking about 100B devices being connected by 2020. People think that RFID tags didn’t live up to their early hype, but 10 billion RFID tags are going to be sold this year. These can provide highly personalized, highly relevant data. This will be the base for the next wave of apps. We need a standards body effort, and governments addressing privacy and security. Some standards bodies are working on it, e.g., Global Standards 1, which manages the barcodes standard.

Another panelist: Why is marketing such a good opportunity for AI and ML? Marketers used to have a specific skill set. It’s an art: writing, presenting, etc. Now they’re being challenged by tech and have to understand data. In fact, now they have to think like scientists: hypothesize, experiment, redo the hypothesis… And now marketers are responsible for revenue. Being a scientist responsible for predictable revenue is driving interest in AI and ML. This panelist’s company uses data about companies and people to segmentize following up on leads, etc. [Wrong place for a product pitch, IMO, which is a tad ironic, isn’t it?]

Another panelist: The question is: how can we use predictive intelligence to make our applications better? Layer input intelligence on top of input-programming-output. For this we need a platform that provides services and is easy to attach to existing processes.

Q: Should we develop cutting edge tech or use what Google, IBM, etc. offer?

A: It depends on whether you’re an early adopter or straggler. Regulated industries have to wait for more mature tech. But if your bread and butter is based on providing the latest and greatest, then you should use the latest tech.

A: It also depends on whether you’re doing a vertically integrated solution or something broader.

Q: What makes an app “smart”? Is it: Dynamic, with rapidly changing data?

A: Marketers use personas, e.g., a handful of types. They used to be written in stone, just about. Smart apps update the personas after every campaign, every time you get new info about what’s going on in the market, etc.

Q: In B-to-C marketing, many companies have built the AI piece for advertising. Are you seeing any standardization or platforms on top of the advertising channels to manage the ads going out on them?

A: Yes, some companies focus on omni-channel marketing.

A: Companies are becoming service companies, not product companies. They no longer hand off to retailers.

A: It’s generally harder to automate non-digital channels. It’s harder to put a revenue number on, say, TV ads.


[liveblog] PAPIs: Cynthia Rudin on Regulating Greed

I’m at the PAPIs (Predictive Applications and APIs) [twitter: papisdotio] conference at the NERD Center in Cambridge.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

The first speaker is Cynthia Rudin, Director of the Prediction Analysis Lab at MIT. Her topic is “Regulating Greed over Time: An Important Lesson for Practical Recommender Systems.” It’s about her Lab’s entry in a data mining competition. (The entry did not win.) The competition was to design a better algorithm for Yahoo’s recommendation of articles. To create an unbiased data set they showed people random articles for two weeks. Your algorithm had to choose to show one of the pool of articles to a user. To evaluate a recommender system, they’d check if your algorithm recommended the same thing that was shown to the user. If the user clicked on it, you could get an evaluation. [I don’t think I got this right.] If so, you sent your algorithm to Yahoo, and they evaluated its clickthrough rate; you never got access to Yahoo’s data.
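The evaluation described here resembles what is often called replay evaluation: walk through the log of randomly shown articles and score the algorithm only on the events where its choice matches what was actually shown. A minimal sketch under that reading; the event format and function names are assumptions, and the competition's real scoring may have differed.

```python
def replay_ctr(logged_events, choose_article):
    """Estimate clickthrough rate from a log of randomly shown articles.

    logged_events:  iterable of (pool, shown_article, clicked) tuples.
    choose_article: function that takes the pool and returns one article.
    """
    matches = 0
    clicks = 0
    for pool, shown, clicked in logged_events:
        # Only events where the policy agrees with the log can be scored,
        # because that is the only case where the user's reaction is known.
        if choose_article(pool) == shown:
            matches += 1
            clicks += int(clicked)
    return clicks / matches if matches else 0.0


# Invented log: two articles, three impressions.
log = [
    (["a", "b"], "a", True),
    (["a", "b"], "b", True),
    (["a", "b"], "a", False),
]
print(replay_ctr(log, lambda pool: "a"))
```

Because the logged articles were shown at random, the matching events form an unbiased sample of what the algorithm would have experienced live.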

This is, she says, a form of the multi-armed bandit problem: one arm is better (more likely to lead to a pay out) but you don’t know which one. So you spend your time figuring out which arm is the best, and then you only pull that one. Yahoo and Microsoft are among the companies using multi-armed bandit systems for recommendation systems. “They’re a great alternative to massive A-B testing.” [No, I don’t understand this. Not Cynthia’s fault!]
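For readers who want to see the bandit idea concretely, here is a minimal sketch of epsilon-greedy, one of the simplest bandit strategies. It is not the algorithm Cynthia's team used; the clickthrough rates are invented, and `epsilon` (the share of pulls spent exploring) is an arbitrary choice.

```python
import random


def epsilon_greedy(true_rates, pulls=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; return how often each arm was pulled."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    clicks = [0] * len(true_rates)

    def observed_rate(arm):
        # Untried arms get top priority so every arm is sampled at least once.
        return clicks[arm] / counts[arm] if counts[arm] else float("inf")

    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))  # explore: pick any arm
        else:
            arm = max(range(len(true_rates)), key=observed_rate)  # exploit
        counts[arm] += 1
        # Simulated user: clicks with the arm's true (hidden) probability.
        clicks[arm] += rng.random() < true_rates[arm]
    return counts


# Three hypothetical articles with hidden clickthrough rates.
print(epsilon_greedy([0.02, 0.05, 0.15]))
```

Unlike an A-B test that splits traffic evenly until the end, the bandit shifts most traffic to the article that looks best while it is still learning.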

Because the team didn’t have access to Yahoo’s data, they couldn’t tune their algorithms to it. Nevertheless, they achieved a 9% clickthrough rate … and still lost (albeit by a tiny margin). Cynthia explains how they increased the efficiency of their algorithms, but it’s math so I can only here play the sound of a muted trumpet. But it involves “decay exploration on the old articles,” and a “peak grabber”: if any article gets more than 9 clicks out of the last 100 times they show it, they keep displaying it. If you have a good article, grab it. The dynamic version of a Peak Grabber had them continuing to show a peak article if it had a clickthrough rate 14% above the global clickthrough rate.
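The static peak-grabber rule (more than 9 clicks in an article's last 100 showings) can be sketched with a sliding window. The class and method names are made up, and this covers only the fixed-threshold rule, not the decay exploration or the dynamic 14% version.

```python
from collections import deque


class PeakGrabber:
    """Track each article's recent showings and flag one that is 'peaking'."""

    def __init__(self, window=100, min_clicks=9):
        self.window = window
        self.min_clicks = min_clicks
        self.history = {}  # article -> deque of 0/1 click outcomes

    def record(self, article, clicked):
        """Store the outcome of showing `article` once."""
        recent = self.history.setdefault(article, deque(maxlen=self.window))
        recent.append(int(clicked))  # old outcomes fall off automatically

    def peak_article(self):
        """Return the article with the most recent clicks, provided it has
        more than `min_clicks` in its window; otherwise return None."""
        best, best_clicks = None, self.min_clicks
        for article, recent in self.history.items():
            clicks = sum(recent)
            if clicks > best_clicks:
                best, best_clicks = article, clicks
        return best


grabber = PeakGrabber()
for shown in range(100):
    grabber.record("hot-story", clicked=(shown % 8 == 0))  # invented pattern
print(grabber.peak_article())
```

When `peak_article()` returns something, the recommender would keep showing it instead of exploring; when it returns `None`, normal exploration resumes.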

“We were adjusting the exploration-exploitation tradeoff based on trends.” Is this a phenomenon worth exploring? The phenomenon: you shouldn’t always explore. There are times when you should just stop and exploit the flowers.

Some data supports this. E.g., in England, on Boxing Day you should be done exploring and just put your best prices on things: not too high, not too low. When the clicks on your site are low, you should be exploring. When high, maybe not. “Life has patterns.” The multi-armed bandit techniques don’t know about these patterns.

Her group came up with a formal way of putting this. At each time there is a known reward multiplier: G(t). G is like the number of people in the store. When G is high, you want to exploit, not explore. In the lower zones you want to balance exploration and exploitation.
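One way to picture the role of G(t) is an exploration rate that shrinks as the multiplier grows. The linear schedule below is purely illustrative and is not one of the group's two algorithms; `g_max` (the largest multiplier expected) and `max_epsilon` are assumed parameters.

```python
def exploration_rate(g_t, g_max, max_epsilon=0.3):
    """Hypothetical schedule: explore most when the reward multiplier G(t)
    is low, and not at all when it is at its peak."""
    if g_max <= 0:
        raise ValueError("g_max must be positive")
    share = min(max(g_t / g_max, 0.0), 1.0)  # clamp to [0, 1]
    return max_epsilon * (1.0 - share)


# Invented store traffic over a day, with a peak of 100 shoppers.
for shoppers in (5, 40, 100):
    print(shoppers, round(exploration_rate(shoppers, g_max=100), 3))
```

Plugged into a bandit, a rate like this would spend the quiet hours learning which item is best and the busy hours showing it.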

So they created two theorems, each leading to an algorithm. [She shows the algorithm. I can’t type in math notation that fast.]


September 21, 2016

[iab] Frances Donegan-Ryan

At the IAB conference, Frances Donegan-Ryan from Bing begins by reminding us of the history of online search.

We all leave digital footprints, she says. Every time we search, data is recorded. The sequence of our searches gives especially useful information to help the engine figure out what you’re trying to find out. Now the engines can refer to social graphs.

“But what do we do with data?”

Bing Predicts looks at all the data it can in order to make predictions. It began by predicting the winners and losers in American Idol, and got it 100% right. For this election year, it tried to predict who would win each state primary or caucus in the US. Then it took in sentiment data to figure out which issues matter in each state, broken down by demographic groups.

Now, for example, it can track a new diabetes drug through the places people visit when logged into their browser. This might show that there are problems with the drug; consider for example people searching for unexpected side effects of it. Bing shares the result of this analysis with the CDC. [The acoustics where I was sitting were poor. I’m not sure I got this right.]

They’re doing the same for retail products, and are able to tell which will be the big sellers.

Frances talks about Cortana, “the only digital system that works across all platforms.” Microsoft is working on many more digital assistants (Bots) that live within other services. She shows a temporary tattoo made from gold leaf that you can use as a track pad, and in other ways; this came out of MIT.

She says that the Microsoft version of a Fitbit can tell if you’re dehydrated or tired, and then can point you to the nearest place with water and a place to sit. Those shops could send you a coupon.

She goes quickly over the Hololens since Robert Scoble covered it so well this morning.

She closes with a story about using sensor data to know when a cow is in heat, which, it turns out, correlates with them walking faster. Then the data showed at what point in the period of fertility a male or female calf is likely to be conceived. Then they started using genetic data to predict genetically disabled calves.

It takes enormous computing power to do this sort of data analysis.

[iab] Privacy discussion

I’m at the IAB conference in Toronto. Canada has a privacy law, PIPEDA (the Personal Information Protection and Electronic Documents Act), which began coming into force in 2001 and is based on OECD principles.

Barbara Bucknell is the director of policy and research at the Office of the Privacy Commissioner, where she worries about how to protect privacy while being able to take advantage of all the good stuff data can do.

A recent large survey found that more than half of Canadians are more concerned about privacy than they were last year. Only 34% think the govt is doing enough to keep their privacy safe. Globally, 8 out of 10 are worried about their info being bought, sold, or monitored. “Control is the key concern here.” “They’re worried about surprises: ‘Oh, I didn’t know you were using my information that way!'”

Adam Kardash [this link?] says that all the traditional approaches to privacy have to be carefully reconsidered. E.g., data minimization says you only collect what you need. “It’s a basic principle that’s been around forever.” But data scientists, when asked how much data they need for innovation, will say “We need it all.” Also, it’s incredibly difficult to explain how your data is going to be used, especially at the grade 6-7 literacy rate that is required. And for data retention, we should keep medical info forever. Marketers will tell you the same thing so they can give you information about what you really need.

Adam raises the difficulties with getting consent, which the OPC opened a discussion about. Often asking for consent is a negligible part of the privacy process. “The notion of consent is having an increasingly smaller role” while the question of control is growing.

He asks Barbara “How does PIPEDA facilitate trust?”

Barbara: It puts guardrails into the process. They may be hard to implement but they’re there for a reason. The original guidelines from the OECD were prescient. “It’s good to remember there were reasons these guardrails were put in place.”

Consent remains important, she says, but there are also other components, including accountability. The organization has to protect data and be accountable for how it’s used. Privacy needs to be built into services and into how your company is organized. Are the people creating the cool tech talking to the privacy folks and to the legal folks? “Is this conversation happening at the front end?” You’d be surprised how many organizations don’t have those kinds of programs in place.

Barbara: Can you talk to the ethical side of this?

Adam: Companies want to know how to be respectful as part of their trust framework, not just meeting the letter of the law. “We believe that the vast majority of Big Data processing can be done within the legal framework. And then we’re creating a set of questions” in order for organizations to feel comfortable that what they’re doing is ethical. This is very practical, because it forestalls lawsuits. PIPEDA says that organizations can only process data for purposes a reasonable person would consider appropriate. We think that includes the ethical concerns.

Adam: How can companies facilitate trust?

Barbara: It’s vital to get these privacy management programs into place that will help facilitate discussions of what’s not just legal but respectful. And companies have to do a better job of explaining to individuals how they’re using their data.

March 1, 2016

[berkman] Dries Buytaert

I’m at a Berkman [twitter: BerkmanCenter] lunchtime talk (I’m moderating, actually) where Dries Buytaert is giving a talk about some important changes in the Web.

He begins by recounting his early days as the inventor of Drupal, in 2001. He’s also the founder of Acquia, one of the fastest growing tech companies in the US. It currently has 750 people working on products and services for Drupal. Drupal is used by about 3% of the billion web sites in the world.

When Drupal started, he felt he “could wrap his arms” around everything going on on the Web. Now that’s impossible, he says. E.g., Google AdWords was just starting, but now AdWords is a $65B business. The mobile Web didn’t exist. Social media didn’t yet exist. Drupal was (and is) Open Source, a concept that most people didn’t understand. “Drupal survived all of these changes in the market because we thought ahead” and then worked with the community.

“The Internet has changed dramatically” in the past decade. Big platforms have emerged. They’re starting to squeeze smaller sites out of the picture. There’s research that shows that many people think that Facebook is the Internet. “How can we save the open Web?,” Dries asks.

What do we mean by the open or closed Web? The closed Web consists of walled gardens. But these walled gardens also do some important good things: bringing millions of people online, helping human rights and liberties, and democratizing the sharing of information. But, their scale is scary. FB has 1.6B active users every month; Apple has over a billion iOS devices. Such behemoths can shape the news. They record data about our behavior, and they won’t stop until they know everything about us.

Dries shows a table of what the different big platforms know about us. “Google probably knows the most about us” because of Gmail.

The closed web is winning “because it’s easier to use.” E.g., after Dries moved from Belgium to the US, Facebook etc. made it much easier to stay in touch with his friends and family.

The open web is characterized by:

  1. Creative freedom: you could create any site you wanted and style it any way you pleased

  2. Serendipity. That’s still there, but it’s less used. “We just scroll our FB feed and that’s it.”

  3. Control: you owned your own data.

  4. Decentralized: open standards connected the pieces

Closed Web:

  1. Templates dictate your creative license

  2. Algorithms determine what you see

  3. Privacy is in question

  4. Information is siloed

The big platforms are exerting control. E.g., Twitter closed down its open API so it could control the clients that access it. FB launched “Free Basics” that controls which sites you can access. Google lets people purchase results.

There are three major trends we can’t ignore, he says.

First, there’s the “Big Reverse of the Web,” about which Dries has been blogging. “We’re in a transformational stage of the Web,” flipping it on its head. We used to go to sites and get the information we want. Now information is coming to us. “Info, products, and services will come to us at the right time on the right device.”

Second, “Data is eating the world.”

Third, “Rise of the machines.”

For example, “content will find us,” AKA “mobile or contextual information.” If your flight is cancelled, the info available to you at the airport will provide the relevant info, not offer you car rentals for when you arrive. This creates a better user experience, and “user experience always wins.”

Will the Web be open or closed? “It could go either way.” So we should be thinking about how we can build data-driven, user-centric algorithms. “How can we take back control over our data?” “How can we break the silos” and decentralize them while still offering the best user experience? “How do we compete with Google in a decentralized way? Not exactly easy.”

For this, we need more transparency about how data is captured and used, but also how the algorithms work. “We need an FDA for data and algorithms.” (He says he’s not sure about this.) “It would be good if someone could audit these algorithms,” because, for example, Google’s can affect an election. But how to do this? Maybe we need algorithms to audit the algorithms?

Second, we need to protect our data. Perhaps we should “build personal information brokers.” You unbundle FB and Google, put the data in one place, and through APIs give apps access to them. “Some organizations are experimenting with this.”

Third, decentralization and a better user experience. “For the open web to win, we need to be much better to use.” This is where Open Source and open standards come in, for they allow us to build a “layer of tech that enables different apps to communicate, and that makes them very easy to use.” This is very tricky. E.g., how do you make it easy to leave a comment on many different sites without requiring people to log in to each?

It may look almost impossible, but global projects like Drupal can have an impact, Dries says. “We have to try. Today the Web is used by billions of people. Tomorrow by more people.” The Internet of Things will accelerate the Net’s effect. “The Net will change everything, every country, every business, every life.” So, “we have a huge responsibility to build the web that is a great foundation for all these people for decades to come.”

[Because I was moderating the discussion, I couldn’t capture it here. Sorry.]
