Joho the Blog: machine learning Archives

December 17, 2017

[liveblog] Ulla Richardson on a game that teaches reading

I’m at the STEAM ed Finland conference in Jyväskylä where Ulla Richardson is going to talk about GraphoLearn, an adaptive learning method for learning to read.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Ulla has been working on the Jyväskylä Longitudinal Study of Dyslexia (JLD). Globally, one third of people can’t read or have poor reading skills. The same goes for one fifth of Europe. About 15% of children have learning disabilities.


One issue: knowing which sound goes with which letters. GraphoLearn is a game to help students with this, developed by a multidisciplinary team. You learn a word by connecting a sound to a written letter. Then you can move to syllables and words. The game teaches by trial and error. If you get it wrong, it immediately tells you the correct sound. It uses a simple adaptive approach to select the wrong choices that are presented. The game aims at being entertaining, and also motivates with points and rewards. It’s a multi-modal system: visual and audio. It helps dyslexics by training them on the distinctions between sounds. Unlike human beings, it never displays any impatience.

It adapts to the user’s skill level, automatically assessing performance and aiming at 80% accuracy so that it’s challenging but not too challenging.
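To make the 80% idea concrete, here is a minimal sketch of one way a game could steer toward a target accuracy. It is not GraphoLearn’s actual logic, which the talk did not detail; the class name, the rolling window, and the ten difficulty levels are all assumptions made up for illustration.

```python
from collections import deque


class AdaptiveDifficulty:
    """Hypothetical controller that nudges difficulty toward a target accuracy."""

    def __init__(self, target=0.8, window=10, min_level=1, max_level=10):
        self.target = target
        self.recent = deque(maxlen=window)  # rolling record of right/wrong answers
        self.level = min_level
        self.min_level = min_level
        self.max_level = max_level

    def record(self, correct):
        """Store one answer and adjust the level once the window is full."""
        self.recent.append(bool(correct))
        if len(self.recent) < self.recent.maxlen:
            return self.level
        accuracy = sum(self.recent) / len(self.recent)
        if accuracy > self.target and self.level < self.max_level:
            self.level += 1      # too easy: make it harder
            self.recent.clear()  # re-measure at the new level
        elif accuracy < self.target and self.level > self.min_level:
            self.level -= 1      # too hard: ease off
            self.recent.clear()
        return self.level
```

A player who answers a full window correctly moves up a level, and one who falls below the target moves back down, so the observed accuracy tends to hover around the target.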


13,000 players have played in Finland, and more in other languages. Ulla displays data that shows positive results among students who use GraphoLearn, including when teaching English where every letter has multiple pronunciations.


There are some difficulties analyzing the logs: there’s great variability in how kids play the game, how long they play, etc. There’s no background info on the students. [I missed some of this.] There’s an opportunity to come up with new ways to understand and analyze this data.


Q&A


Q: Your work is amazing. When I was learning English I could already read Finnish, so I made natural mispronunciations of ape, anarchist, etc. How do you cope with this?


A: Spoken and written English are like separate languages, especially if Finnish is your first language where each letter has only one pronunciation. You need a bigger unit to teach a language like English. That’s why we have the Rime approach where we show the letters in more context. [I may have gotten this wrong.]


Q: How hard is it to adapt the game to each language’s logic?


A: It’s hard.


December 16, 2017

[liveblog] Mirka Saarela and Sanna Juutinen on analyzing education data

I’m at the STEAM ed Finland conference in Jyväskylä. Mirka Saarela and Sanna Juutinen are talking about their analysis of education data.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


There’s a triennial worldwide study by the OECD to assess students (PISA). Usually, people are only interested in its ranking of education by country. Finland does extremely well at this. This is surprising because Finland does not do particularly well in the factors that are taken to produce high quality educational systems. So Finnish ed has been studied extensively. PISA augments this analysis using learning analytics. (The US does at best average in the OECD ranking.)


Traditional research usually starts with the literature, develops a hypothesis, collects the data, and checks the result. PISA’s data mining approach starts with the data. “We want to find a needle in the haystack, but we don’t know what the needle looks like.” That is, they don’t know what type of pattern to look for.


Results of 2012 PISA: If you cluster all 24M students with their characteristics and attitudes without regard to their country you get clusters for Asia, developing world, Islamic, western countries. So, that maps well.


For Finland, the most salient factor seems to be its comprehensive school system that promotes equality and equity.

In 2015 for the first time there was a computerized test environment available. Most students used it. The logfile recorded how long students spent on a task and the number of activities (mouse clicks, etc.) as well as the score. They examined the Finnish log file to find student profiles, related to students’ strategies and knowledge. Their analysis found five different clusters. [I can’t read the slide from here. Sorry.] They are still studying what this tells us. (They purposefully have not yet factored in gender.)
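For readers who want a feel for what clustering log data involves, here is a small sketch using plain k-means. The talk did not say which clustering method the researchers used, so k-means is a stand-in, and the rows below are invented (time on task, number of actions, score). Real data would also need its features scaled so that seconds don’t swamp the score.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on equal-length numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Made-up log rows: (seconds on task, number of actions, score).
logs = [
    (30, 5, 0.2), (35, 6, 0.3), (32, 4, 0.25),        # quick, few actions
    (120, 40, 0.8), (110, 38, 0.9), (125, 42, 0.85),  # slow, many actions
]
centroids, labels = kmeans(logs, k=2)
```

The algorithm only groups rows that look alike. Deciding what a group means (a strategy? a knowledge gap?) is the part the researchers say they are still working on.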


Nov. 2017 results showed that girls did far better than boys. The test was done in a chat environment which might have been more familiar for the girls? Is the computerization of the tests affecting the results? Is the computerization of education affecting the results? More research is needed.


Q&A


Q: Does the clustering suggest interventions? E.g., “Slow down. Less clicking.”

A: [I couldn’t quite hear the answer, but I think the answer is that it needs more analysis. I think.]


Q: I work for ETS. Are the slides available?


A: Yes, but the research isn’t public yet.


[liveblog] Harri Ketamo on micro-learning

I’m at the STEAM ed Finland conference in Jyväskylä. Harri Ketamo is giving a talk on “micro-learning.” He recently won a prestigious prize for the best new ideas in Finland. He is interested in the use of AI for learning.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

We don’t have enough good teachers globally, so we have to think about ed in new ways, Harri says. Can we use AI to bring good ed to everyone without hiring 200M new teachers globally? If we paid teachers equivalent to doctors and lawyers, we could hire those 200M. But we are apparently not willing to do that.


One challenge: Career coaching. What do you want to study? Why? What are the skills you need? What do you need to know?


His company does natural language analysis — not word matches, but meaning. As an example he shows a shareholder agreement. Such agreements always have the same elements. After being trained on law, his company’s AI can create a map of the topic and analyze a block of text to see if it covers the legal requirements…the sort of work that a legal assistant does. For some standard agreements, we may soon not need lawyers, he predicts.


The system’s language model is a mess of words and relations. But if you zoom out from the map, the AI has clustered the concepts. At the Slush Shanghai conference, his AI could develop a list of the companies a customer might want to meet based on a text analysis of the companies’ web sites, etc. Likewise if your business is looking for help with a project.


Finland has a lot of public data about skills and openings. Universities’ curricula are publicly available. [Yay!] Unlike LinkedIn, all this data is public. Harri shows a map that displays the skills and competencies Finnish businesses want and the matching training offered by Finnish universities. The system can explore public information about a user and map that to available jobs and the training that is required and available for it. The available jobs are listed with relevancy expressed as a percentage. It can also look internationally to find matches.


The AI can also put together a course for a topic that a user needs. It can tell what the core concepts are by mining publications, courses, news, etc. The result is an interaction with a bot that talks with you in a WhatsApp-like way. (See his paper “Agents and Analytics: A framework for educational data mining with games based learning”). It generates tests that show what a student needs to study if she gets a question wrong.


His newest project, in process: Libraries are the biggest collections of creative, educational material, so the AI ought to point people there. His software can find the common sources among courses and areas of study. It can discover the skills and competencies that materials can teach. This lets it cluster materials around degree programs. It can also generate micro-educational programs, curating a collection of readings.

His platform has an open API. See Headai.

Q&A


Q: Have you done controlled experiments?


A: Yes. We’ve found that people get 20-40% better performance when our software is used in a blended model, i.e., with a human teacher. It helps motivate people if they can see the areas they need to work on disappear over time.


Q: The sw only found male authors in the example you put up of automatically collated materials.


A: Small training set. Gender is not part of the metadata in Finland.


Q: Don’t you worry that your system will exacerbate bias?


A: Humans are biased. AI is a black box. We need to think about how to manage this.


Q: [me] Are the topics generated from the content? Or do you start off with an ontology?


A: It creates its ontology out of the data.


Q: [me] Are you committing to make sure that the results of your AI do not reflect the built in biases?


A: Our news system on the Web presents a range of views. We need to think about how to do this for gender issues with the course software.


December 5, 2017

[liveblog] Conclusion of Workshop on Trustworthy Algorithmic Decision-Making

I’ve been at a two-day workshop sponsored by Michigan State University and the National Science Foundation: “Workshop on Trustworthy Algorithmic Decision-Making.” After multiple rounds of rotating through workgroups iterating on five different questions, each group presented its findings: questions, insights, areas of future research.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Seriously, I cannot capture all of this.

Conduct of Data Science

What are the problems?

  • Who defines and how do we ensure good practice in data science and machine learning?

Why is the topic important? Because algorithms are important. And they have important real-world effects on people’s lives.

Why is the problem difficult?

  • Wrong incentives.

  • It can be difficult to generalize practices.

  • Best practices may be good for one goal but not another, e.g., efficiency but not social good. Also: Lack of shared concepts and vocabulary.

How to mitigate the problems?

  • Change incentives

  • Increase communication via vocabularies, translations

  • Education through MOOCS, meetups, professional organizations

  • Enable and encourage resource sharing: an open source lesson about bias, code sharing, data set sharing

Accountability group

The problem: How to integratively assess the impact of an algorithmic system on the public good? “Integrative” = the impact may be positive and negative and affect systems in complex ways. The impacts may be distributed differently across a population, so you have to think about disparities. These impacts may well change over time.

We aim to encourage work that is:

  • Aspirationally causal: measuring outcomes causally but not always through randomized control trials.

  • The goal is not to shut down algorithms but to make positive contributions that generate solutions.

This is a difficult problem because:

  • Lack of variation in accountability, enforcements, and interventions.

  • It’s unclear what outcomes should be measured and how. This is context-dependent.

  • It’s unclear which interventions are the highest priority

Why progress is possible: There’s a lot of good activity in this space. And it’s early in the topic so there’s an ability to significantly influence the field.

What are the barriers for success?

  • Incomplete understanding of contexts. So, think of it in terms of socio-cultural approaches, and make it interdisciplinary.

  • The topic lies between disciplines. So, develop a common language.

  • High-level triangulation is difficult. Examine the issues at multiple scales, multiple levels of abstraction. Where you assess accountability may vary depending on what level/aspect you’re looking at.

Handling Uncertainty

The problem: How might we holistically treat and attribute uncertainty through data analysis and decision systems? Uncertainty exists everywhere in these systems, so we need to consider how it moves through a system. This runs from choosing data sources to presenting results to decision-makers and people impacted by these results, and beyond that its incorporation into risk analysis and contingency planning. It’s always good to know where the uncertainty is coming from so you can address it.

Why difficult:

  • Uncertainty arises from many places

  • Recognizing and addressing uncertainties is a cyclical process

  • End users are bad at evaluating uncertain info and incorporating uncertainty in their thinking.

  • Many existing solutions are too computationally expensive to run on large data sets

Progress is possible:

  • We have sampling-based solutions that provide a framework.

  • Some app communities are recognizing that ignoring uncertainty is reducing the quality of their work
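As one illustration of what a sampling-based approach can look like, here is a minimal Monte Carlo sketch that pushes an uncertain input through a model and reports the spread of the outputs. The group did not name a specific method, so this is an assumption on my part; the function name and the toy model are invented.

```python
import random
import statistics


def propagate(model, input_mean, input_sd, samples=10_000, seed=0):
    """Push input uncertainty through a model by sampling; returns (mean, stdev) of the output."""
    rng = random.Random(seed)
    outputs = [model(rng.gauss(input_mean, input_sd)) for _ in range(samples)]
    return statistics.mean(outputs), statistics.stdev(outputs)


# Made-up model: a score that is twice the measurement plus one.
mean, sd = propagate(lambda x: 2 * x + 1, input_mean=5.0, input_sd=0.5)
```

For this linear toy model the output spread comes out at roughly twice the input spread, which is the kind of thing a decision-maker would want to see next to the headline number. The cost is the number of samples, which is where the earlier point about computational expense bites.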

How to evaluate and recognize success?

  • A/B testing can show that decision making is better after incorporating uncertainty into analysis

  • Statistical/mathematical analysis

Barriers to success

  • Cognition: Train users.

  • It may be difficult to break this problem into small pieces and solve them individually

  • Gaps in theory: many of the problems cannot currently be solved algorithmically.

The presentation ends with a note: “In some cases, uncertainty is a useful tool.” E.g., it can make the system harder to game.

Adversaries, workarounds, and feedback loops

Adversarial examples: add a perturbation to a sample and it disrupts the classification. An adversary tries to find those perturbations to wreck your model. Sometimes this is used not to hack the system so much as to prevent the system from, for example, recognizing your face during a protest.
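Here is a toy sketch of that idea, assuming a simple linear classifier with made-up weights. It nudges each feature a small step in the direction that lowers the score for the true label, in the spirit of sign-of-gradient attacks. Real attacks target far more complex models; all the numbers here are invented.

```python
def predict(weights, bias, x):
    """Linear classifier: returns 1 if the score is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def perturb(weights, x, epsilon, label):
    """Shift every feature by epsilon in the direction that hurts the true label."""
    # For a linear score the gradient with respect to x is just the weights,
    # so the sign of each weight says which way to push that feature.
    direction = -1 if label == 1 else 1
    return [
        xi + direction * epsilon * (1 if w > 0 else -1 if w < 0 else 0)
        for w, xi in zip(weights, x)
    ]


weights, bias = [2.0, -1.0, 0.5], -0.5  # made-up model
x = [0.6, 0.3, 0.4]                     # made-up sample, classified as 1
x_adv = perturb(weights, x, epsilon=0.2, label=1)
```

No feature moves by more than 0.2, yet the predicted class flips from 1 to 0.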

Feedback loops: A recidivism prediction system says you’re likely to commit further crimes, which sends you to prison, which increases the likelihood that you’ll commit further crimes.
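The loop can be sketched as a toy simulation. Everything in it is an assumption for illustration: the threshold, the size of the detention effect, and the idea that risk is a single number. It is not a model of any real system.

```python
def simulate(initial_risk, threshold=0.5, detention_effect=0.1, rounds=5):
    """Toy loop: a score above the threshold triggers detention, which raises the next score."""
    risk = initial_risk
    history = [risk]
    for _ in range(rounds):
        if risk > threshold:
            # Detention itself pushes the measured risk up, capped at 1.0.
            risk = min(1.0, risk + detention_effect)
        history.append(risk)
    return history


just_above = simulate(0.55)  # starts slightly over the threshold
just_below = simulate(0.45)  # starts slightly under it
```

Two people who start out a hair apart end up far apart, and the system’s later scores look like confirmation of its first prediction.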

What is the problem: How should a trustworthy algorithm account for adversaries, workarounds, and feedback loops?

Who are the stakeholders?

System designers, users, non-users, and perhaps adversaries.

Why is this a difficult problem?

  • It’s hard to define the boundaries of the system

  • From whose vantage point do we define adversarial behavior, workarounds, and feedback loops?

Unsolved problems

  • How do we reason about the incentives users and non-users have when interacting with systems in unintended ways?

  • How do we think about oversight and revision in algorithms with respect to feedback mechanisms?

  • How do we monitor changes, assess anomalies, and implement safeguards?

  • How do we account for stakeholders while preserving rights?

How to recognize progress?

  • Mathematical model of how people use the system

  • Define goals

  • Find stable metrics and monitor them closely

  • Proximal metrics. Causality?

  • Establish methodologies and see them used

  • See a taxonomy of adversarial behavior used in practice

Likely approaches

  • Security methodology for anticipating unintended behaviors and adversarial interactions. Monitor and measure.

  • Record and taxonomize adversarial behavior in different domains

  • Test. Try to break things.

Barriers

  • Hard to anticipate unanticipated behavior

  • Hard to define the problem in particular cases.

  • Goodhart’s Law

  • Systems are born brittle

  • What constitutes adversarial behavior vs. a workaround is subjective.

  • Dynamic problem

Algorithms and trust

How do you define and operationalize trust?

The problem: What are the processes through which different stakeholders come to trust an algorithm?

Multiple processes lead to trust.

  • Procedural vs. substantive trust: are you looking at the weights of the algorithms (e.g.), or what were the steps to get you there?

  • Social vs personal: did you see the algorithm at work, or are you relying on peers?

These pathways are not necessarily predictive of each other.

Stakeholders build trust through multiple lenses and priorities

  • the builders of the algorithms

  • the people who are affected

  • those who oversee the outcomes

Mini case study: a child services agency that does not want to be identified. [All of the following is 100% subject to my injection of errors.]

  • The agency uses a predictive algorithm. The stakeholders range from the children needing a family, to NYers as a whole. The agency knew what went into the model. “We didn’t buy our algorithm from a black-box vendor.” They trusted the algorithm because they staffed a technical team who had credentials and had experience with ethics…and who they trusted intuitively as good people. Few of these are the quantitative metrics that devs spend their time on. Note that FAT (fairness, accountability, transparency) metrics were not what led to trust.

Temporality:

  • Processes that build trust happen over time.

  • Trust can change or maybe be repaired over time.

  • “The timescales to build social trust are outside the scope of traditional experiments,” although you can perhaps find natural experiments.

Barriers:

  • Assumption of reducibility or transfer from subcomponents

  • Access to internal stakeholders for interviews and process understanding

  • Some elements are very long term

 


 

What’s next for this workshop

We generated a lot of scribbles, post-it notes, flip charts, Slack conversations, slide decks, etc. They’re going to put together a whitepaper that goes through the major issues, organizing them, and tries to capture the complexity while helping to make sense of it.

There are weak or no incentives to set appropriate levels of trust.

Key takeaways:

  • Trust is irreducible to FAT metrics alone

  • Trust is built over time and should be defined in terms of the temporal process

  • Isolating the algorithm as an instantiation misses the socio-technical factors in trust.


December 4, 2017

Workshop: Trustworthy Algorithmic Decision-Making

I’m at a two-day inter-disciplinary workshop on “Trustworthy Algorithmic Decision-Making” put on by the National Science Foundation and Michigan State University. The 2-page whitepapers from the participants are online. (Here’s mine.) I may do some live-blogging of the workshops.

Goals:

– Key problems and critical questions?

– What to tell policy-makers and others about the impact of these systems?

– Product approaches?

– What ideas, people, training, infrastructure are needed for these approaches?

Excellent diversity of backgrounds: CS, policy, law, library science, a philosopher, more. Good diversity in gender and race. As the least qualified person here, I’m greatly looking forward to the conversations.


December 2, 2017

[liveblog] Doaa Abu-Elyounes on "Bail or Jail? Judicial vs. Algorithmic decision making"

I’m at a weekly AI talk put on by Harvard’s Berkman Klein Center for Internet & Society and the MIT Media Lab. Doaa Abu-Elyounes is giving a talk called “Bail or Jail? Judicial vs. Algorithmic decision making”.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Doaa tells us that this talk is a work in progress.

We’ve all heard now about AI-based algorithms that are being used to do risk assessments in pretrial bail decisions. She thinks this is a good place to start using algorithms, although it’s not easy.

The pre-trial stage is supposed to be very short. The court has to determine if the defendant, presumed innocent, will be released on bail or jailed. The sole considerations are supposed to be whether the defendant is likely to harm someone else or flee. Preventive detention has many effects, mostly negative for the defendant. (The US is a world leader in pre-trial detainees. Yay?)

Risk assessment tools have been used for more than 50 years. Actuarial tools have shown greater predictive power than clinical judgment, and can eliminate some of the discretionary powers of judges. Use of these tools has long been controversial. What types of factors should be included in the tool? Is the use of demographic factors to make predictions fair to individuals?

Existing tools use regression analysis. Now machine learning can learn from much more data. Mechanical predictions [= machine learning] are more accurate than statistical predictions, but may not be explicable.

We think humans can explain their decisions and we want machines to be able to as well. But look at movie reviews. Humans can tell if a review is positive. We can teach which words are positive or negative, getting 60% accuracy. Or we can have a human label the reviews as positive or negative and let the machine figure out what the factors are via machine learning, in which case we get 80% accuracy but may lose explicability.
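The contrast between the two approaches can be sketched in a few lines. This is a toy, not what the speaker showed: the word list and the four training reviews are invented, and the “learning” is just counting which words show up in which kind of review.

```python
from collections import Counter

HAND_PICKED = {"great": 1, "wonderful": 1, "boring": -1, "awful": -1}  # made-up lexicon


def score(review, lexicon):
    """Sum the lexicon weights of the words in a review; above zero reads as positive."""
    return sum(lexicon.get(word, 0) for word in review.lower().split())


def learn_lexicon(labeled_reviews):
    """Weight each word by (times seen in positive reviews) - (times seen in negative ones)."""
    weights = Counter()
    for review, is_positive in labeled_reviews:
        for word in review.lower().split():
            weights[word] += 1 if is_positive else -1
    return weights


training = [
    ("a gripping and great film", True),
    ("gripping from start to finish", True),
    ("a tedious and boring film", False),
    ("tedious plot and awful acting", False),
]
learned = learn_lexicon(training)
```

The hand-picked list has nothing to say about “utterly gripping,” while the learned weights pick up “gripping” and “tedious” from the labels alone. This toy stays inspectable because it is just a table of word counts; the explicability worry arises with models whose learned factors can’t be read off so easily.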

With pretrial situations, what is the automated task that the machine should be performing?

There’s a tension between accuracy and fairness. Computer scientists are trying to quantify these questions: What does a fair algorithm look like? Jon Kleinberg and colleagues did a study of this [this one?]. Their algorithms reduced violent crime by 25% with no change in jailing rates, without increasing racial disparities. In short, the algorithm seems to have done a more accurate job with less bias.

Doaa lists four assessment tools she will be looking at: the Pretrial Risk Assessment [this one?], the Public Safety Assessment, the Virginia Pretrial Risk Assessment Instrument and the Colorado Pretrial Assessment Tool.

Doaa goes through questions that should be asked of these tools, beginning with: Which factors are considered in each? [She dives into the details for all four tools. I can’t capture it. Sorry.]

What are the sources of data? (3 out of 4 rely on interviews and databases.)

What is the quality of the data? “This is the biggest problem jurisdictions are dealing with when using such a tool.” “Criminal justice data is notoriously poor.” And, of course, if a machine learning system is trained on discriminatory data, its conclusions are likely to reflect those biases.

Each tool needs to be periodically validated using data from its own district’s population. Local data matters.

There should be separate scores for flight risk and public safety. All but the PSA provide only a single score. This is important because there are separate remedies for the two concerns. E.g., you might want to lock up someone who is a risk to public safety, but take away the passport of someone who is a flight risk.

Finally, the systems should discriminate among reasons for flight risk. E.g., because the defendant can’t afford the cost of making it to court or because she’s fleeing?

Conclusion: Pretrial is the front door of the criminal justice system and affects what happens thereafter. Risk assessment tools should not replace judges, but they bring benefits. They should be used, and should be made as transparent as possible. There are trade offs. The tool will not eliminate all bias but might help reduce it.

Q&A

Q: Do the algorithms recognize the different situations of different defendants?

A: Systems do recognize this, but not in sophisticated ways. That’s why it’s important to understand why a defendant might be at risk of missing a court date. Maybe we could provide poor defendants with a Metro card.

Q: Could machine learning be used to help us be more specific in the types of harm? What legal theories might we draw on to help with this?

A: [The discussion got too detailed for me to follow. Sorry.]

Q: There are different definitions of recidivism. What do we do when there’s a mismatch between the machines and the court?

A: Some states give different weights to different factors based on how long ago the prior crimes were committed. I haven’t seen any difference in considering how far ahead the risk of a possible next crime is.

Q: [me] While I’m very sympathetic to allowing machine learning to be used without always requiring that the output be explicable, when it comes to the justice system, do we need explanations so not only is justice done, but we can have trust that it’s being done?

A: If we can say which factors are going into a decision — and it’s not a lot of them — if the accuracy rate is much higher than manual systems, then maybe we can give up on always being able to explain exactly how it came to its decisions. Remember, pre-trial procedures are short and there’s usually not a lot of explaining going on anyway. It’s unlikely that defendants are going to argue over the factors used.

Q: [me] Yes, but what about the defendant who feels that she’s being treated differently than some other person and wants to know why?

A: Judges generally don’t explain how they came to their decisions anyway. The law sets some general rules, and the comparisons between individuals are generally within the framework of those rules. The rules don’t promise to produce perfectly comparable results. In fact, you probably can’t easily find two people with such similar circumstances. There are no identical cases.

Q: Machine learning, multilevel regression, and human decision making all weigh data and produce an outcome. But ML has little human interaction, statistical analysis has some, and the human decision is all human. Yet all are in fact algorithmic: the judge looks at a bond schedule to set bail. Predictability as fairness is exacerbated by the human decisions since the human cannot explain her model.

Q: Did you find any logic about why jurisdictions picked which tool? Any clear process for this?

A: It’s hard to get that information about the procurement process. Usually they use consultants and experts. There’s no study I know of that looks at this.

Q: In NZ, the main tool used for risk assessment for domestic violence is a Canadian tool called ODARA. Do tools work across jurisdictions? How do you reconcile data sets that might be quite different?

A: I’m not against using the same system across jurisdictions — it’s very expensive to develop one from scratch — but they need to be validated. The federal tool has not been, as far as I know. (It was created in 2009.) Some tools do better at this than others.

Q: What advice would you give to a jurisdiction that might want to procure one? What choices did the tools make in terms of what they’re optimized for? Also: What about COMPAS?

A: (I didn’t talk about COMPAS because it’s notorious and not often used in pre-trial, although it started out as a pre-trial tool.) The trade off seems to be between accuracy and fairness. Policy makers should define more strictly where the line should be drawn.

Q: Who builds these products?

A: Three out of the four were built in house.

Q: PSA was developed by a consultant hired by the Arnold Foundation. (She’s from Luminosity.) She has helped develop a number of the tools.

Q: Why did you decide to research this? What’s next?

A: I started here because pre-trial is the beginning of the process. I’m interested in the fairness question, among other things.

Q: To what extent are the 100+ factors that the Colorado tool considers available publicly? Is their rationale for excluding factors public? Because they’re proxies for race? Because they’re hard to get? Or because back then 100+ seemed like too many? And what’s the overlap in factors between the existing systems and the system Kleinberg used?

A: Interviewing defendants takes time, so 100 factors can be too much. Kleinberg only looked at three factors. Another tool relied on six factors.

Q: Should we require private companies to reveal their algorithms?

A: There are various models. One is to create an FDA for algorithms. I’m not sure I support that model. I think private companies need to expose at least to the govt the factors that they’re including. Others would say I’m too optimistic about the government.

Q: In China we don’t have the pre-trial part, but there’s an article saying that they can make the sentencing more fair by distinguishing among crimes. Also, in China the system is more uniform so the data can be aggregated and the system can be made more accurate.

A: Yes, states are different because they have different laws. Exchanging data between states is not very common and may not even be possible.


October 27, 2017

[liveblog] Nathan Matias on The Social impact of real-time algorithm decisions

J. Nathan Matias is giving a talk at the weekly AI session held by MIT Media Lab and Harvard’s Berkman Klein Center for Internet & Society. The title is: Testing the social impact of real-time algorithm decisions. (SPOILER: Nate is awesome.) Nathan will be introducing CivilServant.io to us, a service for researching the effects of tech and how it can be better directed toward the social outcomes we (the civil society “we”) desire. (That’s my paraphrase.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

In 2008, the French government approved a law against Web sites that encourage anorexia and bulimia. In 2012, Instagram responded to pressure to limit hashtags that “actively promote self-harm.” Instagram had 40M users, almost as many as France’s 55M active Net users. Researchers at Georgia Tech several years later found that some self-harm sites on Instagram had higher engagement after Instagram’s actions. “If your algorithm reliably detects people who are at risk of committing suicide, what next?” If the intervention isn’t helpful, your algorithm is doing harm.

Nathan shows a two-axis grid for evaluating algorithms: fair-unfair and benefits-harms. Accuracy should be considered to be on the same axis as fairness because it can be measured mathematically. But you can’t test the social impact without putting it into the field. “I’m trying to draw attention to the vertical axis [harm-benefit].”

We often have in mind a particular pipeline: training > model > prediction > people. Sometimes there are rapid feedback loops where the decisions made by people feed back into the model. A judicial system’s prediction risk scores may have no such loop. But the AI that manages a news feed is probably getting the readers’ response as data that tunes the model.

We have organizations that check the quality of items we deal with: UL for electrical products, etc. But we don’t have that sort of consumer protection for social tech. The results are moral panics, bad policies, etc. This is the gap Nate is trying to fill with CivilServant.io, a project supported by the Media Lab and GlobalVoices.

Here’s an example of one of CivilServant’s projects:

Managing fake news is essential for democracy. The social sciences have been dealing with this for quite a while by doing research on individual perception and beliefs, on how social context and culture influence beliefs … and now on algorithms that make autonomous decisions that affect us as citizens, e.g., newsfeeds. Newsfeeds work this way: someone posts a link. People react to it, e.g. upvote, discuss, etc. The feed service watches that behavior and uses it to promote or demote the item. And then it feeds back in.

We’ve seen lots of examples of pernicious outcomes of this. E.g., at Reddit an early upvote on a post can have a dramatic impact on its rating over time.
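The loop described above (post, react, rank, feed back) can be sketched as a toy simulation. This is not Reddit’s ranking algorithm: the attention weights, the number of rounds, and the function name `simulate_feed` are all assumptions made for illustration. Each round, posts are ranked by score and collect expected upvotes in proportion to a rank-based attention weight, so whichever post is ahead early tends to stay ahead.

```python
def simulate_feed(initial_scores, rounds=20, votes_per_round=10.0):
    """Toy rich-get-richer feed: higher-ranked posts get more attention.

    Hypothetical model, not any real platform's algorithm.
    """
    scores = list(initial_scores)
    n = len(scores)
    # Assumed attention: rank 1 gets weight 1, rank 2 gets 1/2, and so on.
    weights = [1.0 / (rank + 1) for rank in range(n)]
    total_weight = sum(weights)
    for _ in range(rounds):
        # Rank posts by current score; ties keep their original order.
        order = sorted(range(n), key=lambda i: -scores[i])
        for rank, post in enumerate(order):
            # Expected upvotes this round are proportional to attention.
            scores[post] += votes_per_round * weights[rank] / total_weight
    return scores


# Two otherwise identical posts; in the second run post 1 gets one early upvote.
baseline = simulate_feed([0.0, 0.0])
nudged = simulate_feed([0.0, 1.0])
print("no early vote:    ", [round(s, 1) for s in baseline])
print("one early upvote: ", [round(s, 1) for s in nudged])
```

In this sketch a single early vote decides which post ends up on top, which is the kind of sensitivity that makes the human side of the loop worth studying.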

What can we do to govern online misinfo? We could surveil and censor. We could encourage counter-speech. We can imagine some type of algorithmic governance. We can use behavioral nudges, e.g. Facebook tagging articles as “disputed.” But all of these assume that these interventions change behaviors and beliefs. Those assumptions are not always tested.

Nate was approached by /r/worldnews at Reddit, a subreddit with 14M subscribers and 70 moderators. At Reddit, moderating can be a very time-consuming effort. (Nate spoke to a Reddit mod who had stopped volunteering at a children’s hospital in order to be a mod because she thought she could do more good that way.) This subreddit’s mods wanted to know if they could question the legitimacy of an item without causing it to surge on the platform. Fact-checking a post could nudge Reddit’s AI to boost its presence because of the increased activity.

So, they did an experiment asking people to fact check an article, or fact check and downvote if you can’t verify it. They monitored the ranking of the articles by Reddit for 3 months. [Nate now gives some math. Sorry I can’t capture (or understand) it.] The result: to his surprise, encouraging fact checking reduced the average rank position of an article. Encouraging fact checking and down-voting reduced the spread of inaccurate news by Reddit’s algorithms. [I’m not confident I’m getting that right.]

Why did encouraging fact checking reduce rankings, but fact checking and voting did not? The mods think this might be because it gave users a constructive way to handle articles from reviled sources, reducing the number of negative comments about them. [I hope I’m getting this right.] Also, “reactance” may have nudged people to upvote just to spite the instructions. Also, users may have mobilized friends to vote on the articles. Also, encouraging two tasks (fact check and then vote) rather than one may have influenced the timing of the algorithm, making the down-votes less impactful.

This is what Nate calls an “AI-Nudge”: a “second-order effect of influencing human behavior on the behavior of an algorithmic system.” It means you have to think about how humans interact with AI.

Often when people are working on AI, they’re starting from computer science and math. The question is: how can we use social science methods to research the effect of AI? Paluck and Cialdini see a cycle of Pilot/Lab experiments > qualitative methods > field experiences > theory / policy / design. In the Reddit example, Nathan spent considerable time with the community to understand their issues and how they interact with the AI.

Another example of a study: identifying and reducing side-effects of automated copyright law enforcement on Twitter. When people post something to Twitter, bots monitor it to see if it violates copyright, resulting in a DMCA takedown notice being issued. Twitter then takes it down. The Lumen Project from BKC archives these notices. The CivilServant project observes those notices in real time to study the effects. E.g., a user’s tweets per day tend to drop after they receive a takedown notice, and then continue dropping throughout the 42-day period they researched. Why this long-term decrease in posting? Maybe fear and risk. Maybe awareness of surveillance.

So, how can these chilling effects be reduced? The CivilServant project automatically sends users info about their rights and about surveillance. The results of this intervention are not in yet. The project hopes to find ways to lessen the public’s needless withdrawal from social media. The research can feed empirical legal studies. Policymakers might find it useful. Civil rights orgs as well. And the platforms themselves.

In the course of the Q&As, Nathan mentions that he’s working on ways to explain social science research that non-experts can understand. CivilServant’s work is with user communities and it’s developed a set of ways for communicating openly with the users.

Q: You’re trying to make AI more fair…

A: I’m doing consumer protection, so as experts like you work on making AI more fair, we can see the social effects of interventions. But there are feedback loops among them.

Q: What would you do with a community that doesn’t want to change?

A: We work with communities that want our help. In the 1970s, Campbell wrote an essay: “The Experimenting Society.” He asked if by doing behavioral research we’re becoming an authoritarian society because we’re putting power in the hands of the people who can afford to do the research. He proposed enabling communities to do their own studies and research. He proposed putting data scientists into towns across the US, pooling their research, and challenging their findings. But this was before the PC. Now it’s far more feasible.

Q: What sort of pushback have you gotten from communities?

A: Some decide not to work with us. In others, there’s contention about the shape of the project. Platforms have changed how they view this work. Three years ago, the platforms felt under siege and wounded. That’s why I decided to create an independent organization. The platforms have a strong incentive to protect their reputations.

October 19, 2017

[liveblog] AI and Education session

Jenn Halen, Sandra Cortesi, Alexa Hasse, and Andres Lombana Bermudez of the Berkman Klein Youth and Media team are leading a discussion about AI and Education at MIT Media Lab as part of the Ethics and Governance of AI program run jointly by Harvard’s Berkman Klein Center for Internet & Society and the MIT Media Lab.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Sandra gives an introduction to the BKC Youth and Media project. She points out that their projects are co-designed with the groups that they are researching. From the AI folks they’d love ideas and better understanding of AI, for they are just starting to consider the importance of AI to education and youth. They are creating a Digital Media Literacy Platform (which Sandra says they hope to rename).

They show an intro to AI designed to be useful for a teacher introducing the topic to students. It defines, at a high level, AI, machine learning, and neural networks. They also show “learning experiences” (= “XP”) that Berkman Klein summer interns came up with, including AI and well-being, AI and news, autonomous vehicles, and AI and art. They are committed to working on how to educate youth about AI not only in terms of particular areas, but also privacy, safety, etc., always with an eye towards inclusiveness.

They open it up for discussion by posing some questions. 1. How to promote inclusion? How to open it up to the most diverse learning communities? 2. Did we spot any errors in their materials? 3. How to reduce the complexity of this topic? 4. Should some of the examples become their own independent XPs? 5. How to increase engagement? How to make it exciting to people who don’t come into it already interested in the topic?

[And then it got too conversational for me to blog…]

October 10, 2017

[liveblog][bkc] Algorithmic fairness

I’m at a special Berkman Klein Center Tuesday lunch, a panel on “Programming the Future of AI: Ethics, Governance, and Justice” with Cynthia Dwork, Christopher L. Griffin, Margo I. Seltzer, and Jonathan L. Zittrain, in a discussion moderated by Chris Bavitz.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

They begin with brief intros of their interests:

Chris Griffin: One of the big questions for use of algorithms in the justice system is: what is the alternative? Human decision making has its own issues.

Margo Seltzer: She’s been working on transparent models. She would always prefer to be able to get an index card’s worth of explanation of how a machine learning system has come up with its output.

Cynthia Dwork: What is our definition of fairness, and how might we evaluate the fairness of our machine systems? She says she’s not that big a fan of insisting on explanations.

Jonathan Zittrain: What elements of this ought to be contracted out? Can we avoid the voting machine problem of relying on a vendor we don’t necessarily trust? Also, it may be that explanations don’t help us that much. Also, we have to be very wary of biases built into the data. Finally, AI might be able to shed light on interventions before problems arise, e.g., city designs that might lower crime rates.

Chris Bavitz: Margo, say more about transparency…

Seltzer: Systems ought to be designed so that if you ask why it came up with that conclusion, it can tell you in a way that you can understand. Not just a data dump.

Bavitz: The legal system generally expects that, but is that hard to do?

Seltzer: It seems that in some cases you can achieve higher accuracy with models that are not explicable. But not always.

Dwork: Yes.

Zittrain: People like Cynthia Rudin have been re-applying techniques from the 1980s that are explainable. But I’ve been thinking about David Weinberger’s recent work [yes, me] that reality may depend on factors that are deeply complex and that don’t reduce down to understandable equations.

Dwork: Yes. But back to Margo. Rule lists have antecedents and probabilities. E.g., you’re trying to classify mushrooms as poisonous or not. There are features you notice: shape of the head, odor, texture, etc. You can generate rule lists that are fairly simple: if the stalk is like this and the smell is like that, then it’s likely poisonous. But you can also have “if/else” conditions. The conclusions can be based on very complex dependencies among these factors. So, the question of why something was classified some way can be much more complicated than meets the eye.
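A rule list of the kind described here can be sketched as an ordered series of if/else-if rules, each with an antecedent and a probability. The features, rules, and probabilities below are invented for illustration; they are not real mushroom-safety guidance, and `classify` is a hypothetical helper, not a method from any particular paper or library.

```python
from typing import Callable, Dict, List, Tuple

Mushroom = Dict[str, str]
Rule = Tuple[str, Callable[[Mushroom], bool], float]

# Ordered rules: (description, antecedent, assumed P(poisonous)).
# All features and probabilities are made up for this sketch.
RULE_LIST: List[Rule] = [
    ("odor is foul", lambda m: m["odor"] == "foul", 0.95),
    ("stalk is bulbous and cap is conical",
     lambda m: m["stalk"] == "bulbous" and m["cap"] == "conical", 0.80),
    ("odor is almond", lambda m: m["odor"] == "almond", 0.05),
]
DEFAULT_PROBABILITY = 0.30  # assumed fallback when no rule fires


def classify(mushroom: Mushroom) -> Tuple[float, str]:
    """Return P(poisonous) and the reason: the first rule that fires."""
    skipped = []
    for description, antecedent, probability in RULE_LIST:
        if antecedent(mushroom):
            # The full "why" includes every earlier rule that did NOT fire.
            reason = " and ".join([f"not ({s})" for s in skipped] + [description])
            return probability, reason
        skipped.append(description)
    reason = " and ".join(f"not ({s})" for s in skipped)
    return DEFAULT_PROBABILITY, reason


sample = {"odor": "almond", "stalk": "bulbous", "cap": "flat"}
probability, reason = classify(sample)
print(probability, "because", reason)
```

Even with three rules, the reason for the third one firing carries the negation of the first two, which is one way to see why “why was this classified this way?” gets complicated quickly.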

Seltzer: I agree. Let’s say you were turned down for the loan. You might not be able to understand the complex of factors, but you might be able to find a factor you can address.

Dwork: Yes, but the question “Is there a cheap and easy path that would lead to a different outcome?” is a very different question than “Why was I classified some particular way?”

Griffin: There’s a multi-level approach to assessing transparency. We can’t expect the public to understand the research by which a model is generated. But how is that translated into scoring mechanisms? What inputs are we using? If you’re assessing risk from 1 to 6, does the decision-maker understand the difference between, say, a 2 and 3?

Zittrain: The data going in often is very reductive. You do an interview with a prisoner who doesn’t really answer so you take a stab at it … but the stabbiness of that data is not itself input. [No, Zittrain did not say “stabbiness”].

Griffin: The data quality issue is widespread. In part this is because the data sets are discrete. It would be useful to abstract ID’s so the data can be aggregated.

Zittrain: Imagine you can design mushrooms. You could design a poisonous one with the slightest variation from edible ones to game the system. A real life example: the tax system. I think I’d rather trust machine learning than a human model that can be more easily gamed.

Bavitz: An interviewer who doesn’t understand the impact of the questions she’s asking might be a feature, not a bug, if you want to get human bias out of the model…

Seltzer: The suspicion around machine algorithms stems from a misplaced belief that humans are fair and unbiased. The combination of a human and a machine, if the human can understand the machine’s model, might result in less biased decisions than either on their own.

Bavitz: One argument for machine learning tools is consistency.

Griffin: The ethos of our system would be lost. We rely on a judicial official to use her or his wisdom, experience, and discretion to make decisions. “Bias could be termed as the inability to perceive with sufficient clarity.” [I missed some of this. Sorry.]

Bavitz: If the data is biased, can the systems be trained out of the bias?

Dwork: Generally, garbage in, garbage out. There are efforts now, but they’re problematic. Maybe you can combine unbiased data with historical data, and use that to learn models that are less biased.

Griffin: We’re looking for continuity in results. With the prisoner system, the judge gets a list of the factors lined up with the prisoner’s history. If the judge wants to look at that background and discard some of the risk factors because they’re so biased, s/he can ignore the machine’s recommendation. There may be some anchoring bias, but I’d argue that that’s a good thing.

Bavitz: How about the private, commercial actors who are providing this software? What if these companies don’t want to make their results interpretable so as not to give away their special sauce?

Dwork: When Facebook is questioned, I like to appeal to the miracle of modern cryptography that lets us prove that secrets have particular properties without decrypting them. This can be applied to algorithms so you can show that one has a particular property without revealing that algorithm itself. There’s a lot of technology out there that can be used to preserve the secrecy of the algorithm, if that were the only problem.

Zittrain: It’d be great to be able to audit a tech while keeping the algorithm secret, but why does the company want to keep it secret? Especially if the results of the model are fed back in, increasing lock-in. I can’t see why we’d want to farm this out to commercial entities. But that hasn’t been on the radar because entrepreneurial companies are arising to do this for municipalities, etc.

Seltzer: First, the secrecy of the model is totally independent from the business model. Second, I’m fine with companies building these models, but it’s concerning if they’re keeping the model secret. Would you take a pill if you had no idea how it worked?

Zittrain: We do that all the time.

Dwork: That’s an example of relying on testing, not transparency.

Griffin: Let’s say we can’t get the companies to reveal the algorithms or the research. The public doesn’t want to know the reasoning behind the decision (unless there’s litigation over a particular case), but whether it works.

Zittrain: Assume re-arrest rates are influenced by factors that shouldn’t count. The algorithm would reflect that. What can we do about that?

Griffin: The evidence is overwhelming about the disparity in stops by race and ethnicity. The officers are using the wrong proxies for making these decisions. If you had these tools throughout the lifespan of such a case, you might be able to change this. But these are difficult issues.

Seltzer: Every piece of software has bugs. The thought of sw being used in a way where I don’t know what it thinks it’s doing or what it’s actually doing gives me a lot of pause.

Q&A

Q: The government keeps rehiring the same contractors who fail at their projects. The US Digital Service insists that contractors develop their sw in public. They fight this. Second, many engineering shops don’t think about the bias in the data. How do we infuse that into companies?

Dwork: I’m teaching it in a new course this semester…

Zittrain: The syllabus is secret. [laughter]

Seltzer: We inject issues of ethics into our every CS course. You have to consider the ethics while you’re designing and building the software. It’s like considering performance and scalability.

Bavitz: At the Ethics and Governance of AI project at the Berkman Klein Center, we’ve been talking about the point of procurement: what do the procurers need to be asking?

Q: The panel has talked about justice, augmenting human decision-making, etc. That makes it sound like we have an idea of some better decision-making process. What is it? How will we know if we’ve achieved it? How will models know if they’re getting it right, especially over time as systems get older?

Dwork: Huge question. Exactly the right question. If we knew who ought to be treated similarly to whom for any particular classification class, everything would become much easier. A lot of AI’s work will be discovering this metric of who is similar to whom, and how similar. It’s going to be an imperfect but improving situation. We’ll be doing the best guess, but as we do more and more research, our idea of what is the best guess will improve.
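One way to read “who ought to be treated similarly to whom” is as a constraint that can be checked once a similarity metric exists: two people should not receive scores that differ by more than the distance between them. The sketch below assumes such a metric is simply handed to us, which is exactly the hard part described above. The applicants, the `distance` lookup, and the scores are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def unfair_pairs(
    scores: Dict[str, float],
    distance: Callable[[str, str], float],
) -> List[Tuple[str, str]]:
    """List pairs whose scores differ by more than their task-specific distance."""
    violations = []
    for a, b in combinations(sorted(scores), 2):
        if abs(scores[a] - scores[b]) > distance(a, b):
            violations.append((a, b))
    return violations


# Hypothetical metric: a lookup table of how dissimilar two applicants are
# for this particular classification task (0 = interchangeable).
ASSUMED_DISTANCES = {
    ("ana", "ben"): 0.05,
    ("ana", "cal"): 0.60,
    ("ben", "cal"): 0.60,
}


def distance(a: str, b: str) -> float:
    """Look up the assumed distance regardless of argument order."""
    return ASSUMED_DISTANCES[tuple(sorted((a, b)))]


# Hypothetical model outputs, e.g. probability of approving a loan.
model_scores = {"ana": 0.90, "ben": 0.40, "cal": 0.35}
print(unfair_pairs(model_scores, distance))
```

A check like this cannot say what a fair score would be; it can only flag pairs that were treated more differently than the metric allows.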

Zittrain: Cynthia, your work may not always let us see what’s fair, but it does help us see what is unfair. [This is an important point. We may not be able to explain what fairness is exactly, but we can still identify unfairness.] We can’t ask machine learning pattern recognition to come up with a theory of justice. We have to rely on judges, legislators, etc. to do that. But if we ease the work of judges by only presenting the borderline cases, do we run the risk of ossifying the training set on which the judgments by real judges were made? Will the judges become de-skilled? Do you keep some running continuously in artisanal courtrooms…? [laughter]

Griffin: I don’t think that any of these risk assessments can solve any of these optimization problems. That takes a conversation in the public sphere. A jurisdiction has to decide what its tolerance for risk is, what its tolerance is for the cost of incarceration, etc. The tool itself won’t get you to that optimized outcome. It will be the interaction of the tool and the decision-makers. That’s what gets optimized over time. (There is some baseline uniformity across jurisdictions.)

Q: Humans are biased. Assume a normal distribution across degrees of bias. AI can help us remove the outliers, but it may rely on biased data.

Dwork: I believe this is the bias problem we discussed.

Q: Wouldn’t it be better to train it on artificial data?

Seltzer: Where does that data come from? How do we generate realistic but unbiased data?

September 26, 2017

[liveblog][pair] Blaise Agüera y Arcas on the source of bias

At the PAIR Symposium, Google’s Blaise Agüera y Arcas is providing some intellectual and historical perspective on AI issues.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

[Note: This is a tough talk to live-blog because it is carefully structured intellectually. My apologies.]

He says neural networks have been part of the computing environment from the beginning. E.g., he thinks that the loop at the end of the logic gate symbol in fact comes from a 1943 symbolization of biological neural networks. There are indications of neural networks in Turing’s early papers. So these ideas go way back. Blaise thinks that the majority of computing processes in a few years will be running on processors designed for running neural networks.

ML has raised anxiety reminiscent of Walter Benjamin’s concern (he cites The Work of Art in the Age of Mechanical Reproduction) about the mass reproduction of art that strips it of its aura. Now there’s the same kind of moral panic about art and human exceptionalism and existence. (Cf. Nick Bostrom’s Superintelligence.) It reminds him of Jakob Mohr’s 1910 The Influencing Machine in which schizophrenics believe they’re being influenced by an external machine. (They always thought men were managing the machine.) He points to what he calls Bostrom’s ultimate colonialism, in which we are able to populate the universe with 10^58 human minds. [Sorry, but I didn’t get this. My fault.] He ties this to Bacon’s reverence for the domination of nature. Blaise prefers a feminist view, citing Kember & Zylinska’s Life After New Media.

Many say we have a value alignment problem, he says: how do we make AI that embeds human values? But AI systems do have human values because they’re trained on human data. The problem is that our human values are off. He references a paper on judging criminality based on faces. The paper claims it’s free of human biases. But it’s based on data that is biased. Nevertheless, this sort of tech is being commercialized. E.g., Faception claims to classify people based on their faces: High IQ, Pedophile, etc.

Also, there’s the recent paper about an ML system that classifies one’s gender preferences based on faces. Blaise ran a test on Mechanical Turk asking about some of the features in the composite gay and straight faces in that paper. He found that people attracted to the same sex were more likely to wear glasses. There were also significant differences in facial hair, use of makeup, and face tan, features also in the composite faces. Thus, the ML system might have been using social markers, not physiognomy. “There are a lot of tells.”

In conclusion, none of these are arguments against ML. On the contrary. The biases and prejudices, and the social signalling, are things ML lets us hold a mirror up to.
