December 9, 2010
November 30, 2010
Brewster Kahle, Victoria Stodden, and Richard Cox are on a panel chaired by the National Archives’ Director of Litigation, Jason Baron. The conference is being put on by Princeton’s Center for Information Technology Policy.
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellchecker. Mangling other people’s ideas and words. You are warned, people.
Brewster goes first. He’s going to talk about “public policy in the age of digital reproduction.” “We are in a jam,” he says, because of how we have viewed our world as our tech has changed. Brewster founded the Internet Archive, a non-profit library. The aim is to make freely accessible everything ever published, from the Sumerian texts on. “Everyone everywhere ought to have access to it” — that’s a challenge worthy of our generation, he says.
He says the time is ripe for this. The Internet is becoming ubiquitous. If there aren’t laptops, there are Internet cafes. And there are mobiles. Plus, storage is getting cheaper and smaller. You can record “100 channel years” of HD TV in a petabyte for about $200,000, and store it in a small cabinet. For about $1,200, you could store all of the text in the Library of Congress. Google’s copy of the WWW is about a petabyte. The WayBack Machine uses 3 petabytes and has about 150 billion pages. It’s used by 1.5 million people a day. A small organization, like the Internet Archive, can take this task on.
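As a quick sanity check on those storage figures, here’s a back-of-envelope computation. The implied per-channel bitrate is derived from the claim itself, not something Brewster stated:

```python
# Back-of-envelope check: what average bitrate does "100 channel-years
# of HD TV in a petabyte" imply? The figures below are derived from the
# claim, not from the talk itself.
HOURS_PER_YEAR = 365 * 24
PETABYTE_BYTES = 10**15

total_hours = 100 * HOURS_PER_YEAR             # 876,000 channel-hours
bytes_per_hour = PETABYTE_BYTES / total_hours  # roughly 1.14 GB per hour

# Implied average bitrate in megabits per second.
implied_mbps = bytes_per_hour * 8 / 3600 / 10**6
print(f"~{bytes_per_hour / 10**9:.2f} GB/hour, ~{implied_mbps:.1f} Mbps")
```

That works out to roughly 2.5 Mbps per channel, which is plausible only with fairly heavy compression, so the claim reads as an order-of-magnitude estimate.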
This archive is dynamic, he says. The average Web page has 15 links. The average Web page changes every 100 days.
There are downsides to the archive. E.g., the WayBack Machine gets used to enable lawsuits. We don’t want people to pull out of the public sphere. “Get archived, go to jail,” is not a useful headline. Brewster says that they once got an FBI letter asking for info, which they successfully fought (via the EFF). The Archive gets lots of lawyer letters. They get about 50 requests per week to have material taken out of the Archive. Rarely do people ask for other people’s stuff to be taken down. Once, the Scientologists wanted some copyright-infringing material taken down from someone else’s archived site; the Archive finally agreed to this. The Archive held a conference and came up with the Oakland Archive Policy for issues such as these.
Brewster points out that Jon Postel’s taxonomy is sticking: .com, .org, .gov, .edu, .mil … Perhaps we need separate policies for each of these, he says. And how do we take policy ideas and make them effective? E.g., if you put up a robots.txt exclusion, you will nevertheless get spidered by lots of people.
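Brewster’s robots.txt point can be made concrete: exclusion is purely advisory, enforced only by crawlers that choose to check it. A minimal sketch using Python’s standard-library parser (the bot name and URLs are invented):

```python
# robots.txt is advisory: a polite crawler checks it like this before
# fetching, but nothing stops an impolite one. Bot name and URLs invented.
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
# Parse the rules directly instead of fetching, to keep the sketch offline.
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rules.can_fetch("ExampleBot", "http://example.com/private/page.html"))  # False
print(rules.can_fetch("ExampleBot", "http://example.com/public/page.html"))   # True
```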
“We can build the Library of Alexandria,” he concludes, “but it might be problematic.”
Q: I’ve heard people say they don’t need to archive their sites because you will.
A: Please archive your own. More copies make us safe.
Q: What do you think about the Right to Oblivion movement, which says that some types of content should self-destruct on a schedule, e.g., on Facebook?
A: I have no idea. It’s really tough. Personal info is so damn useful. I wish we could keep our computers from being used against us in court; if we interpreted the 5th Amendment so that the “we” it protects included our computers…
Richard Cox says if you golf, you know about info overload. It used to be that you had one choice of golf ball, Top-Flite. Now they have twenty varieties.
Archives are full of stories waiting to be told, he says. “When I think about Big Data…most archivists would think we’re talking about big science, the corporate world, and government.” Most archivists work in small cultural, public institutions. Richard is going to talk about the shifting role of archivists.
As early as the 1940s, archivists were talking about machine-readable records. The debates and experiments have been going on for many decades. One early approach was to declare that electronic records were not archives, because the archives couldn’t deal with them. (Archivists and records managers have always been at odds, he says, because RM is about retention schedules, i.e., deleting records.) Over time, archivists came up to speed. By 2000, some were dealing with electronic records. In 2010, many do, but many do not. There is a continuing debate. Archivists have spent too long debating among themselves when they need to be talking with others. But, “archivists tend not to be outgoing folks.” (Archivists have had issues with the National Archives because their methods don’t “scale down.”)
There are many projects these days. E.g., we now have citizen archivists who maintain their own archives and who may contribute to public archives. Who are today’s archivists? Archival educators are redefining the role. Richard believes archives will continue, but the profession may not. He recommends reading the Clair report [I couldn’t get the name or the spelling, and can’t find it on Google :( ] on audio-visual archives. “I read it and I wept.” It says that we need people who understand the analog systems so that they can be preserved, but there’s no funding.
The gloomy title of Victoria Stodden’s talk is “The Coming Dark Ages in Scientific Knowledge.”
She begins by pointing to the pervasive use of computers and computational methods in the sciences, and even in the humanities and law schools. E.g., Northwestern is looking at the word counts in Shakespearean works. It’s changing the type of scientific analysis we’re doing. We can do very complicated simulations that give us a new way of understanding our world. E.g., we do simulations of math proofs, quite different from the traditional deductive processes.
This means that what we’re doing as scientists is being stored in scripts, code, data, etc. But science only is science when it’s communicated. If the data and scripts are not shared, the results are not reproducible. We need to act as scientists to make sure that this data etc. are shared. How do we communicate results based on enormous data sets? We have to give access to those data sets. And what happens when those data sets change (are corrected or updated)? What happens to results based on the earlier sets? We need to preserve the prior versions of the data. How do we version it? How do we share it? E.g., there’s an experiment at NSF: all proposals have to include a data management plan. The funders and journals have a strong role to play here.
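One low-tech way to address the versioning problem Stodden raises is to record a content hash of the exact dataset alongside each published result. A minimal sketch (the script name and data are hypothetical):

```python
# Minimal sketch of data versioning for reproducibility: fingerprint the
# exact dataset a result was computed from, so later readers can verify
# they are re-running against the same bytes. Names and data invented.
import hashlib
import json

def dataset_fingerprint(data: bytes) -> str:
    """Content hash identifying one immutable version of the data."""
    return hashlib.sha256(data).hexdigest()

data_v1 = b"sample,value\na,1\nb,2\n"
record = {
    "result": "mean(value) = 1.5",
    "script": "analysis.py",  # hypothetical script name
    "data_sha256": dataset_fingerprint(data_v1),
}
print(json.dumps(record, indent=2))

# If the dataset is later corrected, the fingerprint changes, so the old
# result stays tied to the old version rather than silently drifting.
data_v2 = b"sample,value\na,1\nb,3\n"
assert dataset_fingerprint(data_v2) != record["data_sha256"]
```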
Sharing scientific knowledge is harder than it sounds, but is vital. E.g., a recent study showed that a cancer therapy will be particularly effective based on individual genomes. But it was extremely hard to trace back the data and code used to get this answer. Victoria notes that peer reviewers do not check the data and algorithms.
Why a dark age? Because “without reproducibility, knowledge cannot be recreated or understood.” We need ways and processes of sharing. Without them, we only have scientists making proclamations.
She gives some recommendations: (1) Assessment of the expense of data/code archiving. (2) Enforcement of funding agency guidelines. (3) Publication requirements. (4) Standards for scientific tools. (5) Versioning as a scientific principal. (6) Licensing to realign scientific intellectual property with longstanding scientific norms (Reproducible Research Standard). [verbatim from her slide] Victoria stresses the need to get past the hurdles copyright puts in the way.
Q: Are you a pessimist?
A: I’m an optimist. The scientific community is aware of these issues and is addressing them.
Q: Do we need an IRS for the peer review process?
A: Even just the possibility that someone could look at your code and data is enough to make scientists very aware of what they’re doing. I don’t advocate code checking as part of peer review because it takes too long. Instead, throw your paper out into the public while it’s still being reviewed and let other scientists have at it.
Q: [rick] Every age has lost more info than it has preserved. This is not a new problem. Every archivist from the beginning of time has had to cope with this.
Jason Baron of the National Archives (who is not speaking officially) points to the volume of data the National Archives (NARA) has to deal with. E.g., in 2001, 32 million emails were transferred to NARA; in 2009, 250+ million were. He predicts there will be a billion presidential emails held at NARA by 2017. The first lawsuit over email was filed in 1989 (email=PROFS). Right now, the official policy of 300 govt agencies is to print email out for archiving. We can no longer deal with the info flow with manual processes. Processing of printed pages occurs when there’s a lawsuit or a FOIA request. Jason is pushing on the value of search as a way of encouraging systematic intake of digital records. He dreams of search algorithms that retrieve all relevant materials. There are clustering algorithms emerging within law that hold hope. He also wants to retrieve docs other than via key words. Visual analytics can help.
There are three languages we need: Legal, Records Management, and IT. How do we make the old ways work in the new? We need both new filtering techniques, but also traditional notions of appraisal. “The neutral archivist may serve as an unbiased resource for the filtering of information in an increasingly partisan (untrustworthy) world” [from the slide].
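The clustering idea Baron mentions rests on measuring document similarity. Here is a toy, stdlib-only sketch of one common building block, cosine similarity over word counts; the “emails” are invented for illustration:

```python
# Toy sketch of similarity-based grouping: score documents by word overlap
# so related records surface together, instead of relying on exact keyword
# hits. Pure stdlib; the documents are invented.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "email1": "budget meeting schedule budget review",
    "email2": "review budget figures before the meeting",
    "email3": "softball game friday afternoon",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

# The two budget emails score as far more similar to each other
# than either does to the softball email.
print(cosine(vectors["email1"], vectors["email2"]))
print(cosine(vectors["email1"], vectors["email3"]))
```

Real e-discovery systems use far more sophisticated models, but the principle is the same: similarity, not exact keywords, drives the grouping.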
July 19, 2009
A friend asked me to post an explanation of what I meant when I said at PDF09 that “transparency is the new objectivity.” First, I apologize for the cliché of “x is the new y.” Second, what I meant is that transparency is now fulfilling some of objectivity’s old role in the ecology of knowledge.
Outside of the realm of science, objectivity is discredited these days as anything but an aspiration, and even that aspiration is looking pretty sketchy. The problem with objectivity is that it tries to show what the world looks like from no particular point of view, which is like wondering what something looks like in the dark. Nevertheless, objectivity — even as an unattainable goal — served an important role in how we came to trust information, and in the economics of newspapers in the modern age.
You can see this in newspapers’ early push-back against blogging. We were told that bloggers have agendas, whereas journalists give us objective information. Of course, if you don’t think objectivity is possible, then you think that the claim of objectivity is actually hiding the biases that inevitably are there. That’s what I meant when, during a bloggers press conference at the 2004 Democratic National Convention, I asked Pulitzer-prize winning journalist Walter Mears whom he was supporting for president. He replied (paraphrasing!), “If I tell you, how can you trust what I write?,” to which I replied that if he doesn’t tell us, how can we trust what he blogs?
So, that’s one sense in which transparency is the new objectivity. What we used to believe because we thought the author was objective we now believe because we can see through the author’s writings to the sources and values that brought her to that position. Transparency gives the reader information by which she can undo some of the unintended effects of the ever-present biases. Transparency brings us to reliability the way objectivity used to.
This change is, well, epochal.
Objectivity used to be presented as a stopping point for belief: If the source is objective and well-informed, you have sufficient reason to believe. The objectivity of the reporter is a stopping point for the reader’s inquiry. That was part of high-end newspapers’ claimed value: You can’t believe what you read in a slanted tabloid, but our news is objective, so your inquiry can come to rest here. Credentialing systems had the same basic rhythm: You can stop your quest once you come to a credentialed authority who says, “I got this. You can believe it.” End of story.
We thought that that was how knowledge works, but it turns out that it’s really just how paper works. Transparency prospers in a linked medium, for you can literally see the connections between the final draft’s claims and the ideas that informed it. Paper, on the other hand, sucks at links. You can look up the footnote, but that’s an expensive, time-consuming activity more likely to result in failure than success. So, during the Age of Paper, we got used to the idea that authority comes in the form of a stop sign: You’ve reached a source whose reliability requires no further inquiry.
In the Age of Links, we still use credentials and rely on authorities. Those are indispensable ways of scaling knowledge, that is, letting us know more than any one of us could authenticate on our own. But, increasingly, credentials and authority work best for vouchsafing commoditized knowledge, the stuff that’s settled and not worth arguing about. At the edges of knowledge — in the analysis and contextualization that journalists nowadays tell us is their real value — we want, need, can have, and expect transparency. Transparency puts within the report itself a way for us to see what assumptions and values may have shaped it, and lets us see the arguments that the report resolved one way and not another. Transparency — the embedded ability to see through the published draft — often gives us more reason to believe a report than the claim of objectivity did.
In fact, transparency subsumes objectivity. Anyone who claims objectivity should be willing to back that assertion up by letting us look at sources, disagreements, and the personal assumptions and values supposedly bracketed out of the report.
Objectivity without transparency increasingly will look like arrogance. And then foolishness. Why should we trust what one person — with the best of intentions — insists is true when we instead could have a web of evidence, ideas, and argument?
In short: Objectivity is a trust mechanism you rely on when your medium can’t do links. Now our medium can.
June 24, 2009
Nature News is twittering the Apollo 11 moon landing as a 40th-year commemoration. More here.
June 9, 2009
Lewis Hyde is giving a Berkman talk about the book he’s working on. The book is about the ownership of art and ideas, and argues that they should lie in a cultural commons, rather than be treated as property.
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellchecker. Mangling other people’s ideas and words. You are warned, people.
Lewis begins by talking about what a commons is. The term comes from medieval property ideas, and Lewis thinks of commons as a kind of property. He asks the group for a definition of property. Suggestions from the audience: “Exclusive rights.” “Anything I can use and have some degree of control over, not necessarily exclusively.” Lewis says that a 1900 dictionary defines property as that over which one has “rights of action.” Property is a bundle of rights of action. Lewis likes this definition because it includes human actors. Blackstone defines property rights in maximalist terms: the right to exclude the entire universe. Scalia also thinks property is the right to exclude. Lewis thinks the right to exclude is one of the bundle, not the whole thing. This is because, he says, he’s interested in commons. (He notes that in medieval times, “common” could be used as a verb. E.g., “a man may commons in the forest.”)
Lewis talks about Hardin’s “The Tragedy of the Commons” essay. In fact, traditionally commons had governance rules to prevent the destruction of the commons’ asset, including the right of exclusion. “Commons were in fact not tragic. They lasted for millennia in Europe. Not tragic because they were rule-governed and stinted.” Why has the phrase “The tragedy of the commons” persisted? In part, because the phrase is catchy. In part because Hardin proposed it during the Cold War and it was taken as showing that common-ism doesn’t work.
There used to be an annual ritual of “beating the bounds,” to ward off any gradual encroachment on the commons. “These were convivial affairs.” Lewis wonders if there are ways we can recover this resistance to encroachment.
Applied to the cultural realm, Lewis thinks cultural products are by nature in a commons. In the 18th century you get the idea that we could own poems, novels, etc. Until then, people thought of property as applying only to land. If something is not excludable, there’s no property in it. Many argued in the 18th century that therefore artistic works can’t be property. (Lewis recommends Terry Fisher’s article on philosophies of property. Terry points to four: labor, moral rights, commercial utilitarianism, and civic utilitarianism.)
The first copyright law was in 1710 (Statute of Anne). By giving authors and publishers rights, it removed the “in perpetuity” of the crown’s monopolistic grants. It also created the public domain by creating a clear limit on the term of ownership: After 14 years, it enters the public domain. It’s as if the commons is the default state, says Lewis.
Jamie Boyle talks about the “second enclosure,” in which everything is copyrighted by default and the term is extended. The second enclosure is an enclosure of the mind, says Boyle. Lewis now thinks there might be a third enclosure: the enclosure of the wilderness of the mind. Lewis agrees that it makes sense to let the creator of a work, say a novel, get rewarded for it. “I wrote it, so it’s mine.” But, asks Lewis, what does the “I” mean? What is the self? He cites a 12th century Buddhist: “We study the self to forget the self.” To forget the self is to wake up to the world around you. Creativity comes out of self-abnegation. To get to something truly new, you have to keep a door open to the unknown. We usually think that the outside of owned property is the public domain. But that’s a domesticated sphere, things we are familiar with. There’s an old tradition that during the period of maturation, you have to leave the known world, go away from where instruction is given, and become familiar with your ignorance. (Lewis says he’s drawing on Thoreau.)
He takes an example from Jonathan Zittrain. When the Apple II came out, there was a spurt in sales because the first spreadsheet emerged, something that had not been expected. If you want a generative Internet, you have to be careful about what you lock down. Another example: In the 1980s, San Diego cell biologists patented a sequence of amino acids. They didn’t know its biological purpose. Ten years later, other researchers think that that sequence blocks blood to tumors. The patent owners sued the researchers. The patent gums up the system. Exploratory science goes into the unknown. “To enclose wilderness means giving property rights in areas where we as yet have no understanding what’s happening.” Lewis adds: “This makes no sense.” Lewis would like us to restore the idea that there are things that are unowned.
Emblematic of the third enclosure is silence. John Cage in 1952 came to Harvard to see/hear a completely soundproofed room. But Cage could hear a low rumbling and high whining. The low rumbling is the sound of your blood and the high whining is the sound of your nervous system. Silence for Cage meant not the absence of sound but non-intention. He composed “4′33″,” which is a stretch of silence. The audience hears the ambient noise. In 2002 a rock group called the Planets included a minute of silence on an album. As a joke/homage, they credited it to Cage. The royalty-collecting societies started to send checks to Cage’s publisher. The publisher sued for copyright infringement on moral rights grounds (i.e., misattribution). They settled. But Cage held a Buddhist-like view of artistic creation. He tried to remove the self. A lot of copyright law assumes the work contains the imprint of the author’s personality. That’s one of the reasons we give a copyright. But those laws can get in the way of our ability to live in the wilderness, i.e., the third enclosure. How do you become a creator in a world in which scientists can patent unknown sequences and silence can be copyrighted?
Q: Maybe part of the problem in defending the commons is that we say we’re defending freedom, not as in free beer. Fighting for free beer is more compelling than fighting for free speech.
A: Beating the bounds was a fun event. So, yes, people have to want to do this.
Q: [me] How do we counter the fairness argument: If I did it, I ought to get the reward. How do we respond to that?
A: It’s hard to do this in political debate because it’s a long argument. I raise the question of the “I”: To what extent is my contribution really from me? With cultural works, you’re working in a vast sea of existing material. What you create is not entirely yours. Even if it becomes popular and useful, it’s other people who made it so. You can also point to the utilitarian consequences: The public interest is advanced by enabling things to enter the public domain.
Q: [jason] You’re making a creativity defense, i.e., that the commons is generative. But, if we take Cage or Thoreau to heart and say that true creativity consists of transcending the self, could we say that that leads to saying all works should be owned, so that you’re forced to create something new?
A: The puzzle is how much you can actually go to the wilderness. You can face it, but there’s no way to escape the world you come out of. Thoreau has The Iliad with him. There’s no way to escape the known. You always work from materials you’ve collected elsewhere.
Q: [ethanz] What’s so bad about private property? You’re hearkening back to a romantic conception that worked for a very small set of people. We’ve got an enormous amount of development based on increasingly strong enclosure movements. Those movements have given us a great deal of what we love. Despite the first and second enclosures, creativity seems not to have been much hindered. Why should we worry about the third enclosure? Couldn’t we say that you’re attempting to protect and defend something that most of us have not experienced? How do we know that your romantic vision is superior to the world we’re interacting with?
A: I’m not against private property. The question is always where the lines should be drawn. I think we’ve extended the right to exclude too far. Yes, the world is quite creative. But we don’t know what we’re missing. With the enclosing of wilderness, we’re enclosing that which we don’t know about. Researchers are reluctant to do certain kinds of work, for fear of being sued.
Ethan: My diabetes medicine — recombinant DNA — exists because Eli Lilly worked within enclosures. How do we know we would have made the same progress if those enclosures weren’t there?
A: Let’s leave that hanging as a question. It’s a good question. You’re right that the existing dominant system has produced remarkable results.
Q: Michael Heller in The Gridlock Economy goes through the economic models that explain what we lose by locking stuff down. What’s the cultural loss?
A: Lessig and others write books about this…
May 18, 2009
At first sight, the images at the Nano GigaPan blog look like fairly ordinary electron microscope photos. But notice the zoom button.
Here’s an ant. Here’s some blood and hair.
May 7, 2009
My interview with Stephen Wolfram about WolframAlpha is now available. Some other me-based resources:
The unedited version weighs in at a full 55 minutes. The edited version will spare you some of my throat-clearing, and some dumb questions.
A post about what I think the significance of WolframAlpha will be.
Live blog of Wolfram’s presentation at Harvard.
Wolfram’s presentation at Harvard.
May 4, 2009
The Independent calls WolframAlpha “An invention that could change the Internet forever.” It concludes: “Wolfram Alpha has the potential to become one of the biggest names on the planet.”
Nova Spivak, a smart Semantic Web guy, says it could be as important as Google.
Ton Zijlstra, on the other hand, who knows a thing or two about knowledge and knowledge management, feels like it’s been overhyped. After seeing the video of Wolfram talking at Harvard, Ton writes:
No crawling? Centralized database, adding data from partners? Manual updating? Adding is tricky? Manually adding metadata (curating)? For all its coolness on the front of WolframAlpha, on the back end this sounds like it’s the mechanical turk of the semantic web.
(“The mechanical turk of the semantic web.” Great phrase. And while I’m in parentheses, ReadWriteWeb has useful screenshots of WolframAlpha, and here’s my unedited 55-minute interview with Wolfram.)
I am somewhere in between, definitely over in the Enthusiastic half of the field. I think WolframAlpha [WA] will become a standard part of the Internet’s tool set, but is not transformative.
WA works because it’s curated. Real human beings decide what topics to include (geography but not 6 Degrees of Courtney Love), which data to ingest, what metadata is worth capturing, how that metadata is interrelated (= an ontology), which correlations to present to the user when she queries it (daily tonnage of fish captured by the French compared to daily production of garbage in NYC), and how that information should be presented. Wolfram insists that an expert be present in each data stream to ensure the quality of the data. Given all that human intervention, WA then performs its algorithmic computations … which are themselves curated. WA is as curated as an almanac.
Curation is a source of its strength. It increases the reliability of the information, it enables the computations, and it lets the results pages present interesting and relevant information far beyond the simple factual answer to the question. The richness of those pages will be a big factor in the site’s success.
Curation is also WA’s limitation. If it stays purely curated, without areas in which the Big Anyone can contribute, it won’t be able to grow at Internet speeds. Someone with a good idea — provide info on meds and interactions, or add recipes so ingredients can be mashed up with nutritional and ecological info — will have to suggest it to WolframAlpha, Inc. and hope they take it up. (You could do this sorta kinda through the API, but not get the scaling effects of actually adding data to the system.) And WA will suffer from the perspectival problems inevitable in all curated systems: WA reflects Stephen Wolfram’s interests and perspective. It covers what he thinks is interesting. It covers it from his point of view. It will have to make decisions on topics for which there are no good answers: Is Pluto a planet? Does Scientology go on the list of religions? Does the page on rabbits include nutritional information about rabbit meat? (That, by the way, was Wolfram’s example in my interview of him. If you look at the site from Europe, a “rabbit” query does include the nutritional info, but not if you log in from a US IP address.) But WA doesn’t have to scale up to Internet Supersize to be enormously useful.
So, given those strengths and limitations, how important is WA?
Once people figure out what types of questions it’s good at, I think it will become a standard part of our tools, and for some areas of inquiry, it may be indispensable. I don’t know those areas well enough to give an example that will hold up, but I can imagine WA becoming the first place geneticists go when they have a question about a gene sequence or chemists who want to know about a molecule. I think it is likely to be so useful within particular fields that it becomes the standard place to look first…Like IMDB.com for movies, except for broad, multiple fields, with the ability to cross-compute.
But more broadly, is WA the next Google? Does it transform the Internet?
I don’t think so. Its computational abilities mean it does something not currently done (or not done well enough for a crowd of users), and the aesthetics of its responses make it quite accessible. But how many computational questions do you have a day? If you want to know how many tons of fish France catches, WA will work as an almanac. But that’s not transformational. If you want to know how many tons divided by the average weight of a French person, WA is for you. But the computational uses that are distinctive of WA and for which WA will frequently be an astounding tool are not frequent enough for WA to be transformational on the order of a Google or Wikipedia.
There are at least two other ways it could be transformational, however.
First, its biggest effect may be on metadata. If WA takes off, as I suspect it will, people and organizations will want to get their data into it. But to contribute their data, they will have to put it into WA’s metadata schemas. Those schemas then become a standard way we organize data. WA could be the killer app of the Semantic Web … the app that gives people both a motive for putting their data into ontologies and a standardized set of ontologies that makes it easy to do so.
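What “putting data into WA’s schemas” might look like in practice is, at bottom, field-name normalization into a shared vocabulary. A hypothetical sketch, riffing on the fish-production example elsewhere in this post (the schema, field names, and numbers are all invented for illustration):

```python
# Hypothetical sketch of mapping a local dataset into a shared metadata
# schema: the normalization contributors would have to do before a curated
# engine could compute over their data. All names and numbers invented.
SHARED_SCHEMA = {"country", "year", "fish_production_tonnes"}

local_rows = [
    {"nation": "France", "yr": 2007, "catch_mt": 511000},
    {"nation": "Poland", "yr": 2007, "catch_mt": 191000},
]
FIELD_MAP = {"nation": "country", "yr": "year", "catch_mt": "fish_production_tonnes"}

normalized = [{FIELD_MAP[k]: v for k, v in row.items()} for row in local_rows]

# Once everyone uses the same field names, cross-dataset computation
# ("france fish production vs. poland") becomes a simple lookup.
assert all(set(row) == SHARED_SCHEMA for row in normalized)
print(normalized[0]["country"], normalized[0]["fish_production_tonnes"])
```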
Second, a robust computational engine with access to a very wide array of data is a new idea on the Internet. (Ok, nothing is new. But WA is going to bring this idea to mainstream awareness.) That transforms our expectations, just as Wikipedia is important not just because it’s a great encyclopedia but because it proved the power of collaborative crowds. But, WA’s lesson — there’s more that can be computed than we ever imagined — isn’t as counter-intuitive as Wikipedia’s, so it is not as apple-cart-upsetting, so it’s not as transformational. Our cultural reaction to Wikipedia is to be amazed by what we’ve done. With WA, we are likely to be amazed by what Wolfram has done.
That is the final reason why I think WA is not likely to be as big a deal as Google or Wikipedia, and I say this while being enthusiastic — wowed, even — about WA. WA’s big benefit is that it answers questions authoritatively. WA nails facts down. (Please take the discussion about facts in a postmodern age into the comments section. Thank you.) It thus ends conversation. Google and Wikipedia aim at continuing and even provoking conversation. They are rich with links and pointers. Even as Wikipedia provides a narrative that it hopes is reliable, it takes every opportunity to get you to go to a new page. WA does have links — including links to Wikipedia — but most are hidden one click below the surface. So, the distinction I’m drawing is far from absolute. Nevertheless, it seems right to me: WA is designed to get you out of a state of doubt by showing you a simple, accurate, reliable, true answer to your question. That’s an important service, but answers can be dead-ends on the Web: you get your answer and get off. WA as question-answerer bookends WA’s curated creation process: A relatively (not totally) closed process that has a great deal of value, but keeps it from the participatory model that generally has had the biggest effects on the Net.
Providing solid, reliable answers to difficult questions is hugely valuable. WolframAlpha’s approach is ambitious and brilliant. WolframAlpha is a genius. But that’s not enough to fundamentally alter the Net.
Nevertheless, I am wowed.
April 28, 2009
Stephen Wolfram is giving a talk at Harvard/Berkman about his WolframAlpha site, which will launch in May. Aim: “Find a way to make computable the systematic knowledge we’ve accumulated.” The two big projects he’s worked on have made this possible. Mathematica (he’s worked on it for 23 yrs) makes it possible to do complex math and symbolic language manipulation. A New Kind of Science (NKS) has shown that it’s possible to understand much about the world computationally, often with very simple rules. So, WA uses NKS principles and the Mathematica engine. He says he’s in this project for the long term.
NOTE: Live-blogging. Posted without re-reading. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellchecker. Mangling other people’s ideas and words. You are warned, people.
You type in a question and you get back answers. You can type in math and get back plots, etc. Type in “gdp france” and get back the answer, plus a graph of the history of France’s GDP.
“GDP of france / italy”: The GDP of France divided by the GDP of Italy
“internet users in europe” shows a histogram, a list of the highest and lowest, etc.
“Weather in Lexington, MA”; “Weather lexington,ma 11/17/92”; “Weather lexington, MA moscow” shows a comparison of weather across the two locations.
“5 miles/sec” returns useful conversions and comparisons.
“$17/hr” converts to per week, per month, etc., plus conversion to other currencies.
“4000 words” gives a list of typical typing speeds, the length in characters, etc.
“333 gm gold” gives the mass, the commodity price, the heat capacity, etc.
“H2SO4” gives an illustration of the molecule, as well as the expected info about mass, etc.
“Caffeine mol wt/ water” gives a result of the molecular weights divided.
“decane 2 atm 50 C” shows what decane is like at two atmospheres and at 50 C, e.g., phase, density, boiling point, etc.
“LDL 180”: Where your cholesterol level is against the rest of the population.
“life expectancy male age 40 italy”: distribution of the survival curve, and the history of that life expectancy over time. Adding “1933” adds specificity.
“5’8″ 160 lbs”: Where that falls in the distribution of body mass index
“ATTGTATACTAA”: Where that sequence matches the human genome
“MSFT”: Real time Microsoft quote and other financial performance info. “MSFT sun” assumes that “sun” refers to stock info about Sun Microsystems.
“ARM 20 yr mortgage”: monthly payment tables, etc. Lets you input the loan amount.
“D# minor”: Musical notation, plays the D# minor scale
“red + yellow”: Color swatch, html notation
“www.apple.com”: Info about Apple, history of page views
“lawyers”: Number employed, average wage
“France fish production”: How many metric tons produced, pounds per second, which is 1/5 the rate trash is produced in NYC
“france fish production vs. poland”: charts and diagrams
“2 c orange juice”: nutritional info
“2 c orange juice + 1 slice cheddar cheese”: nutritional label
“a__a__n”: English words that match
“alan turing kurt godel”: Table of info about them
“weather princeton, day when kurt godel died”: the answer
“uncle’s uncle’s grandson’s grandson”: family tree, probability of those two sharing genetic material
“5th largest country in europe”
“gdp vs. railway length in europe”:
“hurricane andrew”: Data, map
“andrew”: Popularity of the name, diagrammed.
“president of brazil in 1922”
“tide NYC 11/5/2015”
“ten flips 4 heads”: probability
“3,7,15,31,63…”: Figures out and plots next in the sequence and possible generating function
“4,1 knot”: diagram of knot
“next total solar eclipse chicago”: Next one visible in Chicago
“ISS”: International Space Station info and map
It lets you select alternatives in case of ambiguities.
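The “3,7,15,31,63…” demo hints at the kind of computation involved. Here is a toy sketch of my own (not WolframAlpha’s actual algorithm, which presumably tests many candidate forms) that checks one simple hypothesis, a(n) = 2^(n+1) − 1, and predicts the next term:

```python
# Toy illustration (mine, not Wolfram's): test whether a sequence matches the
# closed form a(n) = 2^(n+1) - 1 and, if so, predict the next term.
def next_term(seq):
    # Check the hypothesis a(n) = 2^(n+1) - 1 for n = 1, 2, ...
    if seq and all(x == 2 ** (n + 2) - 1 for n, x in enumerate(seq)):
        return 2 ** (len(seq) + 2) - 1
    return None  # no match under this one simple hypothesis

print(next_term([3, 7, 15, 31, 63]))  # -> 127
```

A real system would search over a library of candidate generating functions rather than a single hard-coded one; this just makes the demo’s arithmetic concrete.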
“We’re trying to compute things.” We have tools that let us find things. But when you have a particular question, it’s unlikely that you’ll find that specific answer written down. WA therefore tries to compute answers. “The objective is to reach expert level knowledge across a very wide range of domains.”
Four big pieces to WA:
1. Data curation. WA has trillions of pieces of curated data, drawn from free and licensed sources. A partially human, partially automated system cleans it up and tries to correlate it. “A lot can be done automatically…At some point, you need a human domain expert in the middle of it.” There are people inside the company, and a network of others, who do the curation.
2. The algorithms. Take equations, methods, etc., from all over. “There are finite numbers of methods that have been discovered in the history of science.” There are 5-6 million lines of Mathematica code at work.
3. Linguistic analysis to understand the inputs. “There’s no manual, no documentation. You get to interact with it just the way you think about things.” They’re doing the opposite of traditional natural language processing, which usually tries to understand millions of pages. WA’s problem is mapping a relatively small set of short human inputs onto what the system knows about. NKS helps with this. It turns out that ambiguity is not nearly as big a problem as they had thought.
4. Automated presentation. What do you show people so they can cognitively grasp it? “Algorithmic presentation technology … tries to pick out what is important.” Mathematica has worked on “computational aesthetics” for years.
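Piece 3’s task — mapping short, fluff-laden inputs onto known entities — can be caricatured in a few lines. This is purely my sketch to illustrate the shape of the problem; the vocabularies and the real mapping machinery are of course vastly larger:

```python
# Toy caricature of step 3 (my sketch, not Wolfram's code): strip linguistic
# "fluff" from a short query and map what's left onto known domains/entities.
FLUFF = {"the", "of", "in", "what", "is", "please"}
ENTITIES = {"france": "Country", "msft": "StockTicker", "caffeine": "Chemical"}
PROPERTIES = {"gdp": "GDP", "weather": "Weather", "population": "Population"}

def interpret(query):
    words = [w.strip("?,.") for w in query.lower().split()]
    words = [w for w in words if w not in FLUFF]            # drop the fluff
    entity = next((ENTITIES[w] for w in words if w in ENTITIES), None)
    prop = next((PROPERTIES[w] for w in words if w in PROPERTIES), None)
    return (prop, entity)

print(interpret("What is the GDP of France?"))  # -> ('GDP', 'Country')
```

The point Wolfram makes — that short inputs to a system with a known ontology are much more tractable than open-ended text understanding — is visible even in this caricature: the target vocabulary, not the grammar, does the work.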
He says they have at least a reasonable start on about 90% of the shelves in a typical reference library.
Q: (Andy Oram) What do you do about inconsistencies in the data? We don’t know how inconsistent it was or what algorithms you used.
A: We give source info. “We’re trying to create an authoritative source for data.” We know about ranges of values; we’ll make that information available. “But by the time you have a lot of footnotes on a number, there’s not a lot you can do with that number.” “We do try to give footnotes.”
Q: How do you keep current?
A: Lots of people want to make their data available. We hope to make a streamlined, formalized way for people to contribute the data. We want to curate it so we can stand by it.
Q: [me] Openness? Of API, of metadata, of contributions of interesting comparisons, etc.
A: We’ll do a variety of levels of API. First, the presentation level: put WA output on your own pages. Second, the XML level, so people can mash it up. Third, individual results from the databases and from the computations. [He shows a first draft of the API.] You can get results as the symbolic expressions that Mathematica is based on. We hope to have a personalizable version. Metadata: when we open up our data repository mechanisms so people can contribute, some of our ontology will be exposed.
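For context on the XML level he describes: at the time of this talk the API was still a draft, but the Wolfram|Alpha query API as it later shipped (v2) takes an `input` and an `appid` parameter and returns XML “pods,” one per result section. A minimal sketch of building such a request (the `appid` value here is a placeholder, and no network call is made):

```python
# Sketch of the XML-level access described in the talk. The endpoint shown is
# the Wolfram|Alpha v2 query API as it later shipped; the talk predates it,
# so treat the details as illustrative. DEMO-APPID is a placeholder.
from urllib.parse import urlencode

def query_url(question, appid="DEMO-APPID"):
    params = urlencode({"input": question, "appid": appid})
    return "https://api.wolframalpha.com/v2/query?" + params

print(query_url("gdp france"))
# Fetching this URL with a real appid returns XML pods that can be
# mashed up, which is the second API level Wolfram mentions.
```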
Q: How about in areas where people disagree? If a new universe model comes out of Stanford, does someone at WolframAlpha have to say yes and put it in?
A: Yes
Q: How many people?
A: It’s been 150 for a long time. Now it’s 250. It’s probably going to be a thousand people.
Q: Who is this for?
A: It’s for expert knowledge for anyone who needs it.
Q: Business model?
A: The site will be free. Corporate sponsors will put ads on the side. We’re trying to figure out how to ingest vendor info when it’s relevant, and how to present it on the site. There will also be a professional version for people who are doing a lot of computation, want to put in their own data…
Q: Can you combine the medical and population databases to get the total mass of people in England?
A: We could integrate those databases, but we don’t have that now. We’re working on “splat pages” you get when it doesn’t work. It should tell you what it does know.
Q: What happens when there is no answer, e.g., 55th largest state in the US?
A: It says it doesn’t know.
Q: [eszter] For some data, there are agreed-upon sources. For some there aren’t. How do you choose sources?
A: That’s a key problem in doing data curation. “How do we do it? We try to do the best job we can.” Use experts. Assess. Compare. [This is a bigger issue than Wolfram apparently thinks where data models are political. E.g., Eszter Hargittai, who is sitting next to me, points out “How many Internet users are there?” is a highly controversial question.] We give info about what our sources are.
Q: Technologically, where do you want to focus in the future?
A: All 4 areas need to be pushed forward.
Q: How does this compare to the Semantic Web?
A: Had the Web already been semantically tagged, this product would have been far, far easier, although keep in mind that much of the data in WA comes from private databases. We have a sophisticated ontology. We didn’t create the ontology top-down; it’s mostly bottom-up. We have domains. We have ontologies for them. We merge them together. “I hope as we expose some of our data repository methods, it will make it easier to do some Semantic Web kind of things. People will be able to line data up.”
Q: When can we look at the formal specifications of these ontologies? When can we inject our own?
A: It’s all represented in clean Mathematica code. Knitting new knowledge into the system is tricky because our UI is natural language, which is messy. E.g., “There’s a chap who goes by the name Fifty Cent.” You have to be careful.
Q: What reference source tells you if Palestine exists…?
A: In cases like this, we say “Assuming Case A or B.” There are holes in the data. I’m hoping people will be motivated to fill them in. Then there’s the question of the extent to which we can build expert communities. We don’t know the best way to do this. Lots of interesting ideas.
Q: How about pop culture?
A: Pop culture info is much shallower computationally. (“Britney Spears” just gets her name, birthdate, and birthplace. No music, no photos, nothing about her genre, etc.) (“Meaning of life” does answer “42”)
Q: Compare with CYC? (A common sense reasoning system)
A: CYC deals with human reasoning. That’s not the best method for figuring out physics, etc. “We can do the non-human parts of reasoning really well.”
Q: [couldn’t hear the question]
A: The best way to debug it is not necessarily to inspect the code but to inspect the results. People reading code is less efficient than automated systems.
Q: Will it be integrated into Mathematica?
A: A future version will let you type WA data into Mathematica.
Q: How much work do you have to do on the NLP side? Your searches used a special lexicon…
A: We don’t know. We have a daily splat call to see what types of queries have failed. We’re pretty good at removing linguistic fluff. People drop the fluff pretty quickly after they’ve been using WA for a while.
Q: (free software foundation) How does this change the landscape for open access? There’s info in commercial journals…
A: When there’s a proprietary database, the challenge is making the right deals. People will not be able to take out of our system all the data that we put into it. We have yet to learn all of the issues that will come up.
Q: Privacy?
A: We’re dealing with public data. We could do people search, but, personally, I don’t want to.
Q: What would you think of a more Wikipedia-like model? Do you worry about a competitor making a wiki data that is completely open and grows faster?
A: That’d be great. Making WA is hard. It’s not just a matter of shoveling data in. Wikipedia is fantastic and I use it all the time, but it’s gone in particular directions. When you’re looking for systematic data there, even if people put in systematic data — e.g., 300 pages about chemicals — over the course of time, the data gets dirty. You can’t compute from it.
Q: How about if Google starts presenting your results in response to queries?
A: We’re looking for synergies. But we’re generating these results on the fly; they won’t get indexed.
Q: I wonder how universities will find a place for this.
A: Very interesting question. Generating hard data is hard and useful, although universities often prefer higher levels of synthesis and opinion. [Loose paraphrase!] Leibniz had this nailed: Take any human argument and find a way to mechanically compute it.
February 27, 2009