Joho the Blog » Data in its untamed abundance gives rise to meaning

Data in its untamed abundance gives rise to meaning

Seb Schmoller points to a terrific article by Google’s Alon Halevy, Peter Norvig, and Fernando Pereira about two ways to get meaning out of information. Their example is machine translation of natural language: there is so much translated material available for computers to learn from that (they argue) learning directly from it works better than going up a level of abstraction and trying to categorize and conceptualize the language. Scale wins. Or, as the article says, “But invariably, simple models and a lot of data trump more elaborate models based on less data.”


They then use this to distinguish the Semantic Web from “Semantic Interpretation.” The latter “deals with imprecise, ambiguous natural languages,” as opposed to aiming at data and application interoperability. “The problem of semantic interpretation remains: using a Semantic Web formalism just means that semantic interpretation must be done on shorter strings that fall between angle brackets.” Oh snap! “What we need are methods to infer relationships between column headers or mentions of entities in the world.” “Web-scale data” to the rescue! This is basically the same problem as translating from one language to another, given a large enough corpus of translations: We have a Web-scale collection of tables with column headers and content, so we should be able to algorithmically recognize clustering concordances of meaning.
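To make the column-header idea concrete, here is a toy sketch in Python. It is not taken from the paper: the tables, the function names, and the 0.5 threshold are all invented for illustration, and the overlap measure (Jaccard similarity of the values seen under each header) is just one simple stand-in for whatever a Web-scale system would actually use. The point is only that two headers start to look related when the same values keep showing up beneath them.

```python
from collections import defaultdict
from itertools import combinations


def header_values(tables):
    """Collect the set of normalized cell values seen under each column header."""
    values = defaultdict(set)
    for table in tables:
        for header, cells in table.items():
            key = header.strip().lower()
            values[key].update(cell.strip().lower() for cell in cells)
    return values


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_headers(tables, threshold=0.5):
    """Return (header1, header2, score) for header pairs whose values overlap.

    The threshold is an arbitrary illustrative choice, not a recommended value.
    """
    values = header_values(tables)
    pairs = []
    for h1, h2 in combinations(sorted(values), 2):
        score = jaccard(values[h1], values[h2])
        if score >= threshold:
            pairs.append((h1, h2, score))
    # Highest-scoring pairs first.
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical mini "corpus" of tables; each table maps a header to its column.
TABLES = [
    {"Company": ["HP", "IBM", "Google"],
     "City": ["Palo Alto", "Armonk", "Mountain View"]},
    {"Firm": ["IBM", "Google", "HP"],
     "Founded": ["1911", "1998", "1939"]},
    {"Company Name": ["Google", "HP", "Apple"],
     "Location": ["Mountain View", "Palo Alto", "Cupertino"]},
]

if __name__ == "__main__":
    for h1, h2, score in related_headers(TABLES):
        print(f"{h1!r} ~ {h2!r}: {score:.2f}")
```

With three tiny tables this proves nothing, of course. The article's claim is that the same kind of crude counting becomes informative once the corpus is the whole Web.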

I’m not doing the paper justice because I can’t, although it’s written quite clearly. But I find it fascinating.

One Response to “Data in its untamed abundance gives rise to meaning”

  1. I read the article carefully. I admit it is fascinating. First, its title, “The Unreasonable Effectiveness of Data,” is a fully intentional, admitted reference to “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” by Eugene Wigner. Nice play with titles, but it is a bit misleading. The role of mathematics in natural science is just the opposite of the role of pure data in human knowledge. I could elaborate on this at length, but what strikes me deeply is another thing.

    It seems that the authors dismiss the message of Semantic Web advocates, among them Tim Berners-Lee, for reasons that are not very clear.
    Let me cite: “(…) But even if we have a formal Semantic Web “Company Name” attribute, we can’t expect to have an ontology for every possible value of this attribute. For example, we can’t know for sure what company the string “Joe’s Pizza” refers to because hundreds of businesses have that name and new ones are being added all the time. We also can’t always tell which business is meant by the string HP.(…)”

    Well, in all Semantic Web proposals we do not care what “Joe’s Pizza” or “HP” means!
    We care about one thing: that “Joe’s Pizza” is The Company Name. We do not need an ontology for the name itself; we need it for the different potential “Company Name” concepts!

    Not everything there is plainly bad, though.

    What I liked was the call “So, follow the data”: in some vague sense they reaffirmed the principle of least action of Tim Berners-Lee.

    I must also admit that the distinction between “Semantic Web” and “Semantic Interpretation” is very convincing, and it is another good part of the article.

    Finally, I often think that Google would be The One who could push the Semantic Web forward. And for some reason they don’t. They could simply cry out loudly: “Hi webmasters around the world – use RDF or microformats to mark your contact/author data and we will use it in our search engine!”

    Conspiracy theories aside, there is something in this article, written by Google researchers, that justifies their unwillingness to start the ball rolling….
