Joho the Blog: xml Archives

February 20, 2011

[2b2k] Public data and metadata, Google style

I’m finding Google Labs’ Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

  • You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]

  • For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.

  • If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.

  • You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.

  • Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.
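Put together, the pieces above come out as XML along roughly these lines. (This is a loose sketch based on my reading of the tutorial, not guaranteed-valid DSPL; check the element names against Google’s own documentation before using them.)

```xml
<!-- Rough sketch only: element names approximated from the tutorial -->
<concepts>
  <concept id="pop">
    <info>
      <name><value>Population</value></name>
      <description>
        <value>The number of human beings in a geographic area</value>
      </description>
    </info>
    <type ref="integer"/>
  </concept>
</concepts>

<slices>
  <slice id="countries_slice">
    <dimension concept="country"/>
    <dimension concept="year"/>
    <metric concept="pop"/>
    <table ref="countries_slice_table"/>
  </slice>
</slices>
```

The slice’s table reference points at the CSV-backed table of country/year/population rows described above, using the concept IDs as column names.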

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.

Overall, I think this is a good thing. But:

  • While Google is making its formats public, and even its canonical definitions are downloadable, DSPL is “fully open” for use, but fully Google’s to define. Having the 800-lb gorilla defining the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.

  • Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.

  • Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element?

  • Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)

Overall, I think it’s a good step forward.

1 Comment »

September 1, 2010

OED goes paperless

The Oxford English Dictionary has announced that it will not print new editions on paper. Instead, there will be Web access and mobile apps.

According to the article in the Telegraph, “A team of 80 lexicographers has been working on the third edition of the OED – known as OED3 – for the past 21 years.”

It has been a long trajectory toward digitization for the OED. In the 1990s, the OED’s desire to produce a digital version (remember books on CD?) stimulated search engine innovation. To search the OED intelligently, the search engine would have to understand the structure of entries, so that it could distinguish the use of a word as that which is being defined, the use of it within a definition, the use of it within an illustrative quote, etc. SGML was perfect for this type of structure, and the Open Text SGML search engine came out of that research. Tim Bray [twitter:timbray] was one of the architects of that search engine, and went on to become one of the creators of XML. I’m going to assume that some of what Tim learned from the OED project was formative of his later thinking… (Disclosure: I worked at Open Text in the mid-1990s.)

On the other hand, initially the OED didn’t want to attribute the origins of the word “blog” to Peter Merholz because he coined it in his own blog, and the OED would only accept print attributions. (See here, too.) The OED eventually got over this prejudice for printed sources, however, and gave Peter proper credit.

1 Comment »

July 2, 2010

Beginner-to-beginner: Parsing OPML, including level (= indents)

I am 99.9% confident that I’ve spent many hours trying to do in a complex way something that javascript (perhaps with jquery) lets us do with a single line of compressed code. But I couldn’t find that line. Besides, I’m a hobbyist, which means the puzzle is usually more fun than the solution. But this one drove me crazy.

I’m trying to read an OPML file. OPML is an instance of XML designed to record outlines. I found plenty of scripts online (javascript and php) that would read the contents, but none that would tell me the indentation level of each line. I’m sure there’s some simple way of figuring that out, but it beat the heck out of me. So I finally found some javascript by Vivin Paliath to which I could add a couple of lines (mainly by trial and error) to record indentation info.

Vivin’s script is, appropriately, recursive. That’s because outlines are arbitrarily nested, so you have to be able to call a “get the children” function on each line, each of which may have its own children. A recursive function calls itself, in this case until it reaches the end of a branch, at which point it pops up to the next sibling branch and carries on. You can easily record each branch in a flat list, but I also need to know how far down the limb we are. I find recursion to be extraordinarily difficult for my brain to follow, so more or less by trial and error I added a counter, and then tried to figure out when it needs to be reset to zero.

The way I have it set up, the tree walking function looks for XML elements called “outline,” because that’s the tag OPML uses. It records the text content of the line in the outline as the attribute “text.” (OPML also can encode stuff other than text, but I happen not to care about that.) At each line, it records the text and the level number in a global array. This is a rather silly way of recording the info, but that’s the easy part to fix.

The parseOpml function consists mainly of code from Phpied that opens an xml file and reads it into an xml document object that javascript knows how to work with. parseOpml then calls the walkTree function that does the actual parsing. The walkTree function takes as parameters the element from which it should start walking (which, in this case, is the root of the xml object itself) and the starting level of indentation (0). It wouldn’t be hard to add a parameter for the name of the element you want it to pay attention to (which, in this case, is hardcoded in walkTree as “outline”).

Now for the caveats: I’ve barely tested this. It may break under predictable circumstances. It is very likely wildly inefficient (which doesn’t matter for my petty uses). It will undoubtedly embarrass actual developers. But, maybe it will help some other hobbyist…

// globals
var treeArray = new Array(); // array to hold results
var glevel; // the indent level

function parseOpml(){
  // opens the xml file and calls the tree walker

  var fname = "/pathname/to/your/file.xml";

  // open the xml file (code from Phpied)
  var xmlDoc = null;
  if (window.ActiveXObject) {
    // code for IE
    xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
  } else if (document.implementation.createDocument) {
    // code for Mozilla, Firefox, Opera, etc.
    xmlDoc = document.implementation.createDocument("", "", null);
  } else {
    alert('Your browser cannot handle this script');
  }
  if (xmlDoc != null) {
    xmlDoc.async = false;
    xmlDoc.load(fname);
  }

  // call the tree walker
  treeArray.length = 0; // reset the global array
  walkTree(xmlDoc, 0);
}

//--------------- Recursive tree walker --------------

function walkTree(node, glevel){
  // walks the tree looking for elements tagged "outline"
  // Replace the instances of "outline" below to have it look for other tags
  // Original function from: Vivin Paliath. Thanks!
  // Modified to count indentation levels
  var n = "";

  if (node.hasChildNodes()) {
    node = node.firstChild;
    do {
      n = node.nodeName;
      // only record elements tagged "outline"
      if (n == "outline") {
        glevel++; // entering an outline: one level deeper
        // push the level and the text as a string into the global array
        treeArray.push(glevel + "=" + node.getAttribute("text"));
      }
      walkTree(node, glevel); // recurse until no children
      if (n == "outline") {
        glevel--; // done with this outline: back up one level
      }
      node = node.nextSibling; // move on to the next sibling
    } while (node);
  }
}
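For anyone who wants to see the level-counting logic in isolation, without the browser-specific XML loading, here’s a minimal, dependency-free sketch of the same idea that just scans the OPML text and tracks nesting with a counter. (The function name opmlLevels is mine, not from the script above, and regex-parsing XML is fragile; a real parser should use the DOM. This is just to illustrate the counting.)

```javascript
// A toy alternative: scan an OPML string for <outline> tags and
// record each one's "text" attribute with its nesting level.
// Hypothetical helper, for illustration only -- not a real XML parser.
function opmlLevels(opml) {
  var results = [];
  var level = 0;
  // matches an opening or self-closing <outline ...> tag, or </outline>
  var re = /<outline\b([^>]*?)(\/?)>|<\/outline>/g;
  var m;
  while ((m = re.exec(opml)) !== null) {
    if (m[0] === "</outline>") {
      level--; // closing tag: back up one level
    } else {
      var textMatch = m[1].match(/text="([^"]*)"/);
      results.push((level + 1) + "=" + (textMatch ? textMatch[1] : ""));
      if (m[2] !== "/") {
        level++; // opening (non-self-closing) tag: children are deeper
      }
    }
  }
  return results;
}
```

So an outline like `<outline text="a"><outline text="b"/></outline>` comes out as ["1=a", "2=b"], the same level=text strings the script above pushes into treeArray.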

1 Comment »

February 18, 2008

Tim Bray on the history of XML

Tim Bray has a terrific piece on the development of XML, now in its tenth year as an official standard. He focuses on the people, not on the technicalities of the standard.

It’s worth it just to re-read Tim’s words about Yuri Rubinsky, an SGML advocate of enormous energy and passion, without a mean bone in his body. Tim puts it better. Yuri died way too young, and I miss him.

It’s also worth it to learn the off-the-mainstage history of XML, of course.

* * *

Tim is a terrific writer. And I’m happy to say that we’ve been friends for a long time (which is somewhere between disclosure and bragging). But, the one thing that put me off in his piece was his providing physical descriptions. I assume they’re accurate, and he writes them with flair, but they struck me as irrelevant. Why does it help me to know that someone is burly and someone else is buxom?

And yet, it does seem to help.

But maybe it shouldn’t. Maybe it only seems to help.

As you can tell, I’m torn by this. I’ve occasionally briefly described people in things I’ve written. But I don’t feel quite right about it.
