
July 6, 2013

[misc][2b2k] Why ontologies make me nervous

A few days ago there was a Twitter back and forth between two people I deeply respect: Dan Brickley [twitter:danbri] and Ed Summers [twitter:edsu]. It started with Ed responding to a tweet about a brief podcast I did with Kevin Ford [twitter:3windmills], who is on the team working on BibFrame:

After a couple of tweets, Dan tweeted the following:


There followed some agreement that it's often helpful to have apps driving the development of standards. (Kevin agrees with this, and points to BibFrame's process.) But, Dan's comment clarified my understanding of why ontologies make me nervous.

Over the past hundred years or so, we've come to a general recognition that all classifications and categorizations are tools, not representations of The Real Order. The periodic table of the elements is a useful way of organizing information, and manifests real relationships among the elements, but it is not the single "real" way the elements are arranged; if you're an economist or an industrialist, a chart that arranges the elements based on where they exist on our planet might be just as valid. Likewise, Linnaeus' classification scheme is useful and manifests some real relationships, but if you're a chef you might have a different way of carving up the animal kingdom. Linnaeus chose to organize species based upon visible differences — which might not be the "essential" differences — so that his scheme would be useful to scientists in the field. Although he was sometimes ambiguous about this, he seems not to have thought that he was discerning God's own order. Since Linnaeus we have become much more explicit in our understanding that how we classify depends on what we're trying to accomplish.

For example, a DTD (document type definition) typically is designed not to capture the eternal essence of some type of document, but to make the document more usable by systems that automate the document's production and processing. An industry might agree, for instance, on a DTD for parts catalogs specifying that a parts catalog must have an element called "part" and that a part must have a type, part number, length, height, weight, material, and description, and optionally can note whether it turns clockwise or counterclockwise. Each of these elements would have a standard name (e.g., "part_number," not "part#"). The result is a document that describes parts in a standard way so that a company can receive descriptions from all of its suppliers and automatically build a database of the parts it uses.
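
To make this concrete, here is a minimal sketch of what such a DTD might look like (the element names are illustrative, not drawn from any actual industry standard):

    <!-- Hypothetical parts-catalog DTD; all names are illustrative -->
    <!ELEMENT catalog     (part+)>
    <!ELEMENT part        (type, part_number, length, height, weight,
                           material, description, rotation?)>
    <!ELEMENT type        (#PCDATA)>
    <!ELEMENT part_number (#PCDATA)>
    <!ELEMENT length      (#PCDATA)>
    <!ELEMENT height      (#PCDATA)>
    <!ELEMENT weight      (#PCDATA)>
    <!ELEMENT material    (#PCDATA)>
    <!ELEMENT description (#PCDATA)>
    <!ELEMENT rotation    (#PCDATA)> <!-- optional: "clockwise" or "counterclockwise" -->

A supplier's catalog either validates against rules like these or it doesn't, which is what lets the receiving company build its parts database automatically.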

A DTD therefore is designed with an eye toward which properties are going to be useful. In some industries, it might include a term that captures how shiny the part is, but if it's a DTD for surgical equipment, that may not be relevant enough to include...although "sanitary_packaging" might be. Likewise, how quickly a bolt transfers heat might seem irrelevant, at least until NASA places an order. In this respect, DTDs are much like forms: You don't put a field for earlobe length in the college application form you're designing.

Ontologies are different. They can try to express the structure of a domain independent of any particular use, so that the widest variety of applications can share data, including apps from domains outside of the one that's been mapped. So, to use Dan's example, your ontology of jobs would note that jobs have employers and workers, that they may have a salary or other form of compensation, that they can be part-time, full-time, seasonal, etc. As an ontology designer, because you're trying to think beyond whatever applications you already can imagine, your aim (often, not always) is to provide the fullest possible set of slots just in case someone sometime needs that info. And you will carefully describe the relationships among the elements so that apps and researchers can use knowledge that is implicit in the model.
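
A jobs ontology, for instance, might contain a fragment like this one in RDF/XML (the vocabulary here is invented for illustration; it isn't Dan's actual schema or anyone else's):

    <!-- Invented jobs vocabulary; the example.org URIs are hypothetical -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
      <rdfs:Class rdf:about="http://example.org/jobs#Job"/>
      <rdf:Property rdf:about="http://example.org/jobs#employer">
        <rdfs:domain rdf:resource="http://example.org/jobs#Job"/>
        <rdfs:range  rdf:resource="http://example.org/jobs#Organization"/>
      </rdf:Property>
      <rdf:Property rdf:about="http://example.org/jobs#compensation">
        <rdfs:domain rdf:resource="http://example.org/jobs#Job"/>
      </rdf:Property>
      <!-- employmentType might take values such as "part-time" or "seasonal" -->
      <rdf:Property rdf:about="http://example.org/jobs#employmentType">
        <rdfs:domain rdf:resource="http://example.org/jobs#Job"/>
      </rdf:Property>
    </rdf:RDF>

Nothing in the model says which applications will use which properties; the relationships are simply there for whatever comes along.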

The line between DTDs and ontologies is fuzzy. Many ontologies are designed with classes of apps in mind, and some DTDs have tried to be hugely general purpose. My discomfort really comes down to a distrust of the concept of "knowledge representation" that underlies some ontologies (especially earlier ones). The complexity of the relationships among the parts of any domain will always outstrip our attempts to capture and codify those relationships. Further, knowledge cannot be fully represented because it isn't a thing apart from our continuous invention, discovery, and engagement with it.

What it comes down to is that if you talk about ontologies as knowledge representations I'll mutter something under my breath and change the topic.


September 1, 2010

OED goes paperless

The Oxford English Dictionary has announced that it will not print new editions on paper. Instead, there will be Web access and mobile apps.

According to the article in the Telegraph, “A team of 80 lexicographers has been working on the third edition of the OED – known as OED3 – for the past 21 years.”

It has been a long trajectory toward digitization for the OED. In the 1990s, the OED’s desire to produce a digital version (remember books on CD?) stimulated search engine innovation. To search the OED intelligently, the search engine would have to understand the structure of entries, so that it could distinguish the use of a word as that which is being defined, the use of it within a definition, the use of it within an illustrative quote, etc. SGML was perfect for this type of structure, and the Open Text SGML search engine came out of that research. Tim Bray [twitter:timbray] was one of the architects of that search engine, and went on to become one of the creators of XML. I’m going to assume that some of what Tim learned from the OED project was formative of his later thinking… (Disclosure: I worked at Open Text in the mid-1990s.)
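
The problem is easier to see with a bit of markup in front of you. In an entry tagged along these lines (an invented sketch, emphatically not the OED’s actual tag set), a search engine can tell a headword from a mere mention of the same word:

    <!-- Invented dictionary markup, not the OED's real DTD -->
    <entry>
      <headword>set</headword>
      <definition>To put something in a specified place. (Here "set"
        is just body text, not the word being defined.)</definition>
      <quotation>An illustrative quote might also use the word set.</quotation>
    </entry>

A search for entries whose headword is "set" then matches only the <headword> element, not every stray occurrence of the string.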

On the other hand, initially the OED didn’t want to attribute the origins of the word “blog” to Peter Merholz because he coined it in his own blog, and the OED would only accept print attributions. (See here, too.) The OED eventually got over this prejudice in favor of printed sources, however, and gave Peter proper credit.


July 24, 2009

A twisty path to Chrome in the enterprise

Despite the title of Andrew Conry-Murray’s article in InformationWeek — “Why Business IT Shouldn’t Shrug Off Chrome OS” — it’s on balance quite negative about the prospects for enterprises adopting Google’s upcoming operating system. Andrew argues that enterprises are going to want hybrid systems, Microsoft is already moving into the Cloud, Windows 7 will have been out for a year before Chrome is available, and it’d take a rock larger than the moon to move enterprises off their legacy applications. All good points. (The next article in the issue, by John Foley, is more positive about Chrome overall.)

A couple of days ago I heard a speech by Federal CTO Aneesh Chopra at the Open Government Innovations conference (#ogi to you Twitter buffs). It was fabulous. Aneesh — and he’s an informal enough speaker that I feel ok first-naming him — loves the Net and loves it for the right reasons. (“Right” of course means I agree with him.) The very first item on his list of priorities might be moon-sized when it comes to enterprise IT: Support open standards.

So, suppose the government requires contractors and employees to use applications that save content in open standards. In the document world, that means ODF. Now, ISO also approved a standard favored by (= written by) Microsoft, OOXML, that is far more complex and highly controversial. There is an open source plug-in for Word that converts Word documents to ODF (apparently Microsoft aided in its development), but that’s not quite native support. So, imagine the following scenario (which I am totally making up): The federal government not only requires that the docs it deals with be in open standard formats, but also switches to open source desktop apps in order to save money on license fees. (Vivek Kundra switched tens of thousands of DC employees to open source apps for this reason.) OOXML captures more of the details of a Word document, but ODF is a more workable standard, and it’s the format of the leading open source office apps. If the federal government were to do this, ODF would stand a chance of becoming the safe choice for interchanging documents; it’s the one that will always work. And in that case, enterprises might find Word to be over-featured and insufficiently ODF-native.

Now, all of this is pure pretend. And even if ODF were to become the dominant document standard, Microsoft could support it robustly, although that might mean that some of Word’s formatting niceties wouldn’t make the transition. Would business be ok with that? For creators, probably yes; it’d be good to be relieved of the expectation that you will be a document designer. For readers, no. We’ll continue to want highly formatted documents. But, then ODF + formatting specifications can produce quite respectably formatted docs, and that capability will only get better.
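
In fact, that is already how ODF works: the content names its styles, and the formatting lives in the style definitions. A radically simplified sketch (real ODF adds namespace declarations and much else):

    <!-- content.xml: what the text is -->
    <text:h text:style-name="Heading_1">Quarterly Report</text:h>
    <text:p text:style-name="Body">Revenue was up...</text:p>

    <!-- styles.xml: how it should look -->
    <style:style style:name="Heading_1" style:family="paragraph">
      <style:text-properties fo:font-size="16pt" fo:font-weight="bold"/>
    </style:style>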

So, how likely is my scenario — the feds demand ODF, driving some of the value out of Word, giving enterprises a reason to install free, lower-featured word processors, depriving Windows of one of its main claims on the enterprise’s heart and wallet? Small. But way higher than before we elected President Obama.


April 27, 2009

Encarta nostalgia: SGML and the Semantic Web

I’m not going to much mourn Encarta’s demise. Wikipedia is too big, too fast, too useful, too much fun. But Encarta was an ambitious project that broke some ground. So, pardon me if I sigh wistfully for a moment, and have a little moment of Encarta appreciation. Ahhhh.

When Encarta began, it was taken as validating this whole crazy CD-ROM approach to knowledge. It was searchable. It had multimedia. It let you do some slicing and dicing. It was breezy, at least compared to its hundred-pound competitors. But for my circle, the big news was below the surface: Encarta used SGML. It was, in fact, one of the first commercial SGML projects delivered into the hands of average customers.

SGML — Standard Generalized Markup Language — was the Semantic Web of its time: roughly the same arguments in its favor, roughly the same approach. This isn’t entirely accidental, for two reasons: 1. HTML is a form of SGML. 2. SGML got a lot of things right.

SGML was a way of specifying the structural elements of a document. In the case of an encyclopedia, elements might include volumes, articles, article titles, subheadings, body text, illustrations, captions, references, and see-also’s. You could also specify the metadata for each element: this illustration is of a dress, its topic is “clothing,” its era is 1920-1930. SGML also let you specify rules about what constitutes a valid instance of a document. For example, the rules might say that a valid encyclopedia article has to have one and only one title, it can have any number of illustrations, and every illustration has to have a caption. Once you have created a valid set of documents, you can then use your fancy-dancy computers to assemble views at will: Show me all the illustrations whose “topic” is “clothing” from the era 1920-1930. Etc. Incredibly useful.
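
In SGML’s declaration syntax, rules like those might look like this (a made-up sketch; the pairs of hyphens just say that both start-tags and end-tags are required):

    <!-- Invented encyclopedia rules: exactly one title, any number of
         illustrations, and every illustration must have a caption -->
    <!ELEMENT article      - - (title, subheading*, bodytext, illustration*)>
    <!ELEMENT illustration - - (image, caption)>
    <!ATTLIST illustration
              topic CDATA #IMPLIED
              era   CDATA #IMPLIED>

The clothing-illustrations-of-the-1920s query then becomes a simple match on the topic and era attributes.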

You haven’t heard about SGML (at least not much) for a few reasons.

First, industries that wanted to be able to share data wrapped themselves in knots trying to tie down the specific specifications for their documents. Endless and endlessly geeky arguments ensued about how exactly to encode a table of parts.

Second, outside of technical documentation designers, most people don’t think about documents in terms of their structural elements. Rather, they think of documents as a series of formatting decisions. SGML was not designed to capture format. From SGML’s point of view, the title of an article is simply an element called “title” and it’s up to someone else to decide whether titles are bolded, underlined, or printed in red. Now, let me hasten to add that people actually do think of documents in terms of their structure: We decide to make this piece of text bold because it’s the title. But we seem to be reluctant to note those decisions in terms of structure; we’d rather just drag-select the text and hit the bold-it key. That’s why Microsoft Word over the years has made “procedural markup” (drag-select-bold) more prominent in its UI than “declarative markup” (declare this paragraph to be a Title element, and then tell it how to format Titles).
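
The difference is easy to show in markup (the title element here is invented):

    <!-- Procedural: record the formatting decision itself -->
    <b>Maintenance Procedures</b>

    <!-- Declarative: record the structural role; how titles look is decided elsewhere -->
    <title>Maintenance Procedures</title>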

Third, HTML swept the world. HTML is a set of SGML elements and rules specified by a certain Sir Tim. Because HTML is designed not for encyclopedia articles or for shopping lists but for anything that might be put on the Web, it has highly generic elements that do not reflect the content of particular types of pages: It has six levels of headings, two types of lists, one type of image, etc. The SGML folks initially sneered at this. It looked “brain dead” to them. The documents were too generic. There wasn’t enough semantics: That something is a second-level heading expresses its place in the document’s structure, but not the fact that it’s the name of a repair procedure or a list of ingredients. And, HTML seemed too interested in capturing formatting. That’s why newer versions of HTML want you to use <em> (em=emphasis) instead of the original <i>: the old way had you making a formatting decision (“Italicize it”) rather than a structural one (“The role this point plays is that of being emphatic, which the browser should visually express in the way it feels is proper”).
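
The sneer in a nutshell: the same line of text carries very different amounts of semantics depending on the tag set (the repair-manual element is invented):

    <!-- Generic HTML: document structure, but no domain meaning -->
    <h2>Replacing the fuel filter</h2>

    <!-- A domain-specific SGML element: the same text, plus what it is -->
    <procedure-title>Replacing the fuel filter</procedure-title>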

The other side of the coin is that HTML is way way way easier to use than having to design and then follow a set of SGML design rules, with specific elements, for every different sort of document you want to create. Its simplicity meant that people actually succeeded at it. Furthermore, it was in the interest of the browsers to forgive all errors: If browser X rejects a page because it didn’t follow HTML’s rules, you would be driven to see if browser Y could display the page. If Y could, you’d consider X — not the page — to be broken. The browser economics favored sloppiness and forgiveness, neither of which were hallmarks of SGML’s discipline-based culture.

Now, as the great dialectical pendulum swings, the Semantic Web has arisen to remind us of the value of metadata. If it can avoid the perfectionism and discipline that left SGML as a tool for the few, it will add back in some of the smarts the loose ‘n’ low-hangin’ HTML usefully took out. As the name implies, the Semantic Web is more about expressing the structure of meaning and concepts in a field than about expressing the structure of documents. For an encyclopedia, you wouldn’t want to wait for the Semantic Web to create the entire web of meaning, because that web would have to be as wide as the topical coverage of the encyclopedia itself. You might instead want to come up with a set of standard document elements, perhaps applied somewhat loosely, with the ability to slather on rich layers of metadata, and then watch webs of semantics get spun. Which is pretty much exactly what we’re seeing at Wikipedia.

Meanwhile, Encarta remains an example — along with the Oxford English Dictionary and others — of the value of rigorously structured and metadated documents.



