Standards Archives - Joho the Blog

June 12, 2016

Beyond bricolage

In 1962, Claude Levi-Strauss brought the concept of bricolage into the anthropological and philosophical lexicons. It has to do with thinking with one’s hands, putting together new things by repurposing old things. It has since been applied to the Internet (including, apparently, by me, thanks to a tip from Rageboy). The term “bricolage” uncovers something important about the Net, but it also covers up something fundamental about the Net that has been growing even more important.

In The Savage Mind (relevant excerpt), CLS argued against the prevailing view that “primitive” peoples were unable to form abstract concepts. After showing that they often have extensive sets of concepts for flora and fauna, he maintains that these concepts go beyond what they pragmatically need to know:

…animals and plants are not known as a result of their usefulness; they are deemed to be useful or interesting because they are first of all known.

It may be objected that science of this kind can scarcely be of much practical effect. The answer to this is that its main purpose is not a practical one. It meets intellectual requirements rather than or instead of satisfying needs.

It meets, in short, a “demand for order.”

CLS wants us to see the mythopoeic world as being as rich, complex, and detailed as the modern scientific world, while still drawing the relevant distinctions. He uses bricolage as a bridge for our understanding. A bricoleur scavenges the environment for items that can be reused, getting their heft, trying them out, fitting them together and then giving them a twist. The mythopoeic mind engages in this bricolage rather than in the scientific or engineering enterprise of letting a desired project assemble the “raw materials.” A bricoleur has what s/he has and shapes projects around that. And what the bricoleur has generally has been fashioned for some other purpose.

Bricolage is a very useful concept for understanding the Internet’s mashup culture, its culture of re-use. It expresses the way in which one thing inspires another, and the power of re-contextualization. It evokes the sense of invention and play that is dominant on so much of the Net. While the Engineer is King (and, all too rarely, Queen) of this age, the bricoleurs have kept the Net weird, and bless them for it.

But there are at least two ways in which this metaphor is inapt.

First, traditional bricoleurs don’t have search engines that let them in a single glance look across the universe for what they need. Search engines let materials assemble around projects, rather than projects be shaped by the available materials. (Yes, this distinction is too strong. Yes, it’s more complicated than that. Still, there’s some truth to it.)

Second, we have been moving with some consistency toward a Net that at its topmost layers replicates the interoperability of its lower layers. Those low levels specify the rules — protocols — by which networks can join together to move data packets to their destinations. Those packets are designed so they can be correctly interpreted as data by any recipient applications. As you move up the stack, you start to lose this interoperability: Microsoft Word can’t make sense of the data output by Pages, and a graphics program may not be able to make sense of the layer information output by Photoshop.

But, over time, we’re getting better at this:

Applications add import and export services as the market requires. More consequentially, more and richer standards for interoperability continue to emerge, as they have from the very beginning: FTP, HTML, XML, Dublin Core, Schema.org, the many Semantic Web vocabularies, ontologies, and schema, etc.

More important, we are now taking steps to make sure that what we create is available for re-use in ways we have not imagined. We do this by working within standards and protocols. We do it by putting our work into the sphere of reusable items, whether that’s by applying a Creative Commons license, putting our work into a public archive, or even just paying attention to what will make our work more findable.

This is very different from the bricoleur’s world in which objects are designed for one use, and it takes the ingenuity of the bricoleur to find a new use for them.

This movement continues the initial work of the Internet. From the beginning the Net has been predicated on providing an environment with the fewest possible assumptions about how it will be used. The Net was designed to move anyone’s information no matter what it’s about, what it’s for, where it’s going, or who owns it. The higher levels of the stack are increasingly realizing that vision. The Net is thus more than ever becoming a universe of objects explicitly designed for reuse in unexpected ways. (An important corrective to this sunny point of view: Christian Sandvig’s brilliant description of how the Net has incrementally become designed for delivering video above all else.)

Insofar as we are explicitly creating works designed for unexpected reuse, the bricolage metaphor is flawed, as all metaphors are. It usefully highlights the “found” nature of so much of Internet culture. It puts into the shadows, however, the truly transformative movement we are now living through in which we are explicitly designing objects for uses that we cannot anticipate.


January 26, 2016

Oscars.JSON. Bad, bad JSON

Because I don’t actually enjoy watching football, during the Pats vs. Broncs game on Sunday I transposed the Oscar™ nominations into a JSON file. I did this very badly, but I did it. If you look at it, you’ll see just how badly I misunderstand JSON on some really basic levels.

But I posted it at GitHub™ where you can fix it if you care enough.

Why JSON™? Because it’s an easy format for inputting data into JavaScript™ or many other languages. It’s also human-readable, if you have a good brain for indents. (This is very different from having many indents in your brain, which is one reason I don’t particularly like to watch football™, even with the helmets and rules, etc.)

Anyway, JSON puts data into key:value™ pairs, and lets you nest them into sets. So, you might have a keyword such as “category” that would have values such as “Best Picture™” and “Supporting Actress™.” Within a category you might have a set of keywords such as “film_title” and “person” with the appropriate values.

JSON is such a popular way of packaging up data for transport over the Web™ that many (most? all?) major languages have built-in functions for transmuting it into data that the language can easily navigate.
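By way of illustration, here is a minimal sketch in Python of what a nominations file along those lines might look like, and how a script could walk it. The key names (“category,” “film_title,” “person”) follow the description above; the actual file on GitHub is laid out differently (and, as noted, badly).

import json

# A hypothetical, minimal structure along the lines described above;
# the real file on GitHub is organized differently.
raw = """
{
  "year": 2016,
  "categories": [
    {
      "category": "Best Picture",
      "nominees": [
        {"film_title": "Spotlight"},
        {"film_title": "The Big Short"}
      ]
    },
    {
      "category": "Supporting Actress",
      "nominees": [
        {"person": "Kate Winslet", "film_title": "Steve Jobs"}
      ]
    }
  ]
}
"""

data = json.loads(raw)  # parse the JSON text into plain Python dicts and lists

# Walk the nested structure, e.g., to print a bare-bones checklist.
for cat in data["categories"]:
    print(cat["category"])
    for nominee in cat["nominees"]:
        print("  [ ]", nominee.get("person", nominee["film_title"]))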

So, why bother putting the Oscar™ nomination info into JSON? In case someone wants to write an app that uses that info. For example, if you wanted to create your own Oscar™ score sheet or, to be honest, entry tickets for your office pool, you could write a little script and output it exactly as you’d like. (Or you could just google™ for someone else’s Oscar™ pool sheet.) (I also posted a terrible little PHP script™ that does just that.)

So, a pointless™ exercise™ in truly terrible JSON design™. You’re welcome™!


June 22, 2013

What I learned at LODLAM

On Wednesday and Thursday I went to the second LODLAM (linked open data for libraries, archives, and museums) unconference, in Montreal. I’d attended the first one in San Francisco two years ago, and this one was almost as exciting — “almost” because the first one had more of a new car smell to it. This is a sign of progress and by no means is a complaint. It’s a great conference.

But, because it was an unconference with up to eight simultaneous sessions, there was no possibility of any single human being getting a full overview. Instead, here are some overall impressions based upon my particular path through the event.

  • Serious progress is being made. E.g., Cornell announced it will be switching to a full LOD library implementation in the Fall. There are lots of great projects and initiatives already underway.

  • Some very competent tools have been developed for converting to LOD and for managing LOD implementations. The development of tools is obviously crucial.

  • There isn’t obvious agreement about the standard ways of doing most things. There’s innovation, re-invention, and lots of lively discussion.

  • Some of the most interesting and controversial discussions were about whether libraries are being too library-centric and not web-centric enough. I find this hugely complex and don’t pretend to understand all the issues. (Also, I find myself — perhaps unreasonably — flashing back to the Standards Wars in the late 1980s.) Anyway, the argument crystallized to some degree around BIBFRAME, the Library of Congress’ initiative to replace and surpass MARC. The criticism raised in a couple of sessions was that Bibframe (I find the all caps to be too shouty) represents how libraries think about data, and not how the Web thinks, so that if Bibframe gets the bib data right for libraries, Web apps may have trouble making sense of it. For example, Bibframe is creating its own vocabulary for talking about properties that other Web standards already have names for. The argument is that if you want Bibframe to make bib data widely available, it should use those other vocabularies (or, more precisely, namespaces). Kevin Ford, who leads the Bibframe initiative, responds that you can always map other vocabs onto Bibframe’s, and while Richard Wallis of OCLC is enthusiastic about the very webby Schema.org vocabulary for bib data, he believes that Bibframe definitely has a place in the ecosystem. Corey Harper and Debra Riley-Huff, on the other hand, gave strong voice to the cultural differences. (If you want to delve into the mapping question, explore the argument about whether Bibframe’s annotation framework maps to Open Annotation.)

  • I should add that although there were some strong disagreements about this at LODLAM, the participants seem to be genuinely respectful.

  • LOD remains really really hard. It is not a natural way of thinking about things. Of course, neither are old-fashioned database schemas, but schemas map better to a familiar forms-based view of the world: you fill in a form and you get a record. Linked data doesn’t even think in terms of records (there’s a rough sketch of the difference just after this list). Even with the new generation of tools, linked data is hard.

  • LOD is the future for library, archive, and museum data.
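To make the records-versus-triples point a bit more concrete, here is a rough sketch in Python, not tied to any particular triple store or library. The item URI and the “bibframe-ish” property URI are made up for illustration; the schema.org property names are real.

# A record: one self-contained row, with field names local to this database.
record = {
    "id": "123",
    "title": "The Savage Mind",
    "creator": "Claude Levi-Strauss",
}

# Linked data: the same facts as subject-predicate-object triples, where
# anything that can be a URI is a URI.
ITEM = "http://example.org/item/123"
triples = [
    # The webby choice: reuse a vocabulary the rest of the Web already knows.
    (ITEM, "http://schema.org/name", "The Savage Mind"),
    (ITEM, "http://schema.org/author", "http://example.org/person/levi-strauss"),
    # The library-centric choice: the same fact under a domain-specific vocabulary
    # (this property URI is invented for illustration).
    (ITEM, "http://example.org/bibframe-ish/title", "The Savage Mind"),
]

# There is no "record" to fetch; you ask for whatever triples mention the item,
# whether they come from this store or from anyone else using the same URIs.
for s, p, o in triples:
    if s == ITEM:
        print(p, "->", o)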


Here’s a list of brief video interviews I did at LODLAM:


March 28, 2013

[annotations][2b2k] Rob Sanderson on annotating digitized medieval manuscripts

Rob Sanderson [twitter:@azaroth42] of Los Alamos is talking about annotating Medieval manuscripts.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

He says many Medieval manuscripts are being digitized. The Mellon Foundation is funding many such projects. But these have tended to reinvent the same tech, and have not been designed for interoperability with other projects. So the Digital Medieval Initiative was founded, with a long list of prestigious partners. They thought about what they’d like: distributed, linked data, interoperable, etc. For this they need a shared description format.

The traditional approach is to annotate an image of a page. But it can be very difficult to know which images to annotate; he gives as an example a page that has fold-outs. “The naive assumption is that an image equals a page.” But there may be fragments, or only portions of the page have been digitized (e.g., the illuminations), etc. There may be multiple images on a page, revealed by multi-spectral imaging. There may be multiple orientations of the page, etc.

The solution? The canvas paradigm. A canvas is an empty space corresponding to the rectangle (or whatever) of the page. You allow rich resources to be associated with it, and allow users to comment. For this, they use Open Annotation. You can specify a choice of images. You can associate text with an area of the canvas. There are lots of different ways to visualize those comments: overlays, side-by-side, etc.
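Here is a rough sketch of the shape of such an annotation, written as a Python dict. The property names loosely follow the Open Annotation idiom (a body, a target, a selector for a region), but they are an approximation for illustration, not the exact SharedCanvas serialization, and all the URIs are invented.

import json

# A comment anchored to a region of a canvas. The canvas is an abstract, empty
# space standing in for the page; images are associated with it separately.
# Property names loosely follow Open Annotation (body / target / selector);
# treat them as an approximation, not the exact SharedCanvas serialization.
comment_annotation = {
    "@id": "http://example.org/anno/1",
    "@type": "oa:Annotation",
    "body": {
        "@type": "cnt:ContentAsText",
        "chars": "Marginal gloss in a later hand.",
    },
    "target": {
        "@type": "oa:SpecificResource",
        "source": "http://example.org/canvas/f42r",   # the canvas, not an image
        "selector": {
            "@type": "oa:FragmentSelector",
            "value": "xywh=120,340,400,80",            # a region of the canvas
        },
    },
}

# One of possibly several images (normal light, multi-spectral, a detail shot)
# associated with the same canvas, so a viewer can offer the user a choice.
image_annotation = {
    "@type": "oa:Annotation",
    "body": {"@id": "http://example.org/images/f42r-uv.jpg", "@type": "dctypes:Image"},
    "target": "http://example.org/canvas/f42r",
}

print(json.dumps(comment_annotation, indent=2))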

You can build hybrid pages. For example, an old scan might have a new color scan of its illustrations pointing at it. Or you could have a recorded performance of a piece of music pointing at the musical notation.

In summary, the SharedCanvas model uses open standards (HTML 5, Open Annotation, TEI, etc.) and can be implemented in a distributed way across repositories, encouraging engagement by domain experts.


May 30, 2012

Interop: The Book

John Palfrey and Urs Gasser are giving a book talk at Harvard about their new book, Interop. (It’s really good. Broad, thoughtful, engaging. Not at all focused on geeky tech issues.) NOTE: Posted without re-reading

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

[The next day: Nathan Matias has posted a far better live-blog post of this event.]

JP says the topic of interop seems on the face of it like it should be “very geeky and very dull.” He says the book started out fairly confined, about the effect of interop on innovation. But as they worked on it, it got broader. E.g., the Facebook IPO has been spun as a story about the stock price’s ups and downs. But from an interop perspective, the story is about why FB was worth $100B or more when its revenues don’t indicate any such thing. It’s because FB’s interop in our lives makes it hard to extract. But from this also come problems, which is why the subtitle of the book Interop talks about its peril.

Likewise, the Flame virus shows that viral outbreaks cannot be isolated easily. We need fire breaks to prevent malware from spreading.

In the book, JP and Urs look at how railroad systems became interoperable. Currency is like that, too: currencies vary but we are able to trade across borders. This has been great for the global economy, but it also creates problems. E.g., the Greek economic meltdown shows the interdependencies of economies.

The book gives a concise def of interop: “The ability to transfer and render useful data and other information across systems (including organizations), applications or components.” But that is insufficient. The book sees interop more broadly as “The art and science of working together.” The book talks about interop in terms of four levels: data, tech, humans, and institutions.

They view the book as an inquiry, some of which is expressed in a series of case studies and papers.

Urs takes the floor. He’s going to talk about a few case studies.

First, how can we make our cities smarter using tech? (Urs shows an IBM video that illustrates how dependent we are on sharing information.) He draws some observations:

  • Solutions to big societal problems increasingly depend on interoperability — from health care to climate change.

  • Interop is not black or white. Many degrees. E.g., power plugs are not interoperable around the world, but there are converters. Or, international air travel requires a lot of interop among the airlines.

  • Interop is a design challenge. In fact, once you’ve messed up with interop, it’s hard to make it right. E.g., it took a long time to fix air traffic control systems because there was a strongly embedded legacy system.

  • There are important benefits, including systems efficiency, user choice, and economic growth.

Urs points to their four-layer model. To make a smart city, the tech the firefighters and police use needs to interop, as does their data. But at the human layer, the language used can vary among branches; e.g., “333” might code one thing for EMTs and another for the police. At the institutional layer, the laws for privacy might not be interoperable, making it hard for businesses to work globally.

Second example: When Facebook opened its APIs so that other apps could communicate with FB, there was a spike in innovation; 4k apps were made by non-FB devs that plug into FB. FB’s decision to become more interoperable led to innovation. Likewise for Twitter. “Much of the story behind Twitter is an interop question.”

Likewise for Ushahidi; after the Haitian earthquake, it made a powerful platform that enabled people to share and accumulate info, mapping it, across apps and devices. This involved all layers of the interop stack, from data to institutions such as the UN pitching in. (Urs also points to safe2pee.org :)

Observations:

  • There’s a cycle of interop, competition, and innovation.

  • There are theories of innovation, including generativity (Zittrain), user-driven innovation (Von Hippel) and small-step innovations (Christensen).

  • Caveat: More interop isn’t always good. A highly interop business can take over the market, creating a de facto monopoly, and suppressing innovation.

  • Interop also can help diffuse adoption. E.g., the transition to high-def TV: it only took off once the TVs were able to interoperate between analog and digital signals.

Example 3: Credit cards are highly interoperable: whatever your buying opportunity is, you can use a selection of cards that work with just about any bank. Very convenient.

Observations:

  • This level of interop comes with costs and risks: identity theft, security problems, etc.

  • The benefits outweigh the risks

  • This is a design problem

  • More interop creates more problems because it means there are more connection points.

Example 4: Cell phone chargers. Traditionally phones had their own chargers. Why? Europe addressed this by the “Sword of Damocles” approach that said that if the phone makers didn’t get their act together, the EC would regulate them into it. The micro-USB charger is now standard in Europe.

Observations:

  • It can take a long time, because of the many actors, legacy problems, and complexity.

  • It’s useful to think about these issues in terms of a 2×2 of regulation/non-regulation, and collaborative-unilateral.

JP back up. He is going to talk about libraries and the preservation of knowledge as interop problems. Think about this as an issue of maintaining interop over time. E.g., try loading up one of your floppy disks. The printed version is much more useful over the long term. Libraries find themselves in a perverse situation: If you provide digital copies of books, you can provide much less than physical books. Five of the six major publishers won’t let libraries lend e-versions. It’d make sense to have new books provided in an open standard format. So, even if libraries could lend the books, people might not have the interoperable tech required to read them. Yet libraries are spending more on e-books, and less on physical books. If libraries have digital copies and not physical copies, they are vulnerable to tech changes. How do we ensure that we can continuously update? The book makes a fairly detailed suggestion. But as it stands, as we switch from one format to another over time, we’re in worse shape than if we had physical books. We need to address this. “When it comes to climate change, or electronic health records, or preservation of knowledge, interop matters, both as a theory and as a practice.” We need to do this by design up front, deciding what the optimal interop is in each case.

Q&A

Q: [doc searls] Are there any places where you think we should just give up?

A: [jp] I’m a cockeyed optimist. We thought that electronic health records in the US were the hardest case we came across.

Q: How does the govt conduct consultations with experts from across the US? What would it take to create a network of experts?

A: [urs] Lots of expert networks have emerged, enabled by tech that fosters human interoperability from the bottom up.
A: [jp] It’s not clear to me that we want that level of consultation. I don’t know that we could manage direct democracy enabled in that way.

Q: What are the limits you’d like to see emerge on interop? I.e., I’m thinking of problems of hyper-coherence in bio: a single species of rice or corn may be more efficient, but with one blight it can turn out to have been a big mistake. How do you build in systems of self-limit?

[urs] We try to address this somewhat in a chapter on diversity, which begins with biodiversity. When we talk about interop, we do not suggest merging or unifying systems. To the contrary, interop is a way to preserve diversity, and prevent fragmentation within diversity. It’s extremely difficult to find the optimum, which varies from case to case, and to decide on which speed bumps to put in place.
[jp] You’ve gone to the core of what we’re thinking about.

Q: Human autonomy, efficiency, and economic growth are three of the benefits you mention, but they can be in conflict with one another. How important are decentralized systems?

[urs] We’re not arguing in favor of a single system, e.g., that we have only one type of cell phone. That’s exactly not what we’re arguing for. You want to work toward the sweet spot of interop.
[jp] They are in tension, but there are some highly complex systems where they coexist. E.g., the Web.

Q: Yes, having a single cell phone charger is convenient. But there may be a performance tradeoff, where you can’t choose the optimum voltage if you standardize on 5V. And an innovation deficit: you won’t get magnetic plugs, etc.

[urs] Yes. This is one of the potential downsides of interop. It may lock you in. When you get interop by choosing a standard, you freeze the standard for the future. So one of the additional challenges is: how can we incorporate mechanisms of learning into standards-setting?
Cell phone chargers don’t have a lot of layers on top of them, so standardizing them doesn’t ripple through generativity quite as much. And that’s what the discussion should be about.


March 4, 2011

[2b2k] Tagging big data

According to an article in Science Insider by Dennis Normile, a group formed at a symposium sponsored by the Board on Global Science and Technology, of the National Research Council, an arm of the U.S. National Academies [that’s all they’ve got??] is proposing making it easier to find big scientific data sets by using a standard tag, along with a standard way of conveying the basic info about the nature of the set, and its terms of use. “The group hopes to come up with a protocol within a year that researchers creating large data sets will voluntarily adopt. The group may also seek the endorsement of the Internet Engineering Task Force…”


February 20, 2011

[2b2k] Public data and metadata, Google style

I’m finding Google Labs’ Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

  • You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]

  • For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.

  • If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.

  • You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.

  • Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.
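To make that a bit more concrete, while following the post’s own practice of not reproducing exact DSPL syntax, here is a rough Python sketch of the pieces described above: concept definitions, an entity table for countries, and a “countries slice” written out as the CSV file that the XML definitions would reference. The names and figures are illustrative, not the actual DSPL element names or real statistics.

import csv

# Concepts: a unique ID, a data type, and explanatory info.
# (Plain Python data standing in for the XML concept definitions.)
concepts = {
    "pop":     {"type": "integer", "name": "Population",
                "definition": "The number of human beings in a geographic area"},
    "country": {"type": "string",  "name": "Country"},
    "year":    {"type": "date",    "name": "Year"},  # real DSPL would reference Google's canonical time concept
}

# An entity table for a concept with a known set of members (countries),
# which would ship as its own CSV.
countries = [
    {"country": "US", "name": "United States"},
    {"country": "FR", "name": "France"},
]

# A "countries slice": country and year are the dimensions, population the metric.
countries_slice = [
    {"country": "US", "year": 2008, "pop": 304000000},  # illustrative figures
    {"country": "US", "year": 2009, "pop": 307000000},
    {"country": "FR", "year": 2009, "pop": 64700000},
]

# The slice data itself is a CSV file that the XML slice definition points at.
with open("countries_slice.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "year", "pop"])
    writer.writeheader()
    writer.writerows(countries_slice)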

Overall, I think this is a good thing. But:

  • While Google is making its formats public, and even its canonical definitions are downloadable, DSPL is “fully open” for use, but fully Google’s to define. Having the 800-lbs gorilla defining the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.

  • Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.

  • Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element?

  • Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)

Overall, I think it’s a good step forward.


June 5, 2010

Book formats and formatting books

AKMA points to an excellent post by Jacqui Cheng at Ars Technica about the fragmentation of ebook standards. AKMA would love to see a publisher offer an easy way of “pouring” text into an open format that creates a useful, beautiful digital book. Jacqui points to the major hurdle: Ebook makers like owning their own format so they can “vertically integrate,” i.e., lock users into their own bookstores.

Even if there weren’t that major economic barrier, it’d be hard to do what AKMA and we all want because books are incredibly complex objects. You can always pour text into a new container, but it’s much harder to capture the structure of the book: this is a title, that is body text, this is a caption. The structure is then reflected in the format of the book (titles are bolded, etc.) and in the electronic functionality of the book (tables of contents are linked to the contents, etc.). We are so well-trained in reading books that even small errors interrupt our reading: My Kindle doesn’t count hyphens as places where words can be broken across lines, resulting in some butt-ugly layouts. A bungled drop cap can mystify you for several seconds. White-space breaks between sections that are not preserved when they occur at the end of a page can ruin a good mid-chapter conclusion. It’s not impossible to get all this right, but it’s hard.

And getting a standard format that captures the right degrees of structure and of format, and that is sufficiently forgiving that just about anything can be poured into it, is really difficult because there are no such right degrees. E.g., epub is not great at layout info, at least according to Wikipedia.

All I’m saying is: It’s really really hard.


September 18, 2009

[berkman] Transforming Scholarly Communication

Lee Dirks [site], Director of Education and Scholarly Communication at Microsoft External Research, is giving a Berkman-sponsored talk on “Transforming Scholarly Communications.” His group works with various research groups “to develop functionality that we think would benefit the community overall,” with Microsoft possibly as a facilitator. (Alex Wade from his group is also here.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

He begins by noting the “data deluge.” But computing is stepping up to the problem: massive data sets, the evolution of multicore, and the power of the cloud. We’ll need all that (Lee says) because the workflow for processing all the new info we’re gathering hasn’t kept up with the amount we’re taking in via sensor networks, global databases, laboratory instruments, desktops, etc. He points to the Life Under Your Feet project at Johns Hopkins as an example. They have 200 wireless computers, each with 10 sensors, monitoring air and soil temperature and moisture, and much more. (Microsoft funds it.) Lee recommends Joe Hellerstein’s blog if you’re interested in “the commoditization of massive data analysis.” We’re at the very early stages of this, Lee says. For e-scientists and e-researchers, there’s just too much: too much data, too much workflow, too much “opportunity.”


We need to move upstream in the research lifecycle: 1. collect data and do research, 2. author it, 3. publish, and then 4. store and archive it. That store then feeds future research and analysis. Lee says this four-step lifecycle needs collaboration and discovery. Libraries and archives spend most of their time in stage 4, but they ought to address the problems much earlier on. The most advanced thinkers are working on these earlier stages.


“The trick there is integration.” Some domains are quite proprietary about their data, which makes it problematic to get data and curation standards so that the data can move from system to system. From Microsoft’s perspective, the question is how can they move from static summaries to much richer information vehicles. Why can’t research reports be containers that facilitate reproducible science? It should help you use your methodology against its data set. Alter data and see the results, and then share it. Collaborate in real time with other researchers. Capture reputation and influence. Dynamic documents. [cf. Interleaf Active Documents, circa 1990. The dream still lives!]


On the commercial side, Elsevier has been running an “Article of the Future Competition.” Other examples: PLoS Currents: Influenza. Nature Precedings. Google Wave. Mendeley (“iTunes for academic papers”). These are “chinks in the armor of the peer review system.”


Big changes, Lee says. We’ll see more open access and new economic models, particularly adding services on top of content. We’ll see a world in which data is increasingly easily sharable. E.g., the Sloan Digital Sky Survey is a prototype in data publishing: 350M web hits in 6 yrs, 930K distinct users, 10K astronomers, delivered 100B rows of data. Likewise, GalaxyZoo.org, at which the public can classify galaxies and occasionally discover a new object or two.


Lee points to challenges with data sharing: integrating it, annotating, maintaining provenance and quality, exporting in agreed formats, security. These issues have stopped some from sharing data, and have forced some communities to remain proprietary. “The people who can address these problems in creative ways” will be market leaders moving forward.


Lee points to some existing sharing and analysis services. Swivel, IBM’s Many Eyes, Google’s Gapminder, Freebase, CSA’s Illustra…


The business models are shifting. Publishers are now thinking about data sharing services. IBM and Red Hat provide an interesting model: giving the code away but selling services. Repositories will contain not only the full-text versions of research papers, but also “gray” literature such as technical reports and theses, and real-time streaming data, images and software. We need enhanced interoperability protocols.


E.g., Data.gov provides a searchable data catalog that provides access to the raw data and to various tools. Lee also likes WorldWideScience.org, “a global science gateway” to international scientific databases. Sixty to seventy countries are pooling their scientific data and providing federated search.


Lee believes that semantic computing will provide fantastic results, although it may take a while. He points to Cameron Neylon’s discussion of the need to generate lab report feeds. (Lee says the Semantic Web is just one of the tools that could be used for semantics-based computing.) So, how do we take advantage of this? Recommender systems, as at Last.fm and Amazon. Connotea and BioMedCentral’s Faculty of 1000 are early examples of this [LATER: Steve Pog’s comment below says Faculty of 1000 is not owned by BioMedCentral]. Lee looks forward to the automatic correlation of scientific data and the “smart composition of services and functionality,” in which the computers do the connecting. And we’re going to need the cloud to do this sort of thing, both for the computing power and for the range of services that can be brought to bear on the distributed collection of data.


Lee spends some time talking about the cloud. Among other points, he points to SciVee and Viddler as interesting examples. Also, SmugMug as a photo aggregator that owns none of its own infrastructure. Also Slideshare and Google Docs. But these aren’t quite what researchers need, which is an opportunity. Also interesting: NSF DataNet grants.


When talking about preservation and provenance, Lee cites DuraSpace and its project, DuraCloud. It’s a cross-repository space with services added. Institutions pay for the service.


Lee ends by pointing to John Wilbanks’ concern about the need for a legal and policy infrastructure that enables and encourages sharing. Lee says that at the end of the day, it’s not software, but providing incentives and rewards to get people to participate.


Q: How soon will this happen?
A: We can’t predict which domains will arise and which ones people will take to.


Q: What might bubble up from the consumer sector?
A: It’s an amazing space to watch. There are lots of good examples already.


Q: [me] This is great to have you proselytizing outside. But as an internal advocate inside Microsoft, what does Msft still have to do, and what’s the push back?
A: We’ve built 6-8 add-ins for Word for semantic markup, scholarly writing, and consumption of ontologies. A repository platform. An open source foundation separate from Microsoft, contributing to the Linux kernel, etc.

Q: You’d be interested in Dataverse.org.
A: Yes, it sounds like it.


Q: Data is agnostic, but articles aren’t…
A: We’re trying to figure out how to embed and link. But we’re also thinking about how you do it without the old containers, on the Web, in Google Wave, etc.
Q: Are you providing a way to ID relationships?
A: In part. For people using their ordinary tools (e.g., Word), we’re providing ways to import ontologies, share them with the repository or publisher, etc.


Q: How’s auto-tagging coming? The automatic creation of semantically correct output?
A: We’re working on this. A group at Oxford doing cancer research allows researchers to semantically annotate within Excel, so that the spreadsheet points to an ontology that specifies the units, etc. Fluxnet.org is an example of collaborative curation within a single framework.


Q: Things are blurring. Traditionally libraries collect, select, and preserve scholarly info. What do you think the role of the library will be?
A: I was an academic librarian. In my opinion, the safe world of collecting library journals has been done. We know how to do it. The problem these days is data curation, providing services, working with publishers.
Q: It still takes a lot of money…
A: Definitely. But the improvements are incremental. The bigger advances come further up the stream.

Q: Some cultures will resist sharing…
A: Yes. It’ll vary from domain to domain, and within domains. In some cases we’ll have to wait a generation.


Q: What skills would you give a young librarian?
A: I don’t have a pat answer for you. But, a service orientation would help, building services on top of the data, for example. Multi-disciplinary partnerships.


Q: You’re putting more info online. Are you seeing the benefit of that?
A: Most researchers already have Microsoft software, so we’re not putting the info up in order to sell more. We’re trying to make sure researchers know what’s there for them.


July 24, 2009

A twisty path to Chrome in the enterprise

Despite the title of Andrew Conry-Murray’s article in InformationWeek — “Why Business IT Shouldn’t Shrug Off Chrome OS” — it’s on balance quite negative about the prospects for enterprises adopting Google’s upcoming operating system. Andrew argues that enterprises are going to want hybrid systems, Microsoft is already moving into the Cloud, Windows 7 will have been out for a year before Chrome is available, and it’d take a rock larger than the moon to move enterprises off their legacy applications. All good points. (The next article in the issue, by John Foley, is more positive about Chrome overall.)

A couple of days ago I heard a speech by Federal CTO Aneesh Chopra at the Open Government Innovations conference (#ogi to you Twitter buffs). It was fabulous. Aneesh — and he’s an informal enough speaker that I feel ok first-naming him — loves the Net and loves it for the right reasons. (“Right” of course means I agree with him.) The very first item on his list of priorities might be moon-sized when it comes to enterprise IT: Support open standards.

So, suppose the government requires contractors and employees to use applications that save content in open standards. In the document world, that means ODF. Now, ISO also approved a standard favored by (= written by) Microsoft, OOXML, that is far more complex and is highly controversial. There is an open source plug-in for Word that converts Word documents to those formats (apparently Microsoft aided in its development), but that’s not quite native support. So, imagine the following scenario (which I am totally making up): The federal government not only requires that the docs it deals with are in open standard formats, it switches to open source desktop apps in order to save money on license fees. (Vivek Kundra switched tens of thousands of DC employees to open source apps for this reason.) OOXML captures more of the details of a Word document, but ODF is a more workable standard, and it’s the format of the leading open source office apps. If the federal government were to do this, ODF stands a chance of becoming the safe choice for interchanging documents; it’s the one that will always work. And in that case, enterprises might find Word to be over-featured and insufficiently ODF-native.

Now, all of this is pure pretend. And even if ODF were to become the dominant document standard, Microsoft could support it robustly, although that might mean that some of Word’s formatting niceties wouldn’t make the transition. Would business be ok with that? For creators, probably yes; it’d be good to be relieved of the expectation that you will be a document designer. For readers, no. We’ll continue to want highly formatted documents. But, then ODF + formatting specifications can produce quite respectably formatted docs, and that capability will only get better.

So, how likely is my scenario — the feds demand ODF, driving some of the value out of Word, giving enterprises a reason to install free, lower-featured word processors, depriving Windows of one of its main claims on the enterprise’s heart and wallet? Small. But way higher than before we elected President Obama.

