Lee Dirks, Director of Education and Scholarly Communication at Microsoft External Research, is giving a Berkman-sponsored talk on “Transforming Scholarly Communications.” His group works with various research groups “to develop functionality that we think would benefit the community overall,” with Microsoft possibly as a facilitator. (Alex Wade from his group is also here.)
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.
He begins by noting the “data deluge.” But computing is stepping up to the problem: Massive data sets, evolution of multicore, and the power of the cloud. We’ll need all that (Lee says) because the workflow for processing all the new info we’re gathering hasn’t kept up with the amount we’re taking in via sensor networks, global databases, laboratory instruments, desktops, etc. He points to the Life Under Your Feet project at Johns Hopkins as an example. They have 200 wireless computers, each with 10 sensors, monitoring air and soil temperature and moisture, and much more. (Microsoft funds it.) Lee recommends Joe Hellerstein’s blog if you’re interested in “the commoditization of massive data analysis.” We’re at the very early stages of this, Lee says. For e-scientists and e-researchers, there’s just too much: too much data, too much workflow, too much “opportunity.”
We need to move upstream in the research lifecycle: 1. collect data and do research, 2. author it, 3. publish, and then 4. store and archive it. That store then feeds future research and analysis. Lee says this four-step lifecycle needs collaboration and discovery. Libraries and archives spend most of their time in stage 4, but they ought to address the problems much earlier on. The most advanced thinkers are working on these earlier stages.
“The trick there is integration.” Some domains are quite proprietary about their data, which makes it problematic to get data and curation standards so that the data can move from system to system. From Microsoft’s perspective, the question is how can they move from static summaries to much richer information vehicles. Why can’t research reports be containers that facilitate reproducible science? It should help you use your methodology against its data set. Alter data and see the results, and then share it. Collaborate real time with other researchers. Capture reputation and influence. Dynamic documents. [cf. Interleaf Active Documents, circa 1990. The dream still lives!]
On the commercial side, Elsevier has been running an “Article of the Future Competition.” Other examples: PLoS Currents: Influenza. Nature Precedings. Google Wave. Mendeley (“iTunes for academic papers”). These are “chinks in the armor of the peer review system.”
Big changes, Lee says. We’ll see more open access and new economic models, particularly adding services on top of content. We’ll see a world in which data is increasingly easily sharable. E.g., the Sloan Digital Sky Survey is a prototype in data publishing: 350M web hits in 6yrs, 930k distinct users, 10k astronomers, delivered 100B rows of data. Likewise, GalaxyZoo.org, at which the public can classify galaxies and occasionally discover a new object or two.
Lee points to challenges with data sharing: integrating it, annotating, maintaining provenance and quality, exporting in agreed formats, security. These issues have stopped some from sharing data, and have forced some communities to remain proprietary. “The people who can address these problems in creative ways” will be market leaders moving forward.
The business models are shifting. Publishers are now thinking about data sharing services. IBM and Red Hat provide an interesting model: Giving the code away but selling services. Repositories will contain not only the full text versions of research papers, but also “gray” literature “such as technical reports and theses,” and real-time streaming data, images and software. We need enhanced interoperability protocols.
E.g., Data.gov provides a searchable data catalog that gives access to the raw data and through various tools. Lee also likes WorldWideScience.org, “a global science gateway” to international scientific databases. Sixty to seventy countries are pooling their scientific data and providing federated search.
Lee believes that semantic computing will provide fantastic results, although it may take a while. He points to Cameron Neylon’s discussion of the need to generate lab report feeds. (Lee says the Semantic Web is just one of the tools that could be used for semantics-based computing.) So, how do we take advantage of this? Recommender systems, as at Last.fm and Amazon. Connotea and BioMedCentral’s Faculty of 1000 are early examples of this [LATER: Steve Pog’s comment below says Faculty of 1000 is not owned by BioMedCentral]. Lee looks forward to the automatic correlation of scientific data and the “smart composition of services and functionality,” in which the computers do the connecting. And we’re going to need the cloud to do this sort of thing, both for the computing power and for the range of services that can be brought to bear on the distributed collection of data.
Lee spends some time talking about the cloud. Among other things, he points to SciVee and Viddler as interesting examples. Also, SmugMug as a photo aggregator that owns none of its own infrastructure. Also Slideshare and Google Docs. But these aren’t quite what researchers need, which is an opportunity. Also interesting: NSF DataNet grants.
When talking about preservation and provenance, Lee cites DuraSpace and its project, DuraCloud. It’s a cross-repository space with services added. Institutions pay for the service.
Lee ends by pointing to John Wilbanks’ concern about the need for a legal and policy infrastructure that enables and encourages sharing. Lee says that at the end of the day, it’s not software, but providing incentives and rewards to get people to participate.
Q: How soon will this happen?
A: We can’t predict which domains will arise and which ones people will take to.
Q: What might bubble up from the consumer sector?
A: It’s an amazing space to watch. There are lots of good examples already.
Q: [me] This is great to have you proselytizing outside. But as an internal advocate inside Microsoft, what does Msft still have to do, and what’s the push back?
A: We’ve built 6-8 add-ins for Word for semantic markup, scholarly writing, consumption of ontologies. A repository platform. An open source foundation separate from Microsoft, contributing to the Linux kernel, etc.
Q: You’d be interested in Dataverse.org.
A: Yes, it sounds like it.
Q: Data is agnostic, but articles aren’t…
A: We’re trying to figure out how to embed and link. But we’re also thinking about how you do it without the old containers, on the Web, in Google Wave, etc.
Q: Are you providing a way to ID relationships?
A: In part. For people using their ordinary tools (e.g., Word), we’re providing ways to import ontologies, share them with the repository or publisher, etc.
Q: How’s auto-tagging coming? The automatic creation of semantically correct output?
A: We’re working on this. A group at Oxford doing cancer research allows researchers to semantically annotate within Excel, so that the spreadsheet points to an ontology that specifies the units, etc. Fluxnet.org is an example of collaborative curation within a single framework.
Q: Things are blurring. Traditionally libraries collect, select and preserve scholarly info. What do you think the role of the library will be?
A: I was an academic librarian. In my opinion, the safe world of collecting library journals has been done. We know how to do it. The problem these days is data curation, providing services, working with publishers.
Q: It still takes a lot of money…
A: Definitely. But the improvements are incremental. The bigger advances come further up the stream.
Q: Some cultures will resist sharing…
A: Yes. It’ll vary from domain to domain, and within domains. In some cases we’ll have to wait a generation.
Q: What skills would you give a young librarian?
A: I don’t have a pat answer for you. But, a service orientation would help, building services on top of the data, for example. Multi-disciplinary partnerships.
Q: You’re putting more info online. Are you seeing the benefit of that?
A: Most researchers already have Microsoft software, so we’re not putting the info up in order to sell more. We’re trying to make sure researchers know what’s there for them.