Joho the Blog » science

August 13, 2012

Hummingbirds live in The Shire

I’ve been watching hummingbirds at our feeder, and took a moment to read up on them a bit more. Hummingbirds.net has a lot of interesting information, including about their impossible migrations. (These migrations are tracked on the Internet, with sightings reported by people like you and me.)

But what really amused me was this straightforward and presumably accurate description of their nests:

The walnut-sized nest, built by the female, is constructed on a foundation of bud scales attached to a tree limb with spider silk; lichens camouflage the outside, and the inside is lined with dandelion, cattail, or thistle down.

Undoubtedly tended by singing dragonflies that feed on unicorn tears.


July 19, 2012

[2b2k][eim]Digital curation

I’m at the “Symposium on Digital Curation in the Era of Big Data” held by the Board on Research Data and Information of the National Research Council. These liveblog notes cover (in some sense — I missed some folks, and have done my usual spotty job on the rest) the morning session. (I’m keynoting in the middle of it.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Alan Blatecky [pdf] from the National Science Foundation says science is being transformed by Big Data. [I can't see his slides from the panel at front.] He points to the increase in the volume of data, but we haven’t paid enough attention to the longevity of the data. And, he says, some data is centralized (LHC) and some is distributed (genomics). And, our networks are unable to transport large amounts of data [see my post], making where the data is located quite significant. NSF is looking at creating data infrastructures. “Not one big cloud in the sky,” he says. Access, storage, services — how do we make that happen and keep it leading edge? We also need a “suite of policies” suitable for this new environment.


He closes by talking about the Data Web Forum, a new initiative to look at a “top-down governance approach.” He points positively to the IETF’s “rough consensus and running code.” “How do we start doing that in the data world?” How do we get a balanced representation of the community? This is not a regulatory group; everything will be open source, and progress will be through rough consensus. They’ve got some funding from gov’t groups around the world. (Check CNI.org for more info.)


Now Josh Greenberg from the Sloan Foundation. He points to the opportunities presented by aggregated Big Data: the effects on social science, on libraries, etc. But the tools aren’t keeping up with the computational power, so researchers are spending too much time mastering tools, plus it can make reproducibility and provenance trails difficult. Sloan is funding some technical approaches to increasing the trustworthiness of data, including in publishing. But Sloan knows that this is not purely a technical problem. Everyone is talking about data science. Data scientist defined: Someone who knows more about stats than most computer scientists, and can write better code than typical statisticians :) But data science needs to better understand stewardship and curation. What should the workforce look like so that the data-based research holds up over time? The same concerns apply to business decisions based on data analytics. The norms that have served librarians and archivists of physical collections now apply to the world of data. We should be looking at these issues across the boundaries of academia, science, and business. E.g., economics research now rests on data from Web businesses, the US Census, etc.

[I couldn't liveblog the next two — Michael and Myron — because I had to leave my computer on the podium. The following are poor summaries.]

Michael Stebbins, Assistant Director for Biotechnology in the Office of Science and Technology Policy in the White House, talked about the Administration’s enthusiasm for Big Data and open access. It’s great to see this degree of enthusiasm coming directly from the White House, especially since Michael is a scientist and has worked for mainstream science publishers.


Myron Gutmann, Ass’t Dir of the National Science Foundation, likewise expressed commitment to open access, and said that there would be an announcement in Spring 2013 that in some ways will respond to the recent UK and EC policies requiring the open publishing of publicly funded research.


After the break, there’s a panel.


Anne Kenney, Dir. of Cornell U. Library, talks about the new emphasis on digital curation and preservation. She traces this back at Cornell to 2006, when an E-Science task force was established. She thinks we now need to focus on e-research, not just e-science. She points to Walters and Skinner’s “New Roles for New Times: Digital Curation for Preservation.” When it comes to e-research, Anne points to the need for metadata stabilization, harmonizing applications, and collaboration in virtual communities. Within the humanities, she sees more focus on curation, the effect of the teaching environment, and more of a focus on scholarly products (as opposed to the focus on scholarly process, as in the scientific environment).


She points to Youngseek Kim et al., “Education for eScience Professionals”: digital curators need not just subject domain expertise but also project management and data expertise. [There’s lots of info on her slides, which I cannot begin to capture.] The report suggests an increasing focus on people-focused skills: project management, bringing communities together.


She very briefly talks about Mary Auckland’s “Re-Skilling for Research” and Williford and Henry, “One Culture: Computationally Intensive Research in the Humanities and Sciences.”


So, what are research libraries doing with this information? The Association of Research Libraries has a job announcements database. And Tito Sierra did a study last year analyzing 2011 job postings. He looked at 444 job descriptions. 7.4% of the jobs were “newly created or new to the organization.” New management-level positions were significantly over-represented, while subject specialist jobs were under-represented.


Anne went through Tito’s data and found that 13.5% have “digital” in the title. There were more digital humanities positions than e-science ones. She posts a list of the new titles jobs are being given, and they’re digilicious. 55% of those positions call for a library science degree.


Anne concludes: It’s a growth area, with responsibilities more clearly defined in the sciences. There’s growing interest in serving the digital humanists. “Digital curation” is not common in the qualifications nomenclature. MLS or MLIS is not the only path. There’s a lot of interest in post-doctoral positions.


Margarita Gregg of the National Oceanic and Atmospheric Administration begins by talking about challenges in the era of Big Data. They produce about 15 petabytes of data per year. It’s not just about Big Data, though. They are very concerned with data quality. They can’t preserve all versions of their datasets, and it’s important to keep track of the provenance of that data.


Margarita directs one of NOAA’s data centers that acquires, preserves, assembles, and provides access to marine data. They cannot preserve everything. They need multi-disciplinary people, and they need to figure out how to translate this data into products that people need. In terms of personnel, they need: Data miners, system architects, developers who can translate proprietary formats into open standards, and IP and Digital Rights Management experts so that credit can be given to the people generating the data. Over the next ten years, she sees computer science and information technology becoming the foundations of curation. There is no currently defined job called “digital curator” and that needs to be addressed.


Vicki Ferrini at the Lamont-Doherty Earth Observatory at Columbia University works on data management, metadata, discovery tools, educational materials, best practice guidelines for optimizing acquisition, and more. She points to the increased communication between data consumers and producers.


For data producers, the goal is scientific discovery: data acquisition, reduction, assembly, visualization, integration, and interpretation. And then you have to document the data (= metadata).


Data consumers: They want data discoverability and access. Increasingly they are concerned with the metadata.


The goal of data providers is to provide access, preservation, and reuse. They care about data formats, metadata standards, interoperability, and the diverse needs of users. [I’ve abbreviated all these lists because I can’t type fast enough.]


At the intersection of these three domains is the data scientist. She refers to this as the “data stewardship continuum,” since it spans all three. A data scientist needs to understand the entire life cycle, have domain experience, and have technical knowledge about data systems. “Metadata is key to all of this.” Skills: communication and organization, understanding the cultural aspects of the user communities, people and project management, and a balance between micro- and macro-level perspectives.


Challenges: Hard to find the right balance between technical skills and content knowledge. Also, data producers are slow to join the digital era. Also, it’s hard to keep up with the tech.


Andy Maltz, Dir. of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences (AMPAS). AMPAS is about arts and sciences, he says, not about The Business.


The Science and Technology Council was formed in 2005. They have lots of data they preserve. They’re trying to build the pipeline for next-generation movie technologists, but they’re falling behind, so they have an internship program and a curriculum initiative. He recommends we read their study The Digital Dilemma. It says that there’s no digital solution that meets film’s requirement to be archived for 100 years at a low cost. It costs $400/yr to archive a film master vs $11,000 to archive a digital master (as of 2006) because of labor costs. [Did I get that right?] He says collaboration is key.


In January they released The Digital Dilemma 2. It found that independent filmmakers, documentarians, and nonprofit audiovisual archives are loosely coupled, widely dispersed communities. This makes collaboration more difficult. The efforts are also poorly funded, and people often lack technical skills. The report recommends the next gen of digital archivists be digital natives. But the real issue is technology obsolescence. “Technology providers must take archival lifetimes into account.” Also system engineers should be taught to consider this.


He highly recommends the Library of Congress’ “The State of Recorded Sound Preservation in the United States,” which rings an alarm bell. He hopes there will be more doctoral work on these issues.


Among his controversial proposals: Require higher math scores for MLS/MLIS students, since they tend to score lower than average on that. Also, he says that the new generation of content creators has no curatorial awareness. Executives and managers need to know that this is a core business function.


Demand side data points: 400 movies/year at 2 PB/movie (roughly 800 PB per year). CNN has 1.5M archived assets and generates 2,500 new archive objects/wk. YouTube: 72 hours of video uploaded every minute.


Takeaways:

  • Show business is a business.

  • Need does not necessarily create demand.

  • The nonprofit AV archive community is poorly organized.

  • Next gen needs to be digital natives with strong math and science skills.

  • The next gen of executive leaders needs to understand the importance of this.

  • Digital curation and long-term archiving need a business case.


Q&A


Q: How about linking the monetary value of the metadata to the metadata? That would encourage the generation of metadata.


Q: Weinberger paints a picture of a flexible world of flowing data, and now we’re back in the academic, scientific world where you want good data that lasts. I’m torn.


A: Margarita: We need to look at how that data are being used. Maybe in some circumstances the quality of the data doesn’t matter. But there are other instances where you’re looking for the highest quality data.


A: [audience] In my industry, one person’s outtakes are another person’s director’s cut.


A: Anne: In the library world, we say that if a little metadata is good, a lot of it would be great. We need to step away from trying to capture the most and toward capturing the most useful (since we can’t capture everything). And how do you produce data in a way that’s opened up to future users, as well as being useful for its primary consumers? It’s a very interesting balance that needs to be struck. Maybe short-term needs rank higher and long-term needs rank lower.


A: Vicki: The scientists I work with use discrete data sets, spreadsheets, etc. As we go along, we’ll have new ways to check the quality of datasets, so we can use the messy data as well.


Q: Citizen curation? E.g., a lot of antiques are curated by being put into people’s attics… Not sure what that might imply as a model. Two parallel models?


A: Margarita: We’re going to need to engage anyone who’s interested. We need to incorporate citizen curation.


Anne: That’s already underway where people have particular interests. E.g., Cornell’s Lab of Ornithology where birders contribute heavily.


Q: What one term will bring people info about this topic?


A: Vicki: There isn’t one term, which speaks to the linked data concept.


Q: How will you recruit people from all walks of life to have the skills you want?


A: Andy: We need to convince people way earlier in the educational process that STEM is cool.


A: Anne: We’ll have to rely to some degree on post-hire education.


Q: My shop produces and integrates lots of data. We need people with domain and computer science skills. They’re more likely to come out of the domains.


A: Vicki: As long as you’re willing to take the step across the boundary, it doesn’t matter which side you start from.


Q: 7 yrs ago in library school, I was told that you need to learn a little programming so that you understand it. I didn’t feel like I had to add a whole other profession on to the one I was studying.


July 7, 2012

[2b2k] Big Data needs Big Pipes

A post by Stacey Higginbotham at GigaOm talks about the problems of moving Big Data across the Net so that it can be processed. She draws on an article by Mari Silbey at SmartPlanet. Mari’s example is a telescope being built on Cerro Pachon, a mountain in Chile, that will ship many high-resolution sky photos every day to processing centers in the US.

Stacey discusses several high-speed networks and the possibility of compressing the data in clever ways. But a person on a mailing list I’m on (who wishes to remain anonymous) pointed to GLIF, the Global Lambda Integrated Facility, which rather surprisingly is not a cover name for a nefarious organization out to slice James Bond in two with a high-energy laser pointer.

The title of its “informational brochure” [pdf] is “Connecting research worldwide with lightpaths,” which helps some. It explains:

GLIF makes use of the cost and capacity advantages offered by optical multiplexing, in order to build an infrastructure that can take advantage of various processing, storage and instrumentation facilities around the world. The aim is to encourage the shared use of resources by eliminating the traditional performance bottlenecks caused by a lack of network capacity.

Multiplexing is the carrying of multiple signals at different wavelengths on a single optical fiber. And these wavelengths are known as … wait for it … lambdas. Boom!

My mailing list buddy says that GLIF provides “100 gigabit optical waves,” which compares favorably to your pathetic earthling (um, American) 3-20 megabit broadband connection (maybe 50 megabits if you have FiOS), and he notes that GLIF is available in Chile.
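
To make the gap concrete, here is a minimal back-of-the-envelope sketch in Python. The 15 TB nightly haul is an assumed round number for illustration, not a figure from either article:

```python
# Rough transfer times for a hypothetical 15 TB night of telescope data.
# (15 TB is an assumed figure for illustration, not from the articles.)

def transfer_hours(terabytes, megabits_per_second):
    """Hours needed to move `terabytes` of data at `megabits_per_second`."""
    bits = terabytes * 1e12 * 8                   # decimal terabytes -> bits
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600

for label, mbps in [("20 Mbps US broadband", 20),
                    ("50 Mbps FiOS", 50),
                    ("100 Gbps GLIF lightpath", 100_000)]:
    print(f"{label:>25}: {transfer_hours(15, mbps):9.1f} hours")
```

At the broadband rates, a single night’s data takes roughly one to two months to move; on a 100 gigabit lightpath it takes about twenty minutes (ignoring protocol overhead and the difference between peak and sustained rates).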

To sum up: 1. Moving Big Data is an issue. 2. We are not at the end of innovating. 3. The bandwidth we think of as “high” in the US is a miserable joke.


By the way, you can hear an uncut interview about Big Data I did a few days ago for Breitband, a German radio program that edited, translated, and broadcast it.


June 30, 2012

Amerzing pherters

Two sets of amazing photos:

Wikimedia Commons has announced its best photos of the year.

Here’s one I like. It’s by Simon Pierre Barrette.

Also, the New York Hall of Science is exhibiting the winners of the international Olympus BioScapes Digital Imaging Competition. These are amerrrzing photos of microscopic subjects. Totally amahhzning. See them here.


June 9, 2012

Bake sale for NASA

More than a dozen universities are holding bake sales for NASA. The aim is to raise awareness, not money.

To me, NASA is a bit like a public library: No matter what, you want your town and your country to visibly declare their commitment to the value of human curiosity.



In other science news, attempts to replicate the faster-than-light neutrino results have confirmed that the spunky little buggers obey the universal traffic limit.

The system works! Even if you don’t screw in the optical cables tightly.


May 16, 2012

[2b2k] Peter Galison on The Collective Author

Harvard professor Peter Galison (he’s actually one of only 24 University Professors, a special honor) is opening a conference on author attribution in the digital age.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

He points to the vast increase in the number of physicists involved in an experiment, some of which have 3,000 people working on them. This transforms the role of experiments and how physicists relate to one another. “When CERN says in a couple of months that ‘We’ve found the Higgs particle,’ who is the we?”

He says that there has been a “pseudo-I”: a group that functions under the name of a single author. A generation or two ago this was common: the Alvarez Group, the Thorndike Group, etc. This is like when the works of a Rembrandt would in fact come from his studio. But there’s also “the collective group”: a group that functions without that name — often without even a single lead institution. This requires “complex internal regulation, governance, collective responsibility, and novel ways of attributing credit.” So, over the past decades physicists have been asked very fundamental questions about how they want to govern. Those 3,000 people have never all met one another; they’re not even in the same country. So, do they stop the accelerator because of the results from one group? Or, when CERN scientists found data suggesting faster-than-light neutrinos, the team was not unanimous about publishing those results. When the results were reversed, the entire team suffered some reputational damage. “So, the stakes are very high about how these governance, decision-making, and attribution questions get decided.”

He looks back to the 1960s. There were large bubble chambers kept above their boiling point but under pressure. You’d get beautiful images of particles, and these were the iconic images of physics. But these experiments were at a new, industrial scale for physics. After an explosion in 1965, the labs were put under industrial rules and processes. In 1967 Alan Thorndike at Brookhaven responded to these changes in the ethos of being an experimenter. Rarely is the experimenter a single individual, he said. He is a composite. “He might be 3, 5 or 8, possibly as many as 10, 20, or more.” He “may be spread around geographically… He may be ephemeral… He is a social phenomenon, varied in form and impossible to define precisely.” But he certainly is not (said Thorndike) a “cloistered scientist working in isolation at his laboratory bench.” The thing that is thinking is a “composite entity.” The tasks are not partitioned in simple ways, the way contractors working on a house partition their tasks. Thorndike is talking about tasks in which “the cognition itself does not occur in one skull.”

By 1983, physicists were colliding beams that moved particles out in all directions. Bigger equipment. More particles. More complexity. Now instead of a dozen or two participants, you have 150 or so. Questions arose about what an author is. In July 1988 one of the Stanford collaborators wrote an internal memo saying that all collaborators ought to be listed as authors alphabetically since “our first priority should be the coherence of the group and the de facto recognition that contributions to a piece of physics are made by all collaborators in different ways.” They decided on a rule that avoided the nightmare of trying to give primacy to some. The memo continues: “For physics papers, all physicist members of the collaboration are authors. In addition, the first published paper should also include the engineers.” [Wolowitz! :)]

In 1990s rules of authorship got more specific. He points to a particular list of seven very specific rules. “It was a big battle.”

In 1997, when you get to projects as large as ATLAS at CERN, the author count goes up to 2,500. This makes it “harder to evaluate the individual contribution when comparing with other fields in science,” according to a report at the time. With experiments of this size, says Peter, the experimenters are the best source of the review of the results.

Conundrums of Authorship: It’s a community and you’re trying to keep it coherent. “You have to keep things from falling apart” along institutional or disciplinary grounds. E.g., the weak neutral current experiment. The collaborators were divided about whether there were such things. They were mockingly accused of proposing “alternating weak neutral currents,” and this cost them reputationally. But trying to make these experiments speak in one voice can come at a cost. E.g., suppose 1,900 collaborators want to publish, but 600 don’t. If they speak in one voice, that suppresses dissent.

Then there’s also the question of the “identity of physicists while crediting mechanical, cryogenic, electrical engineers, and how to balance with builders and analysts.” E.g., analysts have sometimes claimed credit because they were the first ones to perceive the truth in the data, while others say that the analysts were just dealing with the “icing.”

Peter ends by saying: These questions go down to our understanding of the very nature of science.

Q: What’s the answer?
A: It’s different in different sciences, each of which has its own culture. Some of these cultures are still emerging. It will not be solved once and for all. We should use those cultures to see which parts of evaluation are done inside the culture and which depend on external review. As I said, in many cases the most serious review is done inside, where you have access to all the data, the backups, etc. Figuring out how to leverage those sorts of reviews could help to provide credit when it’s time to promote people. The question of credit between scientists and engineers/technicians has been debated for hundreds of years. I think we’ve begun to shed some of our class anxiety, i.e., the assumption that hand work is not equivalent to head work, etc. A few years ago, some physicists would say that nanotech is engineering, not science; you don’t hear that so much any more. When a Nobel prize in 1983 went to an engineer, it was a harbinger.

Q: Have other scientists learned from the high energy physicists about this?
A: Yes. There are different models. Some big science gets assimilated to a culture that is more like a big engineering process. E.g., there’s no public awareness of the lead designers of the 747 we’ve been flying for 50 years, whereas we know the directors of Hollywood films. Authorship is something we decide. That the 747 has no author but The Hunger Games does was not decreed by Heaven. Big plasma physics is treated more like industry, in part because it’s conducted within a secure facility. The astronomers have done many admirable things. I was on a prize committee that gave the award to a group because it was a collective activity. Astronomers have been great about distributing data. There’s Galaxy Zoo, and some “zookeepers” have been credited as authors on some papers.

Q: The credits are getting longer on movies as the specializations grow. It’s a similar problem. They tell you who did what in each category. In high energy physics, scientists see becoming too specialized as a bad thing.
A: In the movies, many different roles are recognized. And there are questions about the distribution of profits, which is not so analogous to physics experiments. Physicists want to think of themselves as physicists, not as sub-specialists. If you are identified as, for example, the person who wrote the Monte Carlo, people may think that you’re “just a coder” and write you off. The first Ph.D. in physics submitted at Harvard was on the Bohr model; the student was told that it was fine but he had to do an experiment, because theoretical physics might be great for Europe but not for the US. It’s naive to think that physicists are da Vincis who do everything; the idea of what counts as being a physicist is changing, and that’s a good thing.

[I wanted to ask whether, if (as may not be true) the Internet leads to more of the internal work being done visibly in public, this might change some of the governance, since it would be clearer that there is diversity and disagreement within a healthy network of experimenters. Anyway, that was a great talk.]


May 14, 2012

Goodies from Wolfram

Some wonderfully interesting stuff from Stephen Wolfram today.

Here’s his Reddit IAMA.

A post about what’s become of A New Kind of Science in the past ten years. And a part two, about reactions to NKS.

And here’s a post from a couple of months ago that I missed that is, well, amazing. All I’ll say is that it’s about “personal analytics.”


April 29, 2012

[2b2k] Pyramid-shaped publishing model results in cheating on science?

Carl Zimmer has a fascinating article in the NYTimes, which is worth 1/10th of your NYT allotment. (Thank you for ironically illustrating the problem with trying to maintain knowledge as a scarce resource, NYT!)

Carl reports on what may be a growing phenomenon (or perhaps, as the article suggests, the bugs of the old system may just now be more apparent) of scientists fudging results in order to get published in the top journals. From my perspective, the article provides yet another illustration of how the old paper-based strictures on scientific knowledge, caused by the scarcity of publishing outlets, result not only in a reduction of the flow of knowledge but also in a degradation of its quality.

Unfortunately, the availability of online journals (many of which are peer-reviewed) may not reduce the problem much, even though they open up the ol’ knowledge nozzle to 11 on the firehose dial. As we saw when the blogosphere first emerged, there is something like a natural tendency for networked ecosystems to create hubs with a lot of traffic, along with a very long tail. So, even with higher-capacity hubs, there may still be some pressure to fudge results in order to get noticed by those hubs, especially since tenure decisions continue to place such high value on a narrow understanding of “impact.”
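
That hub-forming tendency falls out of even the simplest rich-get-richer models. Here is a minimal sketch of a generic preferential-attachment toy (nothing specific to journals; the numbers are arbitrary), in which each new item cites an existing one with probability proportional to the attention it has already received:

```python
import random

# Toy preferential-attachment model: each new item "cites" one existing item,
# chosen with probability proportional to its current citation count.
# Hubs plus a long tail emerge even though no item is intrinsically "better".

random.seed(1)
citations = [1, 1]  # start with two items, one citation each

for _ in range(10_000):
    # pick an existing item, weighted by how often it has been cited so far
    target = random.choices(range(len(citations)), weights=citations)[0]
    citations[target] += 1
    citations.append(1)  # the newcomer starts out in the tail

citations.sort(reverse=True)
top10_share = sum(citations[:10]) / sum(citations)
print(f"top 10 of {len(citations)} items hold {top10_share:.1%} of the citations")
print("typical tail counts:", citations[-5:])
```

The particular model doesn’t matter; the point is that a wide-open publishing ecosystem can still concentrate attention in a few hubs.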

But: 1. With a larger aperture, there may be less pressure. 2. When readers are also commentators and raters, bad science may be uncovered faster and more often. Or so we can hope.

(There are the very beginnings of a Reddit discussion of Carl’s article here.)


April 23, 2012

[2b2k] Structure of Scientific Revolutions, 50 years later

The Chronicle of Higher Ed asked me to write a perspective on Thomas Kuhn’s The Structure of Scientific Revolutions, since this is the 50th anniversary of its publication. It’s now posted.


April 22, 2012
