Joho the Blog: research Archives

May 7, 2015

Facebook, filtering, polarization, and a flawed study?

Facebook researchers have published an article in Science, certainly one of the most prestigious peer-reviewed journals. It concludes (roughly) that Facebook’s filtering out of news from sources whose politics you disagree with does not cause as much polarization as some have thought.

Unfortunately, a set of researchers clustered around the Berkman Center think that the study’s methodology is deeply flawed, and that its conclusions badly misstate the actual findings. Here are three responses well worth reading:

Also see Eli Pariser’s response.


October 25, 2013

[dplafest] Advanced Research and the DPLA

I’m at a DPLAfest session with Jean Bauer (Digital Humanities Librarian, Brown U.), Jim Egan (English Prof., Brown), Kathryn Shaughnessy (Assoc. Prof., University Libraries, St. John’s U.), and David Smith (Asst. Prof. of CS, Northeastern).

Rather than liveblogging in this blog, I contributed to the collaboratively-written Google Doc designated for the session notes. It’s here.


September 18, 2009

[berkman] Transforming Scholarly Communication

Lee Dirks [site], Director of Education and Scholarly Communication at Microsoft External Research, is giving a Berkman-sponsored talk on “Transforming Scholarly Communications.” His group works with various research groups “to develop functionality that we think would benefit the community overall,” with Microsoft possibly as a facilitator. (Alex Wade from his group is also here.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

He begins by noting the “data deluge.” But computing is stepping up to the problem: massive data sets, the evolution of multicore, and the power of the cloud. We’ll need all that (Lee says) because the workflow for processing all the new info we’re gathering hasn’t kept up with the amount we’re taking in via sensor networks, global databases, laboratory instruments, desktops, etc. He points to the Life Under Your Feet project at Johns Hopkins as an example. They have 200 wireless computers, each with 10 sensors, monitoring air and soil temperature and moisture, and much more. (Microsoft funds it.) Lee recommends Joe Hellerstein’s blog if you’re interested in “the commoditization of massive data analysis.” We’re at the very early stages of this, Lee says. For e-scientists and e-researchers, there’s just too much: too much data, too much workflow, too much “opportunity.”


We need to move upstream in the research lifecycle: 1. collect data and do research, 2. author it, 3. publish, and then 4. store and archive it. That store then feeds future research and analysis. Lee says this four-step lifecycle needs collaboration and discovery. Libraries and archives spend most of their time in stage 4, but they ought to address the problems much earlier in the lifecycle. The most advanced thinkers are working on these earlier stages.


“The trick there is integration.” Some domains are quite proprietary about their data, which makes it problematic to establish data and curation standards so that the data can move from system to system. From Microsoft’s perspective, the question is how they can move from static summaries to much richer information vehicles. Why can’t a research report be a container that facilitates reproducible science? It should let you run its methodology against its data set, alter the data and see the results, and then share it. Collaborate in real time with other researchers. Capture reputation and influence. Dynamic documents. [cf. Interleaf Active Documents, circa 1990. The dream still lives!]


On the commercial side, Elsevier has been running an “Article of the Future” competition. Other examples: PLoS Currents: Influenza. Nature Precedings. Google Wave. Mendeley (“iTunes for academic papers”). These are “chinks in the armor of the peer review system.”


Big changes, Lee says. We’ll see more open access and new economic models, particularly adding services on top of content. We’ll see a world in which data is increasingly easily sharable. E.g., the Sloan Digital Sky Survey is a prototype in data publishing: 350M web hits in 6 years, 930K distinct users, 10K astronomers, and 100B rows of data delivered. Likewise GalaxyZoo.org, at which the public can classify galaxies and occasionally discover a new object or two.


Lee points to challenges with data sharing: integrating it, annotating, maintaining provenance and quality, exporting in agreed formats, security. These issues have stopped some from sharing data, and have forced some communities to remain proprietary. “The people who can address these problems in creative ways” will be market leaders moving forward.


Lee points to some existing sharing and analysis services. Swivel, IBM’s Many Eyes, Google’s Gapminder, Freebase, CSA’s Illustra…


The business models are shifting. Publishers are now thinking about data-sharing services. IBM and Red Hat provide an interesting model: giving the code away but selling services. Repositories will contain not only the full-text versions of research papers but also “gray” literature such as technical reports and theses, as well as real-time streaming data, images, and software. We need enhanced interoperability protocols.


E.g., Data.gov provides a searchable data catalog that offers access both to the raw data and through various tools. Lee also likes WorldWideScience.org, “a global science gateway” to international scientific databases. Sixty to seventy countries are pooling their scientific data and providing federated search.


Lee believes that semantic computing will provide fantastic results, although it may take a while. He points to Cameron Neylon’s discussion of the need to generate lab report feeds. (Lee says the Semantic Web is just one of the tools that could be used for semantics-based computing.) So, how do we take advantage of this? Recommender systems, as at Last.fm and Amazon. Connotea and BioMedCentral’s Faculty of 1000 are early examples of this. [LATER: Steve Pog’s comment below says Faculty of 1000 is not owned by BioMedCentral.] Lee looks forward to the automatic correlation of scientific data and the “smart composition of services and functionality,” in which the computers do the connecting. And we’re going to need the cloud to do this sort of thing, both for the computing power and for the range of services that can be brought to bear on the distributed collection of data.


Lee spends some time talking about the cloud. Among other points, he points to SciVee and Viddler as interesting examples. Also SmugMug, a photo aggregator that owns none of its own infrastructure. Also Slideshare and Google Docs. But these aren’t quite what researchers need, which is an opportunity. Also interesting: NSF DataNet grants.


When talking about preservation and provenance, Lee cites DuraSpace and its project, DuraCloud. It’s a cross-repository space with services added. Institutions pay for the service.


Lee ends by pointing to John Wilbanks’ concern about the need for a legal and policy infrastructure that enables and encourages sharing. Lee says that at the end of the day it’s not about software but about providing incentives and rewards to get people to participate.


Q: How soon will this happen?
A: We can’t predict which domains will arise and which ones people will take to.


Q: What might bubble up from the consumer sector?
A: It’s an amazing space to watch. There are lots of good examples already.


Q: [me] It’s great to have you proselytizing outside. But as an internal advocate inside Microsoft, what does Microsoft still have to do, and what’s the pushback?
A: We’ve built 6-8 add-ins for Word for semantic markup, scholarly writing, and consumption of ontologies. A repository platform. An open source foundation separate from Microsoft, contributing to the Linux kernel, etc.

Q: You’d be interested in Dataverse.org.
A: Yes, it sounds like it.


Q: Data is agnostic, but articles aren’t…
A: We’re trying to figure out how to embed and link. But we’re also thinking about how you do it without the old containers, on the Web, in Google Wave, etc.
Q: Are you providing a way to ID relationships?
A: In part. For people using their ordinary tools (e.g., Word), we’re providing ways to import ontologies, share them with the repository or publisher, etc.


Q: How’s auto-tagging coming? The automatic creation of semantically correct output?
A: We’re working on this. A group at Oxford doing cancer research allows researchers to semantically annotate within Excel, so that the spreadsheet points to an ontology that specifies the units, etc. Fluxnet.org is an example of collaborative curation within a single framework.
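[To make that idea concrete, here’s a rough sketch of my own, not the Oxford group’s actual tooling, of what column-level semantic annotation could look like: each column in a data table carries a pointer to an ontology term and a unit, so software reading the data can interpret it unambiguously. The column names, term URIs, and units below are invented for illustration.]

    from dataclasses import dataclass

    @dataclass
    class ColumnAnnotation:
        ontology_term: str  # URI of the concept this column measures (hypothetical)
        unit: str           # unit of measurement, ideally itself an ontology term

    # Hypothetical annotations for two columns of a sensor spreadsheet.
    annotations = {
        "soil_temp":     ColumnAnnotation("http://example.org/onto#SoilTemperature", "degC"),
        "soil_moisture": ColumnAnnotation("http://example.org/onto#SoilMoisture", "percent"),
    }

    # A single data row; with the annotations, any tool can report each value
    # along with its unit and the concept it refers to.
    row = {"soil_temp": 18.4, "soil_moisture": 23.0}
    for column, value in row.items():
        meta = annotations[column]
        print(f"{column} = {value} {meta.unit} ({meta.ontology_term})")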


Q: Things are blurring. Traditionally, libraries collect, select, and preserve scholarly info. What do you think the role of the library will be?
A: I was an academic librarian. In my opinion, the safe world of collecting library journals has been done. We know how to do it. The problem these days is data curation, providing services, working with publishers.
Q: It still takes a lot of money…
A: Definitely. But the improvements are incremental. The bigger advances come further up the stream.

Q: Some cultures will resist sharing…
A: Yes. It’ll vary from domain to domain, and within domains. In some cases we’ll have to wait a generation.


Q: What skills would you give a young librarian?
A: I don’t have a pat answer for you. But, a service orientation would help, building services on top of the data, for example. Multi-disciplinary partnerships.


Q: You’re putting more info online. Are you seeing the benefit of that?
A: Most researchers already have Microsoft software, so we’re not putting the info up in order to sell more. We’re trying to make sure researchers know what’s there for them.


August 5, 2009

Media Cloud unclouds media

The NY Times has a terrific article about Media Cloud, a Berkman Center project (hats off to Ethan Zuckerman, Yochai Benkler, Hal Roberts, among others) that will let researchers track the actual movement of ideas through the mediasphere and blogosphere.

Data about concepts! What a concept!



February 21, 2008

[cyberinf] Cyber-enabled knowledge

Peter Freeman of the Washington Advisory Group introduces the first panel, on “Cyber-Enabled Knowledge”, by asking how the infrastructure can support the university’s essence as the creator, transmitter and preserver of knowledge. [As always, I’m paraphrasing, typing quickly, and undoubtedly getting things wrong.]

Guru Parulkar of Stanford says that we must build the cyberinf on the right foundation. That’s one that enables many layers. It requires supporting the end-to-end principle because that facilitates innovation. We should make the infrastructure programmable so that providers can give users empowering services. [Seems non end-to-end to me. But I think he’s talking about university infrastructure providers enabling experimental services, not having, say, Comcast build services.] It’s not enough to deploy vendors’ infrastructure on campuses. The CIO and researchers ought to get together on this.

Simon Porter, eScholarship Research Center, U of Melbourne, wonders what the world looks like when we can find out about all the research going on in our university. We could manage portfolios of research under an overall university agenda. [Hmm. Possibly scary.] They could develop a data research plan. The university could plan its storage needs. The way the research is represented to the public will change: it won’t be left to the researchers to be the lead communicators about the project. There will be a single portal — like Amazon or eBay, perhaps — where you can find out about research. We will be able to evaluate research by the effect it has on other projects. Researchers will be able to cooperate more, especially if there are standards. Crystallographers have software that lets people annotate online models; this is promising.

Q: Simon and Guru both pointed to gaps between network engineering folks and the CIO. What’s blocking progress here?
Simon: It’s not a natural progression. You have to take a leap.
Guru: The infrastructure is so complex, there’s a reluctance to “muck it up.” But at Stanford there’s a lot of openness.
Peter: Market forces will bring about the healing of the gap.

Q: The infrastructure didn’t arrive on a gold cloud all at once. It’s built on standards. In a recent survey, only 30 universities (G7) had courses on standards. Standards aren’t taught or shared at universities.
Guru: I disagree with you completely. Universities should be doing research much before people think about standards.
A [we’ve been asked not to identify speakers without asking permission :( ]: The U’s are incredibly creative now. I believe the next thing will come primarily out of U’s. Things bubble up, and the standards follow after that.
Simon: Standards are fundamentally important for development of cyberinf.

Q: How do we change the research processes to take advantage of the new cyberinfrastructure? This is not a decision for the CIOs but for the college presidents, etc.
Peter: By acclamation, we agree.


Q: [me] Knowledge currently reflects the old infrastructure: You get published or not. Knowledge is binary, fenced in and managed. How will the new infrastructure change the nature of knowledge itself?
Simon: Especially with shared standards, research can be more open.
Peter: Simon has proposed a specific way to make available info about current research projects. That’s key to enabling cooperation and the development of standards.
Guru: The cyberinfs we deploy on our campuses should allow experimentation in networking, cooperation, etc. That type of infrastructure doesn’t exist because we haven’t been asking for that level of programmability and flexibility.

Q (John Wilbanks): When we try to move from network standards to knowledge standards, we get into semantics. It’s hard to have enduring semantics because they change as research happens. We could have project-based standards and allow people to share what they mean about something, not just sharing the content. So we have to change the idea of standards. [Go John!]

Q: Is it the U’s role to fund research into infrastructure? You can’t make a case to the provost unless you show some dollars coming from somewhere.
Guru: Yes, someone has to pay for it. Maybe vendor partnerships will help.
Simon: If it’s strategically important to the U, the U ought to do it.

A: I’m in bioinformatics. BTW, my U doesn’t teach any of the standards. Anyway, industry folks tell us we’re training students to be like you, not to be what we in industry need. E.g., not team players. How can we make more industry-academic partnerships?

A: There is something big going on that we don’t understand. We’re good at big networks, etc., but we don’t understand how to solve problems for small groups of collaborating domain scientists. Universities don’t just store, transfer and develop knowledge…

Q: I direct one of the portals where project-based info can be shared. People keep asking what the incentive is for professors. Right now the reward structures are not geared towards publishing on the Internet. What can be done to fix the incentive system?
Simon: Making info available is always going to be a chore to researchers. But Facebook makes it possible for marketers to find info based on participation by users. We need something equivalent for researchers, surfacing info about projects without requiring additional work by the researchers.
Guru: Is it a problem of aggregation? People are very eager to make their work public. Where is the disconnect?
Peter: It largely depends on the field.

A: I develop provenance metadata in my field. There are problems. Ontologies don’t exist yet. They require expertise in RDF as well as domain expertise, and that’s hard to find in the same person. The ontologies have to be developed internationally.

A: Maybe there are some Web 1.0 opportunities that haven’t been taken advantage of yet. E.g., we could make available to any NSF researcher a Web page at the NSF site. That would also provide some authentication.
Simon: It’s not a web page. Every researcher needs a persistent identifier. [researcher or project??]

A: Standards that have followed research, experimentation, and productization have been the most successful. E.g., the Internet, LANs, the Web. The most spectacular standards failure was OSI in the 1980s, because they did it before they had the software and the experiments.

A: At my [hardware infrastructure] company, we do a lot of rolling out of products internally that are not quite ready. We are probably more willing to risk failure than universities are. And we are seeing more demand for programmable infrastructure hardware.

A: I urge us to adopt a more expansive, active, and empirically grounded notion of infrastructure. We shouldn’t think of infrastructure as being primarily hardware. 1. The layer model encourages thinking of the hardware as the “real” stuff. 2. We need to be teaching our students the practices by which interoperability is made possible. The standards in ten years will be different, but the tensions and dynamics will stay roughly the same. 3. We should learn from previous attempts to build infrastructure.

A: Infrastructure is extremely important, but it exists in a multicultural environment that we should bear in mind. Second, it all comes down to open access.


February 12, 2008

Harvard to vote on open access proposal

The NY Times reports that Harvard’s Faculty of Arts and Sciences will vote next week on a proposal that would require faculty to deposit a copy of their articles in an open access Harvard repository even as they submit those articles to academic journals.

I like this idea a lot. I only wish it went further. Faculty members will be allowed to opt out of the requirement pretty much at will (as I understand it), which could vitiate it: if a prestigious journal accepts an article only on the condition that it not be made openly available, faculty members may well decide it’s more important for their careers to be published in the journal. I would prefer to see the Harvard proposal paired with some form of official encouragement to tenure committees to look favorably upon faculty members who make their work widely and freely available.

Nothing is without drawbacks. A well-run, reliable, thorough peer-review system costs money. But there’s also an expense to funding peer review by limiting access to the work that makes it through the process. Likewise, the current publication system directs our attention efficiently, but there’s a price to that very efficiency: innovation can arise from what looked like inefficiencies. There’s value in the long tail of research.

If we were today building a system for evaluating scholarly research and for making it maximally available, we would not build anything like the current paper-based system. Well, we are building such a system. The Harvard proposal will, in my opinion, help.

(Disclosure: I’m a fellow at the Berkman Center, which is part of the Law School, not the Faculty of Arts and Sciences, and I’m not a faculty member in any case. Stuart Shieber, one of the sponsors of the proposal, is a director of the Center.)
