Joho the Blog: everythingismisc Archives - Page 2 of 2

April 16, 2013

[misc][2b2k] Making Twitter better for disasters

I had both CNN and Twitter on yesterday all afternoon, looking for news about the Boston Marathon bombings. I have not done a rigorous analysis (nor will I, nor have I ever), but it felt to me that Twitter put forward more and more varied claims about the situation, and reacted faster to misstatements. CNN plodded along, but didn’t feel more reliable overall. This seems predictable given the unfiltered (or post-filtered) nature of Twitter.

But Twitter also ran into some scaling problems for me yesterday. I follow about 500 people on Twitter, which gives my stream a pace and variety that I find helpful on a normal day. But yesterday afternoon, the stream roared by, and approached filter failure. A couple of changes would help:

First, let us sort by most retweeted. When I’m in my “home stream,” let me choose a frequency of tweets so that the scrolling doesn’t become unwatchable; use the frequency to determine the threshold for the number of retweets required. (Alternatively: simply highlight highly re-tweeted tweets.)

Second, let us mute based on hashtag or by user. Some Twitter cascades I just don’t care about. For example, I don’t want to hear play-by-plays of the World Series, and I know that many of the people who follow me get seriously annoyed when I suddenly am tweeting twice a minute during a presidential debate. So let us temporarily suppress tweet streams we don’t care about.
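Here is a rough sketch of how those two suggestions might fit together. It is not how Twitter works; the `Tweet` fields, the function names, and the idea of computing the threshold over a fixed window of recent tweets are all assumptions made for illustration. Muted users and hashtags are dropped first, and then the retweet threshold is set just high enough that the surviving tweets fit the pace the reader asked for.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical minimal tweet record; real tweets carry far more fields.
    user: str
    text: str
    retweets: int
    hashtags: frozenset = frozenset()


def retweet_threshold(tweets, window_minutes, max_per_minute):
    """Lowest retweet count that keeps the window at or under the chosen pace."""
    budget = int(window_minutes * max_per_minute)
    if len(tweets) <= budget:
        return 0  # the stream is slow enough already: show everything
    counts = sorted((t.retweets for t in tweets), reverse=True)
    # counts[budget] belongs to the first tweet that does not fit, so
    # requiring strictly more retweets than that leaves at most `budget` tweets.
    return counts[budget] + 1


def filter_stream(tweets, window_minutes, max_per_minute,
                  muted_users=(), muted_hashtags=()):
    """Drop muted users and hashtags, then keep only the most retweeted tweets."""
    muted_users = set(muted_users)
    muted_hashtags = {h.lower() for h in muted_hashtags}
    kept = [
        t for t in tweets
        if t.user not in muted_users
        and not ({h.lower() for h in t.hashtags} & muted_hashtags)
    ]
    threshold = retweet_threshold(kept, window_minutes, max_per_minute)
    return [t for t in kept if t.retweets >= threshold]
```

Ties at the cutoff are excluded rather than included, so a busy window can show slightly fewer tweets than the budget allows. Highlighting tweets above the threshold instead of hiding the rest would use the same calculation.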

It is a lesson of the Web that as services scale up, they need to provide more and more ways of filtering. Twitter had “follow” as an initial filter, and users then came up with hashtags as a second filter. It’s time for a new round as Twitter becomes an essential part of our news ecosystem.

July 19, 2012

[2b2k][eim]Digital curation

I’m at the “Symposium on Digital Curation in the Era of Big Data” held by the Board on Research Data and Information of the National Research Council. These liveblog notes cover (in some sense — I missed some folks, and have done my usual spotty job on the rest) the morning session. (I’m keynoting in the middle of it.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

Alan Blatecky [pdf] from the National Science Foundation says science is being transformed by Big Data. [I can’t see his slides from the panel at front.] He points to the increase in the volume of data, but we haven’t paid enough attention to the longevity of the data. And, he says, some data is centralized (LHC) and some is distributed (genomics). And, our networks are unable to transport large amounts of data [see my post], making where the data is located quite significant. NSF is looking at creating data infrastructures. “Not one big cloud in the sky,” he says. Access, storage, services — how do we make that happen and keep it leading edge? We also need a “suite of policies” suitable for this new environment.

He closes by talking about the Data Web Forum, a new initiative to look at a “top-down governance approach.” He points positively to the IETF’s “rough consensus and running code.” “How do we start doing that in the data world?” How do we get a balanced representation of the community? This is not a regulatory group; everything will be open source, and progress will be through rough consensus. They’ve got some funding from gov’t groups around the world. (Check for more info.)

Now Josh Greenberg from the Sloan Foundation. He points to the opportunities presented by aggregated Big Data: the effects on social science, on libraries, etc. But the tools aren’t keeping up with the computational power, so researchers are spending too much time mastering tools, plus it can make reproducibility and provenance trails difficult. Sloan is funding some technical approaches to increasing the trustworthiness of data, including in publishing. But Sloan knows that this is not purely a technical problem. Everyone is talking about data science. Data scientist defined: Someone who knows more about stats than most computer scientists, and can write better code than typical statisticians :) But data science needs to better understand stewardship and curation. What should the workforce look like so that the data-based research holds up over time? The same concerns apply to business decisions based on data analytics. The norms that have served librarians and archivists of physical collections now apply to the world of data. We should be looking at these issues across the boundaries of academics, science, and business. E.g., economics work now rests on data from Web businesses, US Census, etc.

[I couldn’t liveblog the next two — Michael and Myron — because I had to leave my computer on the podium. The following are poor summaries.]

Michael Stebbins, Assistant Director for Biotechnology in the Office of Science and Technology Policy in the White House, talked about the Administration’s enthusiasm for Big Data and open access. It’s great to see this degree of enthusiasm coming directly from the White House, especially since Michael is a scientist and has worked for mainstream science publishers.

Myron Gutmann, Ass’t Dir of the National Science Foundation, likewise expressed commitment to open access, and said that there would be an announcement in Spring 2013 that in some ways will respond to the recent UK and EC policies requiring the open publishing of publicly funded research.

After the break, there’s a panel.

Anne Kenney, Dir. of Cornell U. Library, talks about the new emphasis on digital curation and preservation. She traces this back at Cornell to 2006 when an E-Science task force was established. She thinks we now need to focus on e-research, not just e-science. She points to Walters and Skinner’s “New Roles for New Times: Digital Curation for Preservation.” When it comes to e-research, Anne points to the need for metadata stabilization, harmonizing applications, and collaboration in virtual communities. Within the humanities, she sees more focus on curation, the effect of the teaching environment, and more of a focus on scholarly products (as opposed to the focus on scholarly process, as in the scientific environment).

She points to Youngseek Kim et al. “Education for eScience Professionals”: digital curators need not just subject domain expertise but also project management and data expertise. [There’s lots of info on her slides, which I cannot begin to capture.] The report suggests an increasing focus on people-focused skills: project management, bringing communities together.

She very briefly talks about Mary Auckland’s “Re-Skilling for Research” and Williford and Henry, “One Culture: Computationally Intensive Research in the Humanities and Sciences.”

So, what are research libraries doing with this information? The Association of Research Libraries has a job announcements database. And Tito Sierra did a study last year analyzing 2011 job postings. He looked at 444 job descriptions. 7.4% of the jobs were “newly created or new to the organization.” New mgt level positions were significantly higher, while subject specialist jobs were under-represented.

Anne went through Tito’s data and found 13.5% have “digital” in the title. There were more digital humanities positions than e-science. She posts a list of the new titles jobs are being given, and they’re digilicious. 55% of those positions call for a library science degree.

Anne concludes: It’s a growth area, with responsibilities more clearly defined in the sciences. There’s growing interest in serving the digital humanists. “Digital curation” is not common in the qualifications nomenclature. MLS or MLIS is not the only path. There’s a lot of interest in post-doctoral positions.

Margarita Gregg of the National Oceanic and Atmospheric Administration begins by talking about challenges in the era of Big Data. They produce about 15 petabytes of data per year. It’s not just about Big Data, though. They are very concerned with data quality. They can’t preserve all versions of their datasets, and it’s important to keep track of the provenance of that data.

Margarita directs one of NOAA’s data centers that acquires, preserves, assembles, and provides access to marine data. They cannot preserve everything. They need multi-disciplinary people, and they need to figure out how to translate this data into products that people need. In terms of personnel, they need: Data miners, system architects, developers who can translate proprietary formats into open standards, and IP and Digital Rights Management experts so that credit can be given to the people generating the data. Over the next ten years, she sees computer science and information technology becoming the foundations of curation. There is no currently defined job called “digital curator” and that needs to be addressed.

Vicki Ferrini at the Lamont-Doherty Earth Observatory at Columbia University works on data management, metadata, discovery tools, educational materials, best practice guidelines for optimizing acquisition, and more. She points to the increased communication between data consumers and producers.

As data producers, the goal is scientific discovery: data acquisition, reduction, assembly, visualization, integration, and interpretation. And then you have to document the data (= metadata).

Data consumers: They want data discoverability and access. Increasingly they are concerned with the metadata.

The goal of data providers is to provide access, preservation and reuse. They care about data formats, metadata standards, interoperability, the diverse needs of users. [I’ve abbreviated all these lists because I can’t type fast enough.]

At the intersection of these three domains is the data scientist. She refers to this as the “data stewardship continuum” since it spans all three. A data scientist needs to understand the entire life cycle, have domain experience, and have technical knowledge about data systems. “Metadata is key to all of this.” Skills: communication and organization, understanding the cultural aspects of the user communities, people and project management, and a balance between micro- and macro perspectives.

Challenges: Hard to find the right balance between technical skills and content knowledge. Also, data producers are slow to join the digital era. Also, it’s hard to keep up with the tech.

Andy Maltz, Dir. of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences. AMPAS is about arts and sciences, he says, not about The Business.

The Science and Technology Council was formed in 2005. They have lots of data they preserve. They’re trying to build the pipeline for next-generation movie technologists, but they’re falling behind, so they have an internship program and a curriculum initiative. He recommends we read their study The Digital Dilemma. It says that there’s no digital solution that meets film’s requirement to be archived for 100 years at a low cost. It costs $400/yr to archive a film master vs $11,000 to archive a digital master (as of 2006) because of labor costs. [Did I get that right?] He says collaboration is key.

In January they released The Digital Dilemma 2. It found that independent filmmakers, documentarians, and nonprofit audiovisual archives are loosely coupled, widely dispersed communities. This makes collaboration more difficult. The efforts are also poorly funded, and people often lack technical skills. The report recommends the next gen of digital archivists be digital natives. But the real issue is technology obsolescence. “Technology providers must take archival lifetimes into account.” Also system engineers should be taught to consider this.

He highly recommends the Library of Congress’ “The State of Recorded Sound Preservation in the United States,” which rings an alarm bell. He hopes there will be more doctoral work on these issues.

Among his controversial proposals: Require higher math scores for MLS/MLIS students since they tend to score lower than average on that. Also, he says that the new generation of content creators have no curatorial awareness. Executives and managers need to know that this is a core business function.

Demand side data points: 400 movies/year at 2PB/movie. CNN has 1.5M archived assets, and generates 2,500 new archive objects/wk. YouTube: 72 hours of video uploaded every minute.


  • Show business is a business.

  • Need does not necessarily create demand.

  • The nonprofit AV archive community is poorly organized.

  • Next gen needs to be digital natives with strong math and sci skills.

  • The next gen of executive leaders needs to understand the importance of this.

  • Digital curation and long-term archiving need a business case.


Q: How about linking the monetary value of the metadata to the metadata? That would encourage the generation of metadata.

Q: Weinberger paints a picture of a flexible world of flowing data, and now we’re back in the academic, scientific world where you want good data that lasts. I’m torn.

A: Margarita: We need to look at how that data are being used. Maybe in some circumstances the quality of the data doesn’t matter. But there are other instances where you’re looking for the highest quality data.

A: [audience] In my industry, one person’s outtakes are another person’s director cuts.

A: Anne: In the library world, we say if a little metadata would be great, a lot of it would be great. We need to step away from trying to capture the most to capturing the most useful (since we can’t capture the most). And how do you produce data in a way that’s opened up to future users, as well as being useful for its primary consumers? It’s a very interesting balance that needs to be played. Maybe short-term need is a higher thing and long-term is lower.

A: Vicki: The scientists I work with use discrete data sets, spreadsheets, etc. As we get along we’ll have new ways to check the quality of datasets so we can use the messy data as well.

Q: Citizen curation? E.g., a lot of antiques are curated by being put into people’s attics…Not sure what that might imply as model. Two parallel models?

A: Margarita: We’re going to need to engage anyone who’s interested. We need to incorporate citizen corporation.

Anne: That’s already underway where people have particular interests. E.g., Cornell’s Lab of Ornithology where birders contribute heavily.

Q: What one term will bring people info about this topic?

A: Vicki: There isn’t one term, which speaks to the linked data concept.

Q: How will you recruit people from all walks of life to have the skills you want?

A: Andy: We need to convince people way earlier in the educational process that STEM is cool.

A: Anne: We’ll have to rely to some degree on post-hire education.

Q: My shop produces and integrates lots of data. We need people with domain and computer science skills. They’re more likely to come out of the domains.

A: Vicki: As long as you’re willing to take the step across the boundary, it doesn’t matter which side you start from.

Q: 7 yrs ago in library school, I was told that you need to learn a little programming so that you understand it. I didn’t feel like I had to add a whole other profession on to the one I was studying.

July 4, 2012

[eim] XKCD goes miscellaneous

Except Randall Munroe thinks going miscellaneous means giving up, rather than embracing the new organizational possibilities of blah blah blah.

(I am, of course, an awestruck fan of XKCD.)

May 7, 2012

[everythingismisc] Scaling Japan

MetaFilter popped up a three-year-old post from Derek Sivers about how street addresses work in Japan. The system does a background-foreground duck-rabbit Gestalt flip on Western addressing schemes. I’d already heard about it (book-larnin’, because I’ve never been to Japan), but the post got me thinking about how things scale up.

What we would identify by street address, the Japanese identify by house number within a block name. Within a block, the addresses are non-sequential, reflecting instead the order of construction.

I can’t remember where I first read about this (I’m pretty sure I wrote about it in Everything Is Miscellaneous), but it pointed out some of the assumptions and advantages of this system: it assumes local knowledge, confuses invaders, etc. But my reaction then was the same as when I read Derek’s post this morning: Yeah, but it doesn’t scale. Confusing invaders is a positive outcome of a failure to scale, but getting tourists lost is not. The math just doesn’t work: 4 streets intersected by 4 avenues creates 9 blocks, but add just 2 more streets and 2 more avenues and you’ve enclosed another 16 blocks. So, to navigate a large western city you have to know many many fewer streets and avenues than the number of existing blocks.
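For anyone who wants to check that arithmetic, here is a tiny sketch. It assumes an idealized grid in which every street crosses every avenue and a block is any cell enclosed on all four sides; the function names are made up for illustration.

```python
def blocks_enclosed(streets, avenues):
    """Blocks fully enclosed by a grid of parallel streets crossing parallel avenues."""
    # n parallel lines leave n - 1 gaps between them, so the grid
    # encloses (streets - 1) * (avenues - 1) cells.
    return max(streets - 1, 0) * max(avenues - 1, 0)


def names_to_learn(streets, avenues):
    """Names a visitor has to know in the street-and-avenue scheme."""
    return streets + avenues


for n in (4, 6, 100):
    print(n, "streets x", n, "avenues:",
          names_to_learn(n, n), "names,", blocks_enclosed(n, n), "blocks")
```

Going from 4 by 4 to 6 by 6 takes the block count from 9 to 25, which is the "another 16" above, while the number of names only goes from 8 to 12. At 100 by 100 it is 200 names against 9,801 blocks.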

But of course I’m wrong. Tokyo hasn’t fallen apart because there are too many blocks to memorize. Clearly the Japanese system does scale.

In part that’s because according to the Wikipedia article on it, blocks are themselves located within a nested set of named regions. So you can pop up the geographic hierarchy to a level where there are fewer entities in order to get a more general location, just as we do with towns, counties, states, countries, solar system, galaxy, the universe.

But even without that, the Japanese system scales in ways that peculiarly mirror how the Net scales. Computers have scaled information in the Western city way: bits are tucked into chunks of memory that have sequential addresses. (At least they did the last time I looked in 1987.) But the Internet moves packets to their destinations much the way a Japanese city’s inhabitants might move inquiring visitors along: You ask someone (who we will call Ms. Router) how to get to a particular place, and Ms. Router sends you in a general direction. After a while you ask another person. Bit by bit you get closer, without anyone having a map of the whole.
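The asking-your-way picture can be put into a few lines of code. This is a toy, not real Internet routing (actual routers match address prefixes and keep their tables current with routing protocols), and the place names and table below are invented. What it does show is that each stop knows only the next hop toward a destination, and the full route exists only after the walk is done.

```python
# Hypothetical local knowledge: each place knows only whom to send you to next.
NEXT_HOP = {
    "station": {"teahouse": "corner shop"},
    "corner shop": {"teahouse": "police box"},
    "police box": {"teahouse": "teahouse"},
}


def ask_the_way(start, destination, next_hop, max_asks=10):
    """Walk from start to destination by asking for directions at every stop."""
    here = start
    path = [here]
    while here != destination:
        # Give up after too many asks, much as a packet is dropped
        # once its hop limit runs out.
        if len(path) - 1 >= max_asks:
            raise RuntimeError("gave up after too many asks")
        directions = next_hop.get(here, {})
        if destination not in directions:
            raise LookupError(f"{here} cannot point toward {destination}")
        here = directions[destination]
        path.append(here)
    return path


print(ask_the_way("station", "teahouse", NEXT_HOP))
```

No entry in `NEXT_HOP` holds more than one step of the route, which is the sense in which nobody has a map of the whole.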

At the other end of the stack of abstraction, computers have access to such absurdly large amounts of information either locally or in the cloud (and here namespaces are helpful) that storing the block names and house numbers for all of Tokyo isn’t such a big deal. Point your mobile phone to Google Maps’ Tokyo map if you need proof. With enough memory, we do not need to scale physical addresses by using schemes that reduce them to streets and avenues. We can keep the arrangement random and just look stuff up. In the same way, we can stock our warehouses in a seemingly random order and rely on our computers to tell us where each item is; this has the advantage of letting us put the most requested items up front, or on the shelves that require humans to do the least bending or stretching.

So, I’m obviously wrong. The Japanese system does scale. It just doesn’t scale in the ways we used when memory spaces were relatively small.


March 26, 2012

Kew Gardens adopts Web principles for real-world wayfinding

In a paper, Natasha Waterson and Mike Saunders describe how Kew Botanical Gardens in England is adopting mobile technology to help visitors become “delightfully lost.” From the abstract:

In October 2010, Kew Gardens commissioned an in-depth study of visitors’ motivations and information needs around its 300-acre site, with the express aim that it should guide the development of new mobile apps. The work involved over 1,500 visitor-tracking observations, 350 mini-interviews, 200 detailed exit interviews, and 85 fulfilment maps; and gave Kew an incredibly useful insight into its visitors’ wants, needs, and resulting behaviours.

It turns out that most Kew visitors have social, emotional, and spiritual, rather than intellectual, motivations during their time here. They do not come hoping to find out more, and they don’t want or need to know precisely where they are all the time. In fact, they love the sense of unguided exploration and the serendipitous discoveries they make at Kew—they want to become “delightfully lost.”

But as I read the actual paper, I was repeatedly struck by how often one could swap “in the Gardens” for “on the Web.” The motivations, the cognitive space, the tools and techniques often mirrored the Web’s. Indeed, one could argue that our experience of the Web is affecting how we view wayfinding in the real world, and not just because the Kew project integrates the offline and online worlds via mobiles, QRcodes, etc. Rather, the sense of serendipity, the loose connections, the desire to be able to follow one’s interests, the expectation that one will always be able to get more information about something, and the desire to contribute back — this is a public space we’re building together — all feel webby. Indeed, the paper’s overall point is that architects of information spaces ought not pick a single motive for those spaces’ users, and that is one of the fundamental lessons of the newly miscellanized world.

(Hat-tip to Hanan Cohen for the link.)


June 24, 2011

Tagging the National Archives

The National Archives is going all tag-arrific on us:

The Online Public Access prototype (OPA) just got an exciting new feature — tagging! As you search the catalog, we now invite you to tag any archival description, as well as person and organization name records, with the keywords or labels that are meaningful to you. Our hope is that crowdsourcing tags will enhance the content of our online catalog and help you find the information you seek more quickly.

Nice! (Hat tip to Infodocket.)

April 3, 2011

Social tagging games ‘n research

The GiveALink link-sharing site has posted two games that are actually research studies.

The first game is GiveALink Slider, which the site says “is an interesting online tagging game in which you must annotate webpages with related tags and choose new webpages. You can accumulate points and win badges by accomplishing tasks and building links with other players.” They are giving iPods to the winners. It’s actually a study called “Social Annotations through Game Play” conducted by the Networks and Agents Network in the Center for Complex Networks and Systems Research of the Indiana University School of Informatics.

Here’s the description of the second game:

Great Minds Think Alike is a word association game that lets users build semantic concept networks and explore similarity relations.

Players form a chain of semantically related words, which comes from the GiveALink knowledge base. Users can browse through nine different social media, e.g. Flickr and Youtube, and earn points.

Words are geo-tagged, which helps to analyze the geographical distribution of terms. Players can also connect with other players via Facebook as suggested by the game.

Data from the game is collected by to make the game more fun, support other social tagging applications, and for study purposes.

No, I don’t actually understand how either game works, and I haven’t signed up for them because the first one is a study that I don’t want to commit to and the second requires an iPhone. But, the GiveALink service is interesting. It’s an open bookmark-sharing service that also feeds a research program. [Hat tip to Julianne Chatelain.]
