Joho the Blog » everythingismisc

March 21, 2014

Reading Emily Dickinson’s metadata

There’s a terrific article by Helen Vendler in the March 24, 2014 New Republic about what we can learn about Emily Dickinson by exploring her handwritten drafts. Helen is a Dickinson scholar of serious repute, and she finds revelatory significance in the words that were crossed out, replaced, or listed as alternatives, in the physical arrangement of the words on the page, etc. For example, Prof. Vendler points to the change of a line in “The Spirit”: “What customs hath the Air?” became “What function hath the Air?” She says that this change points to a more “abstract, unrevealing, even algebraic” understanding of “the future habitation of the spirit.”

Prof. Vendler’s source for many of the poems she points to is Emily Dickinson: The Gorgeous Nothings, by Marta Werner and Jen Bervin, the book she is reviewing. But she also points to the new online Dickinson collection from Amherst and Harvard. (The site was developed by the Berkman Center’s Geek Cave.)


Unfortunately, the New Republic article is not available online. I very much hope that it will be, since it provides such a useful way of reading the materials in the online Dickinson collection, which are themselves available under a Creative Commons license that enables non-commercial use without asking permission.


December 14, 2013

Are tags over-rated?

Jeff Atwood [twitter:codinghorror], a founder of Stack Overflow and Discourse.org — two of my favorite sites — is on a tear about tags. Here are his two tweets that started the discussion:

I am deeply ambivalent about tags as a panacea based on my experience with them at Stack Overflow/Exchange. Example: pic.twitter.com/AA3Y1NNCV9

Here’s a detweetified version of the four-part tweet I posted in reply:

Jeff’s right that tags are not a panacea, but who said they were? They’re a tool (frequently most useful when combined with an old-fashioned taxonomy), and if a tool’s not doing the job, then drop it. Or, better, fix it. Because tags are an abstract idea that exists only in particular implementations.

After all, one could with some plausibility claim that online discussions are the most overrated concept in the social media world. But still they have value. That indicates an opportunity to build a better discussion service. … which is exactly what Jeff did by building Discourse.org.

Finally, I do think it’s important — even while trying to put tags into a less over-heated perspective [do perspectives overheat??] — to remember that when first introduced in the early 2000s, tags represented an important break with an old and long tradition that used the authority to classify as a form of power. Even if tagging isn’t always useful and isn’t as widely applicable as some of us thought it would be, tagging has done the important work of telling us that we as individuals and as a loose collective now have a share of that power in our hands. That’s no small thing.


July 6, 2013

[misc][2b2k] Why ontologies make me nervous

A few days ago there was a Twitter back and forth between two people I deeply respect: Dan Brickley [twitter:danbri] and Ed Summers [twitter:edsu]. It started with Ed responding to a tweet about a brief podcast I did with Kevin Ford [twitter:3windmills], who is on the team working on BibFrame:

After a couple of tweets, Dan tweeted the following:


There followed some agreement that it's often helpful to have apps driving the development of standards. (Kevin agrees with this, and points to BibFrame's process.) But Dan's comment clarified my understanding of why ontologies make me nervous.

Over the past hundred years or so, we've come to a general recognition that all classifications and categorizations are tools, not representations of The Real Order. The periodic table of the elements is a useful way of organizing information, and manifests real relationships among the elements, but it is not the single "real" way the elements are arranged; if you're an economist or an industrialist, a chart that arranges the elements based on where they exist on our planet might be just as valid. Likewise, Linnaeus' classification scheme is useful and manifests some real relationships, but if you're a chef you might have a different way of carving up the animal kingdom. Linnaeus chose to organize species based upon visible differences — which might not be the "essential" differences — so that his scheme would be useful to scientists in the field. Although he was sometimes ambiguous about this, he seems not to have thought that he was discerning God's own order. Since Linnaeus we have become much more explicit in our understanding that how we classify depends on what we're trying to accomplish.

For example, a DTD (document type definition) typically is designed not to capture the eternal essence of some type of document, but to make the document more usable by systems that automate the document's production and processing. An industry might agree, for instance, on a DTD for parts catalogs that specifies that a parts catalog must have an element called "part" and that a part must have a type, part number, length, height, weight, material, and a description, and optionally can note whether it turns clockwise or counterclockwise. Each of these elements would have a standard name (e.g., "part_number," not "part#"). The result is a document that describes parts in a standard way so that a company can receive descriptions from all of its suppliers and automatically build a database of the parts it uses.
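To make that concrete, here's a minimal, hypothetical sketch: a supplier's catalog marked up with the agreed-upon element names, and a few lines on the buyer's side that fold any such catalog into a simple parts database. The element names and the sample part are invented for illustration, not drawn from a real industry DTD.

```python
# A minimal, hypothetical sketch: a supplier's catalog marked up with the
# agreed-upon element names, and the buyer's script that folds it into a
# simple parts database. The element names and values are invented.
import xml.etree.ElementTree as ET

supplier_catalog = """
<parts_catalog supplier="Acme Fasteners">
  <part>
    <type>bolt</type>
    <part_number>BX-1043</part_number>
    <length unit="mm">40</length>
    <height unit="mm">8</height>
    <weight unit="g">12</weight>
    <material>stainless steel</material>
    <description>Hex-head bolt, coarse thread</description>
    <rotation>clockwise</rotation>
  </part>
</parts_catalog>
"""

parts_db = {}
root = ET.fromstring(supplier_catalog)
for part in root.findall("part"):
    number = part.findtext("part_number")
    # Because every supplier uses the same element names, this same loop
    # works on every catalog the company receives.
    parts_db[number] = {child.tag: child.text for child in part}

print(parts_db["BX-1043"]["material"])  # -> stainless steel
```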

A DTD therefore is designed with an eye toward what properties are going to be useful. In some industries, it might include a term that captures how shiny the part is, but if it's a DTD for surgical equipment, that may not be relevant enough to include...although "sanitary_packaging" might be. Likewise, how quickly a bolt transfers heat might seem irrelevant, at least until NASA places an order. In this respect, DTDs are much like forms: You don't put a field for earlobe length in the college application form you're designing.

Ontologies are different. They can try to express the structure of a domain independent of any particular use, so that the widest variety of applications can share data, including apps from domains outside of the one that's been mapped. So, to use Dan's example, your ontology of jobs would note that jobs have employers and workers, that they may have a salary or other form of compensation, that they can be part-time, full-time, seasonal, etc. As an ontology designer, because you're trying to think beyond whatever applications you already can imagine, your aim (often, not always) is to provide the fullest possible set of slots just in case someone sometime needs that info. And you will carefully describe the relationships among the elements so that apps and researchers can use knowledge that is implicit in the model.
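To make the contrast concrete, here's a rough sketch of what a small fragment of such a jobs ontology might look like as RDF triples, using Python's rdflib. The class and property names are invented for illustration; they aren't taken from any published vocabulary.

```python
# A rough, hypothetical fragment of a "jobs" ontology as RDF triples,
# built with rdflib. The schema describes the domain's relationships
# (jobs have employers, workers, compensation, a schedule) without
# presuming any particular application that will consume the data.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/jobs#")
g = Graph()
g.bind("ex", EX)

# Schema-level statements
g.add((EX.Job, RDF.type, RDFS.Class))
g.add((EX.employer, RDF.type, RDF.Property))
g.add((EX.employer, RDFS.domain, EX.Job))
g.add((EX.employer, RDFS.range, EX.Organization))
g.add((EX.worker, RDFS.domain, EX.Job))
g.add((EX.worker, RDFS.range, EX.Person))
g.add((EX.compensation, RDFS.domain, EX.Job))
g.add((EX.schedule, RDFS.domain, EX.Job))  # e.g., part-time, full-time, seasonal

# Instance data: any application that understands the schema can use it,
# including ones from domains the ontology's designers never imagined.
g.add((EX.job42, RDF.type, EX.Job))
g.add((EX.job42, EX.employer, EX.acme_corp))
g.add((EX.job42, EX.schedule, Literal("seasonal")))

print(g.serialize(format="turtle"))
```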

The line between DTDs and ontologies is fuzzy. Many ontologies are designed with classes of apps in mind, and some DTDs have tried to be hugely general purpose. My discomfort really comes down to a distrust of the concept of "knowledge representation" that underlies some ontologies (especially earlier ones). The complexity of the relationships among parts will always outstrip our attempts to capture and codify those relationships. Further, knowledge cannot be fully represented because it isn't a thing apart from our continuous invention, discovery, and engagement with it.

What it comes down to is that if you talk about ontologies as knowledge representations I'll mutter something under my breath and change the topic.


June 19, 2013

[lodlam] Dean Krafft on VIVO

Dean Krafft of Cornell talks about the status of VIVO, an interdisciplinary tool to help researchers discover one another.

This is from the LODLAM conference in Montreal.


April 25, 2013

[eim][misc] Too big to categorize

Amanda Filipacchi has a great post at the New York Times about the problem with classifying American female novelists as American female novelists. That’s been going on at Wikipedia, with the result that the category American novelist was becoming filled predominantly with male novelists.

Part of this is undoubtedly due to the dumb sexism that thinks that “normal” novelists are men, and thus women novelists need to be called out. And even if the category male novelist starts being used, it still assumes that gender is a primary way of dividing up novelists, once you’ve segregated them by nation. Amanda makes both points.

From my point of view, the problem is inherent in hierarchical taxonomies. They require making decisions not only about the useful ways of slicing up the world, but also about which slices come first. These cuts reflect cultural and political values and have cultural and political consequences. They also get in the way of people who are searching with a different way of organizing the topic in mind. In a case like this, it’d be far better to attach tags to Wikipedia articles so that people can search using whatever parameters they need. That way we get better searchability, and Wikipedia hasn’t put itself in the impossible position of coming up with a taxonomy that is neutral to all points of view.
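To illustrate the difference, here's a toy sketch of tag-based lookup, in which a search is just a set intersection and no single category has to come "first." The tag assignments are invented for illustration, not pulled from Wikipedia.

```python
# A toy sketch of tag-based lookup: each article carries a set of tags, and
# a search is a set intersection, so no one slicing of the world has to be
# privileged as the top of a hierarchy. Tags here are invented examples.
ARTICLES = {
    "Toni Morrison":   {"novelist", "american", "woman", "nobel-laureate"},
    "Philip Roth":     {"novelist", "american", "man"},
    "Margaret Atwood": {"novelist", "canadian", "woman"},
}

def search(*tags):
    wanted = set(tags)
    return [title for title, article_tags in ARTICLES.items() if wanted <= article_tags]

print(search("novelist", "american"))  # slice by nationality
print(search("novelist", "woman"))     # or by gender, or by any other parameter
```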

Wikipedia’s categories have been broken for a long time. We know this in the Library Innovation Lab because a couple of years ago we tried to find every article in Wikipedia that is about a book. In theory, you can just click on the “Book” category. In practice, the membership is not comprehensive. The categories are inconsistent and incomplete. It’s just a mess.

It may be that a massive crowd cannot develop a coherent taxonomy because of the differences in how people think about things. Maybe the crowd isn’t massive enough. Or maybe the process just needs far more guidance and regulation. But even if the crowd can bring order to the taxonomy, I don’t believe it can bring neutrality, because taxonomies are inherently political.

There are problems with letting people tag Wikipedia articles. Spam, for example. And without constraints, people can lard up an object with tags that are meaningful only to them, offensive, or wrong. But there are also social mechanisms for dealing with that. And we’ve been trained by the Web to lower our expectations about the precision and recall afforded by tags, whereas our expectations are high for taxonomies.

Go tags.


April 18, 2013

[misc] StackLife goes live – visually browse millions of books

I’m very proud to announce that the Harvard Library Innovation Lab (which I co-direct) has launched what we think is a useful and appealing way to browse books at scale. This is timed to coincide with the launch today of the Digital Public Library of America. (Congrats, DPLA!!!)

StackLife (née ShelfLife) shows you a visualization of books on a scrollable shelf, which we turn sideways so you can read the spines. It always shows you books in a context, on the ground that no book stands alone. You can shift the context instantly, so that you can (for example) see a work on a shelf with all the other books classified under any of the categories professional cataloguers have assigned to it.

We also heatmap the books according to various usage metrics (“StackScore”), so you can get a sense of the work’s community relevance.

There are lots more features, and lots more to come.

We’ve released two versions today.

StackLife DPLA mashes up the books in the Digital Public Library of America’s collection (from the Biodiversity Heritage Library) with books from The Internet Archive’s Open Library and HathiTrust. These are all online, accessible books, so you can just click and read them. There are 1.7M books in the StackLife DPLA metacollection. (Development was funded in part by a Sprint grant from the DPLA. Thank you, DPLA!)

StackLife Harvard lets you browse the 12.3M books and other items in the Harvard Library system’s 73 libraries and off-campus repository. This is much less about reading online (unfortunately) than about researching what’s available.

Here are some links:

StackLife DPLA: http://stacklife-dpla.law.harvard.edu
StackLife Harvard: http://stacklife.law.harvard.edu
The DPLA press release: http://library.harvard.edu/stacklife-browse-read-digital
The DPLA version FAQ: http://stacklife-dpla.law.harvard.edu/#faq/

The StackLife team has worked long and hard on this. We’re pretty durn proud:

Annie Cain
Paul Deschner
Kim Dulin
Jeff Goldenson
Matthew Phillips
Caleb Troughton


April 16, 2013

[misc][2b2k] Making Twitter better for disasters

I had both CNN and Twitter on yesterday all afternoon, looking for news about the Boston Marathon bombings. I have not done a rigorous analysis (nor will I, nor have I ever), but it felt to me that Twitter put forward more claims, and more varied claims, about the situation, and reacted faster to misstatements. CNN plodded along, but didn’t feel more reliable overall. This seems predictable given the unfiltered (or post-filtered) nature of Twitter.

But Twitter also ran into some scaling problems for me yesterday. I follow about 500 people on Twitter, which gives my stream a pace and variety that I find helpful on a normal day. But yesterday afternoon, the stream roared by, and approached filter failure. A couple of changes would help:

First, let us sort by most retweeted. When I’m in my “home stream,” let me choose a frequency of tweets so that the scrolling doesn’t become unwatchable; use the frequency to determine the threshold for the number of retweets required. (Alternatively: simply highlight highly re-tweeted tweets.)

Second, let us mute by hashtag or by user. Some Twitter cascades I just don’t care about. For example, I don’t want to hear play-by-plays of the World Series, and I know that many of the people who follow me get seriously annoyed when I suddenly am tweeting twice a minute during a presidential debate. So let us temporarily suppress tweet streams we don’t care about.
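Here's a minimal sketch of what those two filters might look like applied client-side to a stream of tweet records. The record fields, the retweet threshold, and the mute lists are all hypothetical; this is not Twitter's actual API.

```python
# A minimal sketch of the two filters, applied client-side to a stream of
# tweet records. The record fields, threshold, and mute lists are all
# hypothetical; this is not Twitter's actual API.
RETWEET_THRESHOLD = 50                        # derived from the scroll rate I can stand
MUTED_HASHTAGS = {"worldseries"}
MUTED_USERS = {"overexcited_debate_tweeter"}

def keep(tweet):
    if tweet["user"] in MUTED_USERS:
        return False
    if any(tag.lower() in MUTED_HASHTAGS for tag in tweet["hashtags"]):
        return False
    # During a flood, only let through tweets the crowd has vouched for.
    return tweet["retweets"] >= RETWEET_THRESHOLD

stream = [
    {"user": "local_reporter", "hashtags": ["bostonmarathon"], "retweets": 340,
     "text": "Police confirm two explosions near the finish line."},
    {"user": "overexcited_debate_tweeter", "hashtags": [], "retweets": 12,
     "text": "Zinger!"},
    {"user": "baseball_fan", "hashtags": ["WorldSeries"], "retweets": 900,
     "text": "What a catch!"},
]

for tweet in filter(keep, stream):
    print(tweet["user"], "-", tweet["text"])
```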

It is a lesson of the Web that as services scale up, they need to provide more and more ways of filtering. Twitter had “follow” as an initial filter, and users then came up with hashtags as a second filter. It’s time for a new round as Twitter becomes an essential part of our news ecosystem.


July 19, 2012

[2b2k][eim]Digital curation

I’m at the “Symposium on Digital Curation in the Era of Big Data” held by the Board on Research Data and Information of the National Research Council. These liveblog notes cover (in some sense — I missed some folks, and have done my usual spotty job on the rest) the morning session. (I’m keynoting in the middle of it.)

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.


Alan Blatecky [pdf] from the National Science Foundation says science is being transformed by Big Data. [I can't see his slides from the panel at front.] He points to the increase in the volume of data, but we haven’t paid enough attention to the longevity of the data. And, he says, some data is centralized (LHC) and some is distributed (genomics). And, our networks are unable to transport large amounts of data [see my post], making where the data is located quite significant. NSF is looking at creating data infrastructures. “Not one big cloud in the sky,” he says. Access, storage, services — how do we make that happen and keep it leading edge? We also need a “suite of policies” suitable for this new environment.


He closes by talking about the Data Web Forum, a new initiative to look at a “top-down governance approach.” He points positively to the IETF’s “rough consensus and running code.” “How do we start doing that in the data world?” How do we get a balanced representation of the community? This is not a regulatory group; everything will be open source, and progress will be through rough consensus. They’ve got some funding from gov’t groups around the world. (Check CNI.org for more info.)


Now Josh Greenberg from the Sloan Foundation. He points to the opportunities presented by aggregated Big Data: the effects on social science, on libraries, etc. But the tools aren’t keeping up with the computational power, so researchers are spending too much time mastering tools, plus it can make reproducibility and provenance trails difficult. Sloan is funding some technical approaches to increasing the trustworthiness of data, including in publishing. But Sloan knows that this is not purely a technical problem. Everyone is talking about data science. Data scientist defined: Someone who knows more about stats than most computer scientists, and can write better code than typical statisticians :) But data science needs to better understand stewardship and curation. What should the workforce look like so that the data-based research holds up over time? The same concerns apply to business decisions based on data analytics. The norms that have served librarians and archivists of physical collections now apply to the world of data. We should be looking at these issues across the boundaries of academics, science, and business. E.g., economics research now rests on data from Web businesses, US Census, etc.

[I couldn't liveblog the next two — Michael and Myron — because I had to leave my computer on the podium. The following are poor summaries.]

Michael Stebbins, Assistant Director for Biotechnology in the Office of Science and Technology Policy in the White House, talked about the Administration’s enthusiasm for Big Data and open access. It’s great to see this degree of enthusiasm coming directly from the White House, especially since Michael is a scientist and has worked for mainstream science publishers.


Myron Gutmann, Ass’t Dir of the National Science Foundation, likewise expressed commitment to open access, and said that there would be an announcement in Spring 2013 that in some ways will respond to the recent UK and EC policies requiring the open publishing of publicly funded research.


After the break, there’s a panel.


Anne Kenney, Dir. of Cornell U. Library, talks about the new emphasis on digital curation and preservation. She traces this back at Cornell to 2006, when an E-Science task force was established. She thinks we now need to focus on e-research, not just e-science. She points to Walters and Skinner’s “New Roles for New Times: Digital Curation for Preservation.” When it comes to e-research, Anne points to the need for metadata stabilization, harmonizing applications, and collaboration in virtual communities. Within the humanities, she sees more focus on curation, the effect of the teaching environment, and more of a focus on scholarly products (as opposed to the focus on scholarly process, as in the scientific environment).


She points to Youngseek Kim et al. “Education for eScience Professionals“: digital curators need not just subject domain expertise but also project management and data expertise. [There's lots of info on her slides, which I cannot begin to capture.] The report suggests an increasing focus on people-focused skills: project management, bringing communities together.


She very briefly talks about Mary Auckland’s “Re-Skilling for Research” and Williford and Henry, “One Culture: Computationally Intensive Research in the Humanities and Sciences.”


So, what are research libraries doing with this information? The Association of Research Libraries has a job announcements database. And Tito Sierra did a study last year analyzing 2011 job postings. He looked at 444 job descriptions. 7.4% of the jobs were “newly created or new to the organization.” The number of new management-level positions was significantly higher, while subject specialist jobs were under-represented.


Anne went through Tito’s data and found 13.5% have “digital” in the title. There were more digital humanities positions than e-science. She posts a list of the new titles jobs are being given, and they’re digilicious. 55% of those positions call for a library science degree.


Anne concludes: It’s a growth area, with responsibilities more clearly defined in the sciences. There’s growing interest in serving the digital humanists. “Digital curation” is not common in the qualifications nomenclature. MLS or MLIS is not the only path. There’s a lot of interest in post-doctoral positions.


Margarita Gregg of the National Oceanic and Atmospheric Administration begins by talking about challenges in the era of Big Data. They produce about 15 petabytes of data per year. It’s not just about Big Data, though. They are very concerned with data quality. They can’t preserve all versions of their datasets, and it’s important to keep track of the provenance of that data.


Margarita directs one of NOAA’s data centers that acquires, preserves, assembles, and provides access to marine data. They cannot preserve everything. They need multi-disciplinary people, and they need to figure out how to translate this data into products that people need. In terms of personnel, they need: Data miners, system architects, developers who can translate proprietary formats into open standards, and IP and Digital Rights Management experts so that credit can be given to the people generating the data. Over the next ten years, she sees computer science and information technology becoming the foundations of curation. There is no currently defined job called “digital curator” and that needs to be addressed.


Vicki Ferrini at the Lamont-Doherty Earth Observatory at Columbia University works on data management, metadata, discovery tools, educational materials, best practice guidelines for optimizing acquisition, and more. She points to the increased communication between data consumers and producers.


As data producers, the goal is scientific discovery: data acquisition, reduction, assembly, visualization, integration, and interpretation. And then you have to document the data (= metadata).


Data consumers: They want data discoverability and access. Increasingly they are concerned with the metadata.


The goal of data providers is to provide access, preservation, and reuse. They care about data formats, metadata standards, interoperability, the diverse needs of users. [I've abbreviated all these lists because I can't type fast enough.]


At the intersection of these three domains is the data scientist. She refers to this as the “data stewardship continuum” since it spans all three. A data scientist needs to understand the entire life cycle, have domain experience, and have technical knowledge about data systems. “Metadata is key to all of this.” Skills: communication and organization, understanding the cultural aspects of the user communities, people and project management, and a balance between micro- and macro perspectives.


Challenges: Hard to find the right balance between technical skills and content knowledge. Also, data producers are slow to join the digital era. Also, it’s hard to keep up with the tech.


Andy Maltz, Dir. of the Science and Technology Council of the Academy of Motion Picture Arts and Sciences. AMPAS is about arts and sciences, he says, not about The Business.


The Science and Technology Council was formed in 2005. They have lots of data they preserve. They’re trying to build the pipeline for next-generation movie technologists, but they’re falling behind, so they have an internship program and a curriculum initiative. He recommends we read their study The Digital Dilemma. It says that there’s no digital solution that meets film’s requirement to be archived for 100 years at a low cost. It costs $400/yr to archive a film master vs $11,000 to archive a digital master (as of 2006) because of labor costs. [Did I get that right?] He says collaboration is key.


In January they released The Digital Dilemma 2. It found that independent filmmakers, documentarians, and nonprofit audiovisual archives are loosely coupled, widely dispersed communities. This makes collaboration more difficult. The efforts are also poorly funded, and people often lack technical skills. The report recommends the next gen of digital archivists be digital natives. But the real issue is technology obsolescence. “Technology providers must take archival lifetimes into account.” Also system engineers should be taught to consider this.


He highly recommends the Library of Congress’ “The State of Recorded Sound Preservation in the United States,” which rings an alarm bell. He hopes there will be more doctoral work on these issues.


Among his controversial proposals: Require higher math scores for MLS/MLIS students since they tend to score lower than average on that. Also, he says that the new generation of content creators has no curatorial awareness. Executives and managers need to know that this is a core business function.


Demand side data points: 400 movies/year at 2PB/movie. CNN has 1.5M archived assets, and generates 2,500 new archive objects/wk. YouTube: 72 hours of video uploaded every minute.


Takeaways:

  • Show business is a business.

  • Need does not necessarily create demand.

  • The nonprofit AV archive community is poorly organized.

  • Next gen needs to be digital natives with strong math and sci skills.

  • The next gen of executive leaders needs to understand the importance of this.

  • Digital curation and long-term archiving need a business case.


Q&A


Q: How about linking the monetary value of the metadata to the metadata? That would encourage the generation of metadata.


Q: Weinberger paints a picture of a flexible world of flowing data, and now we’re back in the academic, scientific world where you want good data that lasts. I’m torn.


A: Margarita: We need to look at how the data are being used. Maybe in some circumstances the quality of the data doesn’t matter. But there are other instances where you’re looking for the highest quality data.


A: [audience] In my industry, one person’s outtakes are another person’s director’s cuts.


A: Anne: In the library world, we say that if a little metadata is good, a lot of it would be great. We need to step away from trying to capture the most and toward capturing the most useful (since we can’t capture everything). And how do you produce data in a way that’s opened up to future users, as well as being useful for its primary consumers? It’s a very interesting balance that needs to be struck. Maybe the short-term need ranks higher and the long-term lower.


A: Vicki: The scientists I work with use discrete data sets, spreadsheets, etc. As we go along we’ll have new ways to check the quality of datasets, so we can use the messy data as well.


Q: Citizen curation? E.g., a lot of antiques are curated by being put into people’s attics… Not sure what that might imply as a model. Two parallel models?


A: Margarita: We’re going to need to engage anyone who’s interested. We need to incorporate citizen curation.


A: Anne: That’s already underway where people have particular interests. E.g., Cornell’s Lab of Ornithology where birders contribute heavily.


Q: What one term will bring people info about this topic?


A: Vicki: There isn’t one term, which speaks to the linked data concept.


Q: How will you recruit people from all walks of life to have the skills you want?


A: Andy: We need to convince people way earlier in the educational process that STEM is cool.


A: Anne: We’ll have to rely to some degree on post-hire education.


Q: My shop produces and integrates lots of data. We need people with domain and computer science skills. They’re more likely to come out of the domains.


A: Vicki: As long as you’re willing to take the step across the boundary, it doesn’t matter which side you start from.


Q: 7 yrs ago in library school, I was told that you need to learn a little programming so that you understand it. I didn’t feel like I had to add a whole other profession on to the one I was studying.


July 4, 2012

[eim] XKCD goes miscellaneous

Except Randall Munroe thinks going miscellaneous means giving up, rather than embracing the new organizational possibilities of blah blah blah.

(I am, of course, an awestruck fan of XKCD.)


May 7, 2012

[everythingismisc] Scaling Japan

MetaFilter popped up a three-year-old post from Derek Sivers about how street addresses work in Japan. The system does a background-foreground duck-rabbit Gestalt flip on Western addressing schemes. I’d already heard about it — book-larnin’ because I’ve never been to Japan — but the post got me thinking about how things scale up.

What we would identify by street address, the Japanese identify by house number within a block name. Within a block, the addresses are non-sequential, reflecting instead the order of construction.

I can’t remember where I first read about this (I’m pretty sure I wrote about it in Everything Is Miscellaneous), but it pointed out some of the assumptions and advantages of this system: it assumes local knowledge, confuses invaders, etc. But my reaction then was the same as when I read Derek’s post this morning: Yeah, but it doesn’t scale. Confusing invaders is a positive outcome of a failure to scale, but getting tourists lost is not. The math just doesn’t work: 4 streets intersected by 4 avenues creates 9 blocks, but add just 2 more streets and 2 more avenues and you’ve enclosed another 16 blocks. So, to navigate a large Western city you have to know many, many fewer streets and avenues than the number of existing blocks.
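Here's the back-of-the-envelope arithmetic behind that claim, as a quick sketch: with s streets crossing a avenues, the grid encloses (s - 1) * (a - 1) blocks, so the block count grows roughly quadratically while the list of names to memorize grows only linearly.

```python
# With s streets crossing a avenues, the grid encloses (s - 1) * (a - 1)
# blocks: the block count grows roughly quadratically while the number of
# street/avenue names to memorize grows only linearly.
def blocks(streets, avenues):
    return (streets - 1) * (avenues - 1)

print(blocks(4, 4))    # 9 blocks from 8 named streets/avenues
print(blocks(6, 6))    # 25 blocks (16 more) from just 4 more names
print(blocks(40, 40))  # 1521 blocks from only 80 names
```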

But of course I’m wrong. Tokyo hasn’t fallen apart because there are too many blocks to memorize. Clearly the Japanese system does scale.

In part that’s because, according to the Wikipedia article on it, blocks are themselves located within a nested set of named regions. So you can pop up the geographic hierarchy to a level where there are fewer entities in order to get a more general location, just as we do with towns, counties, states, countries, the solar system, the galaxy, the universe.

But even without that, the Japanese system scales in ways that peculiarly mirror how the Net scales. Computers have scaled information in the Western city way: bits are tucked into chunks of memory that have sequential addresses. (At least they did the last time I looked, in 1987.) But the Internet moves packets to their destinations much the way a Japanese city’s inhabitants might move inquiring visitors along: You ask someone (whom we will call Ms. Router) how to get to a particular place, and Ms. Router sends you in a general direction. After a while you ask another person. Bit by bit you get closer, without anyone having a map of the whole.
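A toy sketch of that hop-by-hop idea, with invented place names standing in for routers: each node knows only the next step toward a destination, and nobody holds the whole map.

```python
# A toy illustration of hop-by-hop forwarding: each "router" knows only the
# next hop toward each broad destination, not the whole map. The place
# names and routes are invented for illustration.
NEXT_HOP = {
    "corner_shop":  {"harbor_district": "noodle_stand"},
    "noodle_stand": {"harbor_district": "fish_market"},
    "fish_market":  {"harbor_district": "harbor_district"},
}

def walk(start, destination):
    here = start
    path = [here]
    while here != destination:
        here = NEXT_HOP[here][destination]  # ask whoever is nearby for the next step
        path.append(here)
    return path

print(walk("corner_shop", "harbor_district"))
# ['corner_shop', 'noodle_stand', 'fish_market', 'harbor_district']
```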

At the other end of the stack of abstraction, computers have access to such absurdly large amounts of information either locally or in the cloud — and here namespaces are helpful — that storing the block names and house numbers for all of Tokyo isn’t such a big deal. Point your mobile phone to Google Maps’ Tokyo map if you need proof. With enough memory, we do not need to scale physical addresses by using schemes that reduce them to streets and avenues. We can keep the arrangement random and just look stuff up. In the same way, we can stock our warehouses in a seemingly random order and rely on our computers to tell us where each item is; this has the advantage of letting us put the most requested items up front, or on the shelves that require humans to do the least bending or stretching.

So, I’m obviously wrong. The Japanese system does scale. It just doesn’t scale in the ways we used when memory spaces were relatively small.

