Joho the Blog » everythingIsMiscellaneous

April 3, 2011

Social tagging games ‘n research

The GiveALink link-sharing site has posted two games that are actually research studies.

The first game is GiveALink Slider, which the site says “is an interesting online tagging game in which you must annotate webpages with related tags and choose new webpages. You can accumulate points and win badges by accomplishing tasks and building links with other players.” They are giving iPods to the winners. It’s actually a study called “Social Annotations through Game Play” conducted by the Networks and Agents Network in the Center for Complex Networks and Systems Research of the Indiana University School of Informatics.
Here’s the description of the second game:

Great Minds Think Alike is a word association game that lets users build semantic concept networks and explore similarity relations.

Players form a chain of semantically related words, which comes from the GiveALink knowledge base. Users can browse through nine different social media, e.g. Flickr and Youtube, and earn points.

Words are geo-tagged, which helps to analyze the geographical distribution of terms. Players can also connect with other players via Facebook as suggested by the game.

Data from the game is collected by GiveALink.org to make the game more fun, support other social tagging applications, and for study purposes.

No, I don’t actually understand how either game works, and I haven’t signed up for them because the first one is a study that I don’t want to commit to and the second requires an iPhone. But, the GiveALink service is interesting. It’s an open bookmark-sharing service that also feeds a research program. [Hat tip to Julianne Chatelain.]



March 15, 2011

Can there be too much information? And what would it be too much of?

As PR for an upcoming appearance by James Gleick, whose new book The Information I am greatly looking forward to reading, Zocalo Public Square asked four or five folks “Can there be too much information?” It’s an interesting collection of responses. (Well, mine excepted.)

And underneath these interesting-in-themselves essays runs a different question when they are taken together: What the heck do we mean by “information” anyway? I’m not sure any of the respondents is defining it in the same way. The ways include: opinions, raw data, words, ideas, photos, switches and dials, and books. Of course, some of these are containers of information or examples of information. But they do not reduce to a single definition. (I believe Gleick’s book is at least in part about this ambiguity about information. It’s also something I’ve been researching for the past couple of years.)

As far as my contribution goes, I had to decide whether to provide an Everything Is Miscellaneous answer (we are learning to organize info in new ways) or a Too Big to Know answer (the quantity of info is changing the nature of knowledge). I went with the new book rather than the old, if only because I wrote the tiny essay within minutes after finishing revising the book manuscript.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • gleick Date: March 15th, 2011 dw

1 Comment »

March 4, 2011

[2b2k] Tagging big data

According to an article in Science Insider by Dennis Normile, a group formed at a symposium sponsored by the Board on Global Science and Technology, of the National Research Council, an arm of the U.S. National Academies [that’s all they’ve got??] is proposing making it easier to find big scientific data sets by using a standard tag, along with a standard way of conveying the basic info about the nature of the set, and its terms of use. “The group hopes to come up with a protocol within a year that researchers creating large data sets will voluntarily adopt. The group may also seek the endorsement of the Internet Engineering Task Force…”


Categories: everythingIsMiscellaneous, science, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • science • standards Date: March 4th, 2011 dw

2 Comments »

February 20, 2011

[2b2k] Public data and metadata, Google style

I’m finding Google Labs’ Dataset Publishing Language (DSPL) pretty fascinating.

Upload a set of data, and it will do some semi-spiffy visualizations of it. (As Apryl DeLancey points out, Martin Wattenberg and Fernanda Viegas now work for Google, so if they’re working on this project, the visualizations are going to get much better.) More important, the data you upload is now publicly available. And, more important than that, the site wants you to upload your data in Google’s DSPL format. DSPL aims at getting more metadata into datasets, making them more understandable, integrate-able, and re-usable.

So, let’s say you have spreadsheets of “statistical time series for unemployment and population by country, and population by gender for US states.” (This is Google’s example in its helpful tutorial.)

You would supply a set of concepts (“population”), each with a unique ID (“pop”), a data type (“integer”), and explanatory information (“name=population”, “definition=the number of human beings in a geographic area”). Other concepts in this example include country, gender, unemployment rate, etc. [Note that I’m not using the DSPL syntax in these examples, for purposes of readability.]
For concepts that have some known set of members (e.g., countries, but not unemployment rates), you would create a table — a spreadsheet in CSV format — of entries associated with that concept.
If your dataset uses one of the familiar types of data, such as a year, geographical position, etc., you would reference the “canonical concepts” defined by Google.
You create a “slice” or two, that is, “a combination of concepts for which data exists.” A slice references a table that consists of concepts you’ve already defined and the pertinent values (“dimensions” and “metrics” in Google’s lingo). For example, you might define a “countries slice” table that on each row lists a country, a year, and the country’s population in that year. This table uses the unique IDs specified in your concepts definitions.
Finally, you can create a dataset that defines topics hierarchically so that users can more easily navigate the data. For example, you might want to indicate that “population” is just one of several characteristics of “country.” Your topic dataset would define those relations. You’d indicate that your “population” concept is defined in the topic dataset by including the “population topic” ID (from the topic dataset) in the “population” concept definition.
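To make the pieces above a bit more concrete, here is a minimal sketch in Python rather than in DSPL’s actual XML syntax. The class names, field names, and sample figures are all hypothetical illustrations, not part of Google’s format. The sketch only shows how concepts, a slice, and the slice’s table rows refer to one another by ID.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """A named idea in the dataset, e.g. population or country."""
    id: str
    name: str
    data_type: str
    definition: str = ""
    topic: Optional[str] = None  # optional ID pointing into a topic hierarchy


@dataclass
class Slice:
    """A combination of concepts for which data exists."""
    id: str
    dimensions: list  # concept IDs that identify a row (country, year)
    metrics: list     # concept IDs that are measured (pop)
    rows: list = field(default_factory=list)

    def validate(self, concepts):
        """Return a list of problems; an empty list means the slice is consistent."""
        problems = []
        columns = self.dimensions + self.metrics
        for concept_id in columns:
            if concept_id not in concepts:
                problems.append(f"undefined concept: {concept_id}")
        for i, row in enumerate(self.rows):
            missing = [c for c in columns if c not in row]
            if missing:
                problems.append(f"row {i} lacks {missing}")
        return problems


# Concept definitions, keyed by their unique IDs.
concepts = {
    c.id: c
    for c in [
        Concept("country", "country", "string"),
        Concept("year", "year", "date"),
        Concept("pop", "population", "integer",
                definition="the number of human beings in a geographic area",
                topic="population_topic"),
    ]
}

# A "countries slice": one row per country and year, with made-up figures.
countries_slice = Slice(
    id="countries_slice",
    dimensions=["country", "year"],
    metrics=["pop"],
    rows=[
        {"country": "AA", "year": 2000, "pop": 100},
        {"country": "AA", "year": 2001, "pop": 110},
    ],
)

print(countries_slice.validate(concepts))
```

The point of the `validate` step is the one the format itself makes: a table is only meaningful if every column in it maps back to a concept that has been defined somewhere.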

When you’re done, you have a data set you can submit to Google Public Data Explorer, where the public can explore your data. But, more important, you’ve created a dataset in an XML format that is designed to be rich in explanatory metadata, is portable, and is able to be integrated into other datasets.

Overall, I think this is a good thing. But:

Google is making its formats public, and even its canonical definitions are downloadable, so DSPL is “fully open” for use, but it is fully Google’s to define. Having the 800-lb gorilla defining the standard is efficient and provides the public platform that will encourage acceptance. And because the datasets are in XML, Google Public Data Explorer is not a roach motel for data. Still, it’d be nice if we could influence the standard more directly than via an email-the-developers text box.
Defining topics hierarchically is a familiar and useful model. I’m curious about the discussions behind the scenes about whether to adopt or at least enable ontologies as well as taxonomies.
Also, I’m surprised that Google has not built into this standard any expectation that data will be sourced. Suppose the source of your US population data is different from the source of your European unemployment statistics? Of course you could add links into your XML definitions of concepts and slices. But why isn’t that a standard optional element?
Further (and more science fictional), it’s becoming increasingly important to be able to get quite precise about the sources of data. For example, in the library world, the bibliographic data in MARC records often comes from multiple sources (local cataloguers, OCLC, etc.) and it is turning out to be a tremendous problem that no one kept track of who put which datum where. I don’t know how or if DSPL addresses the sourcing issue at the datum level. I’m probably asking too much. (At least Google didn’t include a copyright field as standard for every datum.)

Overall, I think it’s a good step forward.


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • big data • everythingIsMiscellaneous • google • metadata • standards • xml Date: February 20th, 2011 dw

1 Comment »

February 10, 2011

[misc] The US GAAP Taxonomy is Miscellaneous

Well, here’s an application of some of the ideas in Everything is Miscellaneous that I wasn’t expecting: The US GAAP Taxonomy. A post at the XBRL Business Information Exchange says:

The US GAAP Taxonomy was built by the accounting standards setter, the FASB. It was built by accountants. It is a consensus-based product. Not one SEC XBRL filer uses the US GAAP Taxonomy as is to file with the SEC. Every SEC [filer] reorganizes the US GAAP Taxonomy.

But the US GAAP Taxonomy is not built to be reorganized. The structure of the taxonomy is more like a book. Can the US GAAP Taxonomy be reorganized? Of course it can. But it is certainly not optimized to allow for reorganization and reorganization is not even mentioned in the design characteristics. As such, it will cost more and be harder to create and maintain these reorganizations.

So how do you make it easier to reorganize? Many smaller pieces which can be put together as needed is vastly easier for a computer to deal with than having one large piece and trying to break that piece apart. That is one example of what can be done. Another is communicating the metadata which exists in the taxonomy, for example the information modeling patterns employed. A third is to make the existing metadata real metadata, rather than burying it in the labels of the concepts. Another is to add more metadata.

The post points out that it’s not that everything about that taxonomy should be thrown into a big pile. There are key data points required by law and to achieve financial integrity. Still, this is not a place I would have thought miscellanizing would help. It seems, however, that I may well be happily wrong.


Categories: business, everythingIsMiscellaneous Tagged with: business • everythingIsMiscellaneous • finance • gaap • sec • taxonomies • xbrl Date: February 10th, 2011 dw

1 Comment »

February 7, 2011

[2b2k][misc] Choose your ski resort authority

Great Ski Holidays lets you search for a place you want to go skiing using a faceted system, so you can specify tags such as alpine, beginner, nightlife, and spa. (For my ideal ski resort, the tags would be: free, low, and indoors.) It seems well done, but the thing I really like about it is that you can choose which authorities you want to use: ski review sites, ski resorts & club sites, trade sites & tour operators, and (coming soon) reader reviews.

The site started out as a demo of “Authority Driven Facet Tags” by an enterprise search agency called Metaphor Search. It went so well that they opened it up to the Web public, although it still shows some signs of its demo origins, including some typos, etc. It just adds to the charm.

One of their blog posts actually credits Everything Is Miscellaneous as one of the inspirations, which makes me happy. The post says part of the impetus for developing a faceted system with configurable authorities was experiencing the difficulty of coming up with a single, uncontested geographical classification for the Maldives: Asia? Indian Ocean? And it got worse when they tried to come up with a taxonomy of destination types. So, rather than try to figure out what each user’s unexpressed taxonomy is, they decided to let the user decide which authorities to trust and use those authorities’ ways of divvying up the world. Clever, and not unlike the multi-taxonomy approach taken by some species-of-the-world sites.
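Here is a rough sketch of the idea in Python: each authority tags resorts in its own way, and a search only counts tags from the authorities the user has chosen to trust. The resort names, authority names, and tags are invented for illustration, and the rule that a tag counts if any trusted authority applied it is my assumption. This is not how Metaphor Search actually implements the site.

```python
# authority -> resort -> set of facet tags (all hypothetical sample data)
TAGS_BY_AUTHORITY = {
    "review_sites": {
        "Resort A": {"alpine", "beginner", "nightlife"},
        "Resort B": {"alpine", "spa"},
    },
    "tour_operators": {
        "Resort A": {"alpine", "spa"},
        "Resort B": {"beginner", "spa"},
    },
}


def search(wanted_tags, trusted_authorities):
    """Return resorts carrying every wanted tag, judged only by trusted authorities.

    Assumption: a tag counts for a resort if at least one trusted
    authority applied it.
    """
    combined = {}
    for authority in trusted_authorities:
        for resort, tags in TAGS_BY_AUTHORITY.get(authority, {}).items():
            combined.setdefault(resort, set()).update(tags)
    wanted = set(wanted_tags)
    return sorted(resort for resort, tags in combined.items() if wanted <= tags)


# The same query gives different answers depending on whom you trust.
print(search(["beginner", "spa"], ["review_sites"]))
print(search(["beginner", "spa"], ["tour_operators"]))
print(search(["beginner", "spa"], ["review_sites", "tour_operators"]))
```

With this sample data, the first search finds nothing, the second finds only Resort B, and the third finds both resorts, which is the whole point: the classification belongs to the authorities, and the user picks among them.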


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • ski Date: February 7th, 2011 dw

2 Comments »

January 10, 2011

Visualizing Wikipedia deletions

Notabilia has visualized the hundred longest discussion threads at Wikipedia that resulted in the deletion of an article and the hundred that did not. The visualized threads take on shapes depending on whether the discussion was controversial, swinging, or unanimous. For those whose brains can process visualized information (as mine cannot), you will undoubtedly learn much. For the rest of us: Oooooh, pretty!

They’ve posted some other analyses as well. For example, “The analysis [pdf] of a large sample of AfD discussions (200K discussions that took place between November 2002 and July 2010) suggests that the largest part of these discussions ends after only a few recommendations are expressed.” And: “Delete decisions tend to be fairly unanimous. In contrast, we found many Keep decisions resulting from a discussion that leaned towards deletion…”


Categories: everythingIsMiscellaneous, too big to know Tagged with: 2b2k • everythingIsMiscellaneous • wikipedia Date: January 10th, 2011 dw

1 Comment »

January 9, 2011

Near- and far-in-laws

Keith Dawson has a suggestion for disambiguating “in-law,” which can refer to (for example) your wife’s brother or your sister’s husband. He’s got near-in-laws and far-in-laws. Very handy.

And it raises the question of why English doesn’t already have an easy way of making this distinction. Are we so binary about our family relations that we just don’t give a damn-in-law?


Categories: everythingIsMiscellaneous Tagged with: everythingIsMiscellaneous Date: January 9th, 2011 dw

7 Comments »

December 28, 2010

[2b2k] Citizen scientists

Alex Wright has an excellent article in the New York Times today about the great work being done by citizen scientists. (Alex follows up in his blog with some more worthy citizen science efforts.)

Alex, who I met a few years ago at a conference because we had written books on similar topics — his excellent Glut and my Everything Is Miscellaneous — quotes me a couple of times in the article. The first time, I say that the people who are gathering data and classifying images “are not doing the work of scientists.” Some in the comments have understandably taken issue with that characterization. It’s something I deal with at some length in Too Big to Know. Because of the curtness of the comment, it could easily be taken as dismissive, which was not my intent; these volunteers are making a real contribution, as Alex’s article documents. But, in many of the projects Alex discusses (and that I discuss in my manuscript), the volunteers are doing work for which they need no scientific training. They are doing the work of science — gathering data certainly counts — but not the work of scientists. But that’s what makes it such an exciting time: You don’t need a degree or even training beyond the instructions on a Web page, and you can be part of a collective effort that advances science. (Commenter kc I think makes a good argument against my position on this.)

FWIW, the origins of my participation in the article were a discussion with Alex about why in this age of the amateur it’s so hard to find the sort of serious leap in scientific thinking coming from amateurs. Amateurs drove science more in the 19th century than now. Of course, that’s not an apples-to-apples comparison because of the professionalization of science in the 20th century. Also, so much of basic science now requires access to equipment far too expensive for amateurs. (Although that’s scarily not the case for gene sequencers.)


Categories: science, too big to know Tagged with: 2b2k • citizen science • everythingIsMiscellaneous • science Date: December 28th, 2010 dw

2 Comments »

December 11, 2010

Ordering your video store

Roger Beebe has posted a fascinating, polemical explanation of the thinking behind the way he physically arranged his Gainesville, Florida video store. He takes educating his visitors as an obligation of the layout. Here’s an excerpt:

There’s a pedagogy to this arrangement, and it’s clearly making a case for a certain kind of engagement with the cinema and with film history. The prevailing first-order logic is one of national cinemas as a way of thinking about large groups of films together. Within those national cinemas, there’s a decidedly auteurist bent, privileging works by significant directors (toward the start of each section) followed by non-auteurist works from those regions. US films get further important subdivisions based on the mode of production and circulation; they are subdivided into Sub-indie (underground, avant garde, etc.), Independent (following the standard nomenclature of that fraught area), and Hollywood. Hollywood is then subdivided further between auteurist works (with a breakdown stretching from Woody Allen to Robert Zemeckis) and non-auteurist works that are then subdivided by genre.

An additional strategy, and this may be more ideological than pedagogical, is the arrangement of sections from the front of the store to the rear. The store has a narrow central corridor with small alcoves of videos along each side. We consciously front-loaded the store with documentaries on one side and our Sub-indie section on the other. The more mainstream Hollywood fare is pushed much further back in the store, forcing anyone seeking out those titles to run the gauntlet past all of these alternative cinemas.

Roger makes reference to Everything Is Miscellaneous throughout, a book about which he has at best mixed feelings. He understandably takes it as an unabashed, “boosterish” argument in favor of the multiple categorizations and sortings that the digitizing and networking of information enables. But, I disagree with part of his interpretation of the book. I did not intend to argue against careful organization of physical goods (the prologue waxes enthusiastic about Staples’ store layout) or against the value of expertly curated collections. Rather, we benefit on the Web from having expert curations as well as curations by multiple, multiple experts, both professional and amateur. Mortimer Adler’s Great Books would have been a welcome addition to the Web, but it would have been only one of many “playlists.” The fact that Adler’s list would have had to compete with those of UnNamed_Teenager at Amazon is a serious problem on the Net, but it’s balanced by the unavoidable harm done during the Reign of Paper by the impact Adler’s list had on which books were actually printed and placed in libraries.

Of course, I’m responsible for not having communicated my intentions adequately.


Categories: everythingIsMiscellaneous Tagged with: everythingIsMiscellaneous Date: December 11th, 2010 dw

