April 5, 2010

Shirky’s myth of complexity

Clay Shirky has given us a surprising number of Internet myths. And by this I mean not falsehoods but the opposite: Broad, illuminating ways of making sense of what’s going on. For example, Clay’s post about the power law distribution of links in the blogosphere (based on research by Cameron Marlow) changed how we view authority, fame, and success in the Web ecosystem, and provided the structure within which Chris Anderson could point to the Long Tail. And Clay’s Ontology Is Overrated made clear that a change in how we categorize our world affects very real power relationships; that essay was highly influential, including on my own Everything Is Miscellaneous.

Clay’s new post — The Collapse of Complex Business Models — gives us a broad way of understanding why those who used to provide us with content will not be the ones who give us content in the future…and why they cannot fathom why not.


September 8, 2009

Google Books metadata: Google responds

There’s a terrific colloquy between Google and Geoff Nunberg in response to Geoff’s critique of Google’s handling of the metadata attached to the books Google is digitizing (which I blogged about here). It’s fascinating for its content, but also very cool as a conversation between a company and its market. Of course, it would have been even better if Google had initiated this conversation when it started its digitization project.

September 6, 2009

Data and metadata: Together again

Terry Jones has an excellent post that lists the problems introduced by maintaining a hard distinction between metadata and data.

Terry cites Everything Is Miscellaneous (thanks, Terry), which argues that the distinction, which is hard-coded in the Age of Databases, becomes a merely functional difference in the Age of Messy Links: Metadata is what you know and data is what you’re looking for. For example, the year of a CD is metadata about the CD if you know the year a Bob Dylan CD came out but you don’t remember the title, and the title can be metadata if you know the title but want to find the year. And in both cases, it could all be metadata in your search for lyrics.

This is all very squishy and messy because the distinction is, as Terry says, artificial. It comes from thinking about experience as content that gets processed, as if we worked the way computers do. More exactly, it comes from thinking about experience as a set of Experience Atoms that then have to be assembled; metadata are the labels that tell you that Atom A goes into Atom Z. But experience is far more like language than like particle physics or Ikea assembly instructions. And that’s for a very good reason: linguistic creatures’ experience cannot be understood apart from language. Language doesn’t neatly separate into content and meta-content. It all comes together and it’s all intertwingled. Language is so very non-atomic that it makes atoms realize how lonely they’ve been.

That doesn’t mean that computer software that separates metadata from data is useless. Lord knows I love a good database. But it does mean that computer software that can treat anything as metadata, depending on what we’re trying to do, opens up some interesting possibilities…
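To make the "metadata is what you know, data is what you're looking for" idea concrete, here is a minimal Python sketch. The tiny catalog, the field names, and the lookup function are all hypothetical, made up for illustration and not taken from Terry's post or from any real system. The point is only that no field is fixed as "the data": whichever fields you already know act as metadata for the one you want.

```python
# Hypothetical catalog; the records and field names are illustrative only.
ALBUMS = [
    {"artist": "Bob Dylan", "title": "Blood on the Tracks", "year": 1975},
    {"artist": "Bob Dylan", "title": "Desire", "year": 1976},
    {"artist": "Joni Mitchell", "title": "Blue", "year": 1971},
]


def lookup(records, want, **known):
    """Return the `want` field of every record that matches all `known` fields.

    Whatever is passed in `known` plays the role of metadata for this query;
    whatever is named by `want` plays the role of data.
    """
    return [
        record[want]
        for record in records
        if all(record.get(field) == value for field, value in known.items())
    ]


# The year is the metadata and the title is the data...
titles_from_year = lookup(ALBUMS, "title", artist="Bob Dylan", year=1975)

# ...and here the same two fields swap roles.
years_from_title = lookup(ALBUMS, "year", title="Desire")
```

A database schema would typically decide up front which columns are keys. In this sketch the decision is made fresh with each query, which is the merely functional distinction described above.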

Evolution of Evolution

Ben Fry posts an amazing visualization of the changes in the six editions of Darwin’s Origin of Species, based on meticulous work done by Dr. John van Wyhe and others. From Ben’s introductory text:

The second edition, for instance, adds a notable “by the Creator” to the closing paragraph, giving greater attribution to a higher power. In another example, the phrase “survival of the fittest” — usually considered central to the theory and often attributed to Darwin — instead came from British philosopher Herbert Spencer, and didn’t appear until the fifth edition of the text.

September 4, 2009

The price of free law

The latest Radio Berkman episode has me interviewing Steve Schultze about his RECAP project that posts public domain legal records that otherwise you’d have to pay to access. And the federal courts are not all that happy about it.

Google Books metadata meta-wreck

Geoff Nunberg has a fantastic post warning about the poor quality of the metadata attached to the books Google is scanning into its soon to be dominant-to-the-point-of-monopoly digital library. Apparently, the attempt to gather metadata automatically from the scans has resulted in the introduction of legions of errors. But the real problems are, as Geoff points out, that Google seems not to have a plan for dealing with this problem and that it has not opened up the metadata design process.

August 31, 2009

Copyright’s creative disincentive

Tucows is participating in the Canadian copyright consultation process. Rather than submitting a comment written in the usual lawyerly prose, Elliot Noss, Tucows’ CEO, asked me to write up something about copyright in my usual imprecise and incoherent prose. I like Elliot a lot, and I care about copyright, so I wrote about the argument that without strong copyright protection, creators won’t have an incentive to create. The piece is now posted… [The next day: I absolutely should have mentioned that this was a commissioned piece. I.e., Elliot paid me to write something, and posted it unaltered.]

August 26, 2009

Encyclopedia of Life – Now by Humans!

The Encyclopedia of Life is encouraging citizen contributions to its expert-vetted pages, so far with what seem like excellent results. There’s a good article about this at Science Daily. After two years, they’ve got 150,000 species pages underway, with 1.4 million stubs awaiting drafting.

August 19, 2009

Dilbert goes miscellaneous

Amusing Dilbert today, for those who can’t resist a good taxonomy joke. (Thanks for the tip, Helena!)

August 14, 2009

Search Pidgin

I know I’m not the only one who’s finding WolframAlpha sometimes frustrating because I can’t figure out the magic words to use to invoke the genie. To give just one example, I can’t figure out how to see the frequency of the surnames Kumar and Weinberger compared side-by-side in WolframAlpha’s signature fashion. It’s a small thing because “surname Kumar” and “surname Weinberger” will get you info about each individually. But over and over, I fail to guess the way WolframAlpha wants me to phrase the question.

Search engines are easier because they have already trained us how to talk to them. We know that we generally get the same results whether or not we use stop words (“when,” “the,” etc.) and question marks. We eventually learn that quoting a phrase searches for exactly that phrase. We may even learn that in many engines, putting a dash in front of a word excludes pages containing it from the results, or that we can do marvelous and magical things with prefaces that end in a colon, such as site: and define:. We also learn the semantics of searching: If you want to find out the name of that guy who’s Ishmael’s friend in Moby-Dick, you’ll do best to include some words likely to be on the same page, so “What was the name of that guy in Moby-Dick who was the hero’s friend?” is way worse than “Moby-Dick harpoonist.” I have no idea what the curve of query sophistication looks like, but most of us have been trained to one degree or another by the search engines who are our masters and our betters.
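For the curious, the bits of syntax mentioned above (quoted phrases, a leading dash, and colon prefaces) can be pulled apart with a few lines of Python. This is only a rough sketch under my own assumptions: the function name, the regular expression, and the shape of the result are hypothetical, it handles straight ASCII quotes only, and it does not reproduce the grammar of any actual search engine, each of which has its own rules.

```python
import re

# Either an optionally negated "quoted phrase" or a run of non-space characters.
TOKEN = re.compile(r'(-?)"([^"]+)"|(\S+)')


def parse_query(query):
    """Split a search string into plain terms, phrases, exclusions, and operators."""
    parsed = {"terms": [], "phrases": [], "excluded": [], "operators": {}}
    for negated, phrase, word in TOKEN.findall(query):
        if phrase:
            # "exact phrase" or -"excluded phrase"
            target = parsed["excluded"] if negated else parsed["phrases"]
            target.append(phrase)
        elif word.startswith("-") and len(word) > 1:
            # -word excludes pages containing the word
            parsed["excluded"].append(word[1:])
        elif ":" in word and not word.startswith(":") and not word.endswith(":"):
            # name:value prefaces such as site: or define:
            name, value = word.split(":", 1)
            parsed["operators"][name] = value
        else:
            parsed["terms"].append(word)
    return parsed


example = parse_query('Moby-Dick "hero\'s friend" -film site:example.org')
```

Note that the dash inside Moby-Dick is left alone, since only a leading dash counts as an exclusion here. The domain example.org is a placeholder.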

In short, we’re being taught a pidgin language — a simplified language for communicating across cultures. In this case, the two cultures are human and computers. I only wish the pidgin were more uniform and useful. Google has enough dominance in the market that its syntax influences other search engines. Good! But we could use some help taking the next step, formulating more complex natural language queries in a pidgin that crosses application boundaries, and that isn’t designed for standard database queries.

Or does this already exist?



