Let’s say you have the weird desire to see if a particular online news site is producing news articles that support the Schema.org standard. I just posted a tiny little site (even uglier than usual) that lets you search for a particular news media site. It will return the items on that site that have been classified by that site as NewsArticle items in the Schema.org standard.
Thanks to a suggestion from Dan Brickley, it’s using a custom search engine from Google. One of the parameters permitted by custom search engines is to only return items that are one of Schema.org’s types. (I’m sure I’m messing up the standards lingo.) All I’ve done is specify NewsArticle as the type, and prepend “site:” to whatever search you’re doing, saving you five keystrokes. You’re welcome!
If you get back a bunch of articles, then presumably the site is supporting Schema.org. I think.
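For the curious, the “site:” part of the trick can be sketched in a few lines of Python. This is not the code behind my little site, just an illustration: the function name `site_query` is made up, and accepting full URLs as well as bare domains is my own assumption about what people type. The restriction to the Schema.org type lives in the custom search engine’s configuration, so it doesn’t show up here.

```python
from urllib.parse import urlparse


def site_query(user_input: str) -> str:
    """Turn whatever the user typed into a site:-restricted query.

    Hypothetical helper for illustration only.
    """
    text = user_input.strip()
    # Assumption: users may paste a full URL, so keep only the host part.
    if "://" in text:
        text = urlparse(text).netloc
    # If a bare domain came with a path, drop everything after the first slash.
    text = text.split("/", 1)[0]
    return f"site:{text}"


print(site_query("example.com"))
print(site_query("https://www.example.com/news"))
```

Whatever comes out is what gets handed to the search engine as the query.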
Gary Price from Infodocket is moderating a panel on what’s new in search. It’s a panel of vendors.
NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.
The first speaker is from Blekko.com, which he says “is thought of as the third search engine” in the US market. It features info from authoritative sources. “You don’t want your health information to come from some blog.” When you search for “kate spade” you get authenticated Kate Spade fashion stuff. Slashtags let you facet within a topic, based on expert curation. Users can create their own slashtags. At /webgrep you can ask questions about the corpus; if a question gets upvoted, the techies at Blekko will answer it.
Weblib.com describes itself on its site as “Natural Language Processing Tools and Customizable Knowledge Bases for Semantic Search and Discovery Applications.” Thomas talks about OntoFind and semantic search, which is a search that produces “meaningful results even when the retrieved pages” contain none of the search terms [latent semantic search!]. He points to Google’s Freebase, which has info about 500M entities and their relationships. In a week you’ll be able to try OntoFind at ezu.com, I believe. Searching for big brother and privacy first asks you to disambiguate and then pulls together results.
ScienceScape.com is designed to help scientists follow science. It diagrams publications on a topic, and applies article-level metrics. It’s focused on the undergrad and graduate research markets. It integrates genomic knowledge plus much more. It lets you see the history of science top down, and browse e.g. by date. You can share what you’ve found.
There’s a fascinating post at ReadWriteWeb by Scott M. Fulton III about the effect that “social signals,” such as posts by people within your Google+ Circles, have on search results. It is not an easy article to skim :) Here’s the conclusion:
It is obvious from our test so far, which spanned a 48-hour period, that there may be an unintended phenomenon of the infusion of social signals into all Google searches: the reduction in visibility in search results of the original article that generated all the discussion in the first place. This may have a counter-balancing effect on the popularity of any article…
Soo Young Rieh is an associate professor at the University of Michigan School of Information. She recently finished a study (funded in part by MacArthur) on how people assess the credibility of sources when they are just searching for information and when they are actually posting information. Her study didn’t focus on a particular age or gender, and found [SPOILER] that we don’t take extra steps to assess the credibility of information when we are publishing it.
Me neither. Relevancy is not an objective criterion. And too much transparency allows spammers to game the system. I would like to be assured that companies aren’t paying search engine companies to have their results ranked higher (unless the results are clearly marked as pay-for-position, which Google does but not clearly enough).
Ann Hunt is describing Primal’s ability to let people create what she calls “idiosyncratic ontologies.” It wants to let two people have differing tags and ontologies about the same objects, and see the shared and social point of view. From the Primal site: “The Primal Semantics API helps users find material of interest in a larger collection of information. It organizes responses into hierarchies of concepts, with broad topics leading to more specific ones.” Ann stresses that it’s cool to bring together individual points of view and semantic networks.
Bob Smith of ISYS Search Software says that most people don’t find what they’re looking for on Google the first time they search. Google is an ad company, not a search company, so “you shouldn’t buy your next search service from an ad company.” Today, we need search everywhere, for everything. Bob then pitches us on ISYS.
Brian Cheek of TigerLogic says he’s in the search enhancement business. Links make problems for searches, he says. Google instant preview helps a little, he says, if it’s for a site you’ve been to already. He focuses on YoLink, which provides more intelligent searching and browsing within particular domains. It’s a browser add-on that’s available for incorporation into apps by developers. YoLink mines links, extracting content from them based on your key terms. You can check off the returns of interest and publish them directly into a Google Doc or tweet them. You can explore a set of links without having to browse to each of them.
Categories: misc Tagged with: search Date: November 18th, 2010 dw
The Oxford English Dictionary has announced that it will not print new editions on paper. Instead, there will be Web access and mobile apps.
According to the article in the Telegraph, “A team of 80 lexicographers has been working on the third edition of the OED – known as OED3 – for the past 21 years.”
It has been a long trajectory toward digitization for the OED. In the 1990s, the OED’s desire to produce a digital version (remember books on CD?) stimulated search engine innovation. To search the OED intelligently, the search engine would have to understand the structure of entries, so that it could distinguish the use of a word as that which is being defined, the use of it within a definition, the use of it within an illustrative quote, etc. SGML was perfect for this type of structure, and the Open Text SGML search engine came out of that research. Tim Bray was one of the architects of that search engine, and went on to become one of the creators of XML. I’m going to assume that some of what Tim learned from the OED project was formative of his later thinking… (Disclosure: I worked at Open Text in the mid-1990s.)
On the other hand, initially, the OED didn’t want to attribute the origins of the word “blog” to Peter Merholz because he coined it in his own blog, and the OED would only accept print attributions. (See here, too.) The OED eventually got over this prejudice for printed sources, however, and gave Peter proper credit.
I know I’m not the only one who’s finding WolframAlpha sometimes frustrating because I can’t figure out the magic words to use to invoke the genii. To give just one example, I can’t figure out how to see the frequency of the surnames Kumar and Weinberger compared side-by-side in WolframAlpha’s signature fashion. It’s a small thing because “surname Kumar” and “surname Weinberger” will get you info about each individually. But over and over, I fail to guess the way WolframAlpha wants me to phrase the question.
Search engines are easier because they have already trained us how to talk to them. We know that we generally get the same results whether we use the stop words “when,” “the,” etc. and question marks or not. We eventually learn that quoting a phrase searches for exactly that phrase. We may even learn that in many engines, putting a dash in front of a word excludes pages containing it from the results, or that we can do marvelous and magical things with prefaces that end in a colon, such as site: and define:. We also learn the semantics of searching: If you want to find out the name of that guy who’s Ishmael’s friend in Moby-Dick, you’ll do best to include some words likely to be on the same page, so “What was the name of that guy in Moby-Dick who was the hero’s friend?” is way worse than “Moby-Dick harpoonist.” I have no idea what the curve of query sophistication looks like, but most of us have been trained to one degree or another by the search engines who are our masters and our betters.
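To make that training concrete, here is a rough Python sketch of the handful of conventions I just listed: keywords, an exact phrase in quotes, a dash to exclude a word, and a site: preface. The function `build_query` and its parameter names are invented for this example, and not every engine honors every one of these operators.

```python
def build_query(terms, phrase=None, exclude=(), site=None):
    """Assemble a query string from common search-engine conventions.

    Hypothetical helper: names and parameters are illustrative only.
    """
    parts = list(terms)                 # plain keywords likely to be on the page
    if phrase:
        parts.append(f'"{phrase}"')     # quoting asks for exactly this phrase
    for word in exclude:
        parts.append(f"-{word}")        # leading dash excludes pages with the word
    if site:
        parts.append(f"site:{site}")    # colon preface restricts results to one site
    return " ".join(parts)


print(build_query(["Moby-Dick", "harpoonist"], exclude=["movie"], site="example.org"))
```

The point is how little grammar there is: a few punctuation marks carry everything we’ve been taught to say.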
In short, we’re being taught a pidgin language: a simplified language for communicating across cultures. In this case, the two cultures are humans and computers. I only wish the pidgin were more uniform and useful. Google has enough dominance in the market that its syntax influences other search engines. Good! But we could use some help taking the next step, formulating more complex natural language queries in a pidgin that crosses application boundaries, and that isn’t designed for standard database queries.
Today, for the very first time in my experience, The Encyclopedia Britannica was the #1 result at Google for a query.
It’s good to see the EB making progress with its online offering, but I’m actually puzzled in this case. The query was “horizontal hold” (without quotes), and the EB page that’s #1 is pretty much worthless. It’s a stub that gives a snippet of the article on the topic, but the snippet oddly begins with definition #4. The page then points us into actual articles in the EB, but they’re articles you have to pay for (although the EB offers a “no risk” free trial).
So, how did Google’s special sauce float this especially unhelpful page to the surface? And why isn’t there a Wikipedia page on “horizontal hold”? And does this mean that if there’s no Wikipedia page for a topic, Google gets the vapors and just doesn’t know what to recommend? Nooooo………