Joho the Blog » search

March 26, 2015

Searching for news media that support Schema.org

Let’s say you have the weird desire to see if a particular online news site is producing news articles that support the Schema.org standard. I just posted a tiny little site (even uglier than usual) that lets you search a particular news media site. It returns the items on that site that the site itself has classified as NewsArticle in the Schema.org standard.

Thanks to a suggestion from Dan Brickley, it’s using a custom search engine from Google. One of the options that custom search engines permit is to return only items that are marked as one of Schema.org’s types. (I’m sure I’m messing up the standards lingo.) All I’ve done is specify NewsArticle as the type and prepend “site:” to whatever search you’re doing, saving you five keystrokes. You’re welcome!
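For the curious, here is a rough sketch in Python of how such a query might be put together. The post does not show the site's actual code, so all of this is assumption: the helper names `build_site_query` and `build_search_url` are made up, the `more:pagemap:` operator is assumed to be how a Google custom search engine expresses a structured-data type filter, and the engine ID and API key are placeholders. The sketch only builds the request URL; it sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint for Google's Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_site_query(site: str, schema_type: str = "newsarticle") -> str:
    """Prefix the site with "site:" and append a type restriction.

    Assumption: the type filter is written as "more:pagemap:<type>".
    The post only says that the type is passed as a parameter.
    """
    site = site.strip()
    return f"site:{site} more:pagemap:{schema_type}"


def build_search_url(site: str, engine_id: str, api_key: str) -> str:
    """Build (but do not send) a search request URL for one news site."""
    params = {
        "key": api_key,    # placeholder credential
        "cx": engine_id,   # placeholder custom search engine ID
        "q": build_site_query(site),
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("example.com", "ENGINE_ID", "API_KEY"))
```

If the response for a given site lists articles, that is a hint (not proof) that the site marks up its pages with the Schema.org type.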

If you get back a bunch of articles, then presumably the site is supporting Schema.org. I think.

Categories: programs Tagged with: metadata • programs • schema.org • search • shorenstein Date: March 26th, 2015 dw

October 22, 2012

[internet librarian] Search tools

Gary Price from Infodocket is moderating a panel on what’s new in search. It’s a panel of vendors.

NOTE: Live-blogging. Getting things wrong. Missing points. Omitting key information. Introducing artificial choppiness. Over-emphasizing small matters. Paraphrasing badly. Not running a spellpchecker. Mangling other people’s ideas and words. You are warned, people.

The first speaker is from Blekko.com, which he says “is thought of as the third search engine” in the US market. It features info from authoritative sources. “You don’t want your health information to come from some blog.” When you search for “kate spade” you get authenticated Kate Spade fashion stuff. Slashtags let you facet within a topic, based on expert curation. Users can create their own slashtags. At /webgrep you can ask questions about the corpus; if a question gets upvoted, the techies at Blekko will answer it.

Weblib.com describes itself on its site as “Natural Language Processing Tools and Customizable Knowledge Bases for Semantic Search and Discovery Applications.” Thomas talks about OntoFind and semantic search, which is a search that produces “meaningful results even when the retrieved pages” contain none of the search terms [latent semantic search!]. He points to Google’s Freebase, which has info about 500M entities and their relationships. In a week you’ll be able to try OntoFind at ezu.com, I believe. Searching for big brother and privacy first asks you to disambiguate and then pulls together results.

ScienceScape.com is designed to help scientists follow science. It diagrams publications on a topic, and applies article-level metrics. It’s focused on the undergrad and graduate research markets. It integrates genomic knowledge plus much more. It lets you see the history of science top down, and browse e.g. by date. You can share what you’ve found.

[I couldn’t hear the Q&A well enough to blog it.]

Categories: liveblog Tagged with: internetlibrarian • liveblog • search Date: October 22nd, 2012 dw

January 7, 2012

Does Google’s use of ‘social signals’ break the Web?

There’s a fascinating post at ReadWriteWeb by Scott M. Fulton III about the effect that “social signals,” such as posts by people within your Google+ Circles, have on search results. It is not an easy article to skim :) Here’s the conclusion:

It is obvious from our test so far, which spanned a 48-hour period, that there may be an unintended phenomenon of the infusion of social signals into all Google searches: the reduction in visibility in search results of the original article that generated all the discussion in the first place. This may have a counter-balancing effect on the popularity of any article…

Categories: social media, too big to know Tagged with: 2b2k • google • search • social networks Date: January 7th, 2012 dw

October 7, 2011

[2b2k] How we assess credibility

Soo Young Rieh is an associate professor at the University of Michigan School of Information. She recently finished a study (funded in part by MacArthur) on how people assess the credibility of sources when they are just searching for information and when they are actually posting information. Her study didn’t focus on a particular age or gender, and found [SPOILER] that we don’t take extra steps to assess the credibility of information when we are publishing it.

Categories: education, too big to know Tagged with: 2b2k • authority • credibility • literacy • search Date: October 7th, 2011 dw

January 24, 2011

Grimmelmann on search neutrality

James Grimmelmann, whose writing on the Google Books settlement I’ve found helpful, has written an article about the incoherence of the concept of “search neutrality”: “the idea that search engines should be legally required to exercise some form of even-handed treatment of the websites they rank.” (He blogs about it here.) He finds eight different possible meanings of the term, and doesn’t think any of them hold up.

Me neither. Relevancy is not an objective criterion. And too much transparency allows spammers to game the system. I would like to be assured that companies aren’t paying search engine companies to have their results ranked higher (unless the results are clearly marked as pay-for-position, which Google does but not clearly enough).

Categories: misc Tagged with: grimmelmann • search • search neutrality Date: January 24th, 2011 dw

December 17, 2010

The Annals of Searching: Cluetrain circa 1505

Confine your search at Google Books to only the 19th-century Cluetrain references, and you get four hits. In fact, the earliest reference to Cluetrain indexed by Google Books was in the 1505 business best-seller Extravagantes com[m]unes, in which appears the sentence “Markets are conversations…with that lying bastard Roger the Offal Merchant.”

Categories: cluetrain Tagged with: cluetrain • google • humor • search Date: December 17th, 2010 dw

November 18, 2010

[defrag] Semantic 10 minute sessions

Ann Hunt is describing Primal’s ability to let people create what she calls “idiosyncratic ontologies.” It wants to let two people have differing tags and ontologies about the same objects, and see the shared and social point of view. From the Primal site: “The Primal Semantics API helps users find material of interest in a larger collection of information. It organizes responses into hierarchies of concepts, with broad topics leading to more specific ones.” Ann stresses that it’s cool to bring together individual points of view and semantic networks.

Bob Smith of ISYS Search Software says that most people don’t find what they’re looking for on Google the first time they search. Google is an ad company, not a search company, so “you shouldn’t buy your next search service from an ad company.” Today, we need search everywhere, for everything. Bob then pitches us on Isys.

Brian Cheek of TigerLogic says he’s in the search enhancement business. Links make problems for searches, he says. Google Instant Preview helps a little, he says, if it’s for a site you’ve been to already. He focuses on YoLink, which provides more intelligent searching and browsing within particular domains. It’s a browser add-on that’s available for incorporation into apps by developers. YoLink mines links, extracting content from them based on your key terms. You can check off the returns of interest and publish them directly into a Google Doc or tweet them. You can explore a set of links without having to browse to each of them.

Categories: misc Tagged with: search Date: November 18th, 2010 dw

August 14, 2009

Search Pidgin

I know I’m not the only one who’s finding WolframAlpha sometimes frustrating because I can’t figure out the magic words to use to invoke the genii. To give just one example, I can’t figure out how to see the frequency of the surnames Kumar and Weinberger compared side-by-side in WolframAlpha’s signature fashion. It’s a small thing because “surname Kumar” and “surname Weinberger” will get you info about each individually. But over and over, I fail to guess the way WolframAlpha wants me to phrase the question.

Search engines are easier because they have already trained us how to talk to them. We know that we generally get the same results whether or not we use stop words (“when,” “the,” etc.) and question marks. We eventually learn that quoting a phrase searches for exactly that phrase. We may even learn that in many engines, putting a dash in front of a word excludes pages containing it from the results, or that we can do marvelous and magical things with prefaces that end in a colon, such as site: and define:. We also learn the semantics of searching: If you want to find out the name of that guy who’s Ishmael’s friend in Moby-Dick, you’ll do best to include some words likely to be on the same page, so “What was the name of that guy in Moby-Dick who was the hero’s friend?” is way worse than “Moby-Dick harpoonist.” I have no idea what the curve of query sophistication looks like, but most of us have been trained to one degree or another by the search engines who are our masters and our betters.
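Here is a minimal sketch of those conventions in Python, assuming the common syntax described above: quoted phrases, a leading dash to exclude a word, and a site: prefix. The function name `compose_query` and its parameters are hypothetical, and individual engines may treat these operators differently.

```python
from typing import Iterable, Optional


def compose_query(
    terms: Iterable[str],
    phrases: Iterable[str] = (),
    exclude: Iterable[str] = (),
    site: Optional[str] = None,
) -> str:
    """Assemble a search string from the pieces of the search "pidgin"."""
    parts = list(terms)
    # Quoting a phrase asks for exactly that phrase.
    parts += [f'"{phrase}"' for phrase in phrases]
    # A leading dash asks the engine to leave out pages containing the word.
    parts += [f"-{word}" for word in exclude]
    # A colon-terminated prefix such as site: narrows the search to one site.
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


if __name__ == "__main__":
    # Keywords likely to appear on the page, rather than a full question.
    print(compose_query(["Moby-Dick", "harpoonist"]))
    print(compose_query(["whale"], phrases=["call me Ishmael"],
                        exclude=["movie"], site="example.org"))
```

The first call mirrors the Moby-Dick example: a couple of words likely to be on the target page instead of a natural-language question.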

In short, we’re being taught a pidgin language, a simplified language for communicating across cultures. In this case, the two cultures are humans and computers. I only wish the pidgin were more uniform and useful. Google has enough dominance in the market that its syntax influences other search engines. Good! But we could use some help taking the next step, formulating more complex natural language queries in a pidgin that crosses application boundaries, and that isn’t designed for standard database queries.

Or does this already exist?

Tags: search pidgin nlp natural_language_processing google everything_is_miscellaneous

Categories: Uncategorized Tagged with: everythingIsMiscellaneous • everything_is_miscellaneous • google • metadata • natural_language_processing • nlp • pidgin • search Date: August 14th, 2009 dw

July 19, 2009

Britannica: #1 at Google

Today, for the very first time in my experience, The Encyclopedia Britannica was the #1 result at Google for a query.

It’s good to see the EB making progress with its online offering, but I’m actually puzzled in this case. The query was “horizontal hold” (without quotes), and the EB page that’s #1 is pretty much worthless. It’s a stub that gives a snippet of the article on the topic, but the snippet oddly begins with definition #4. The page then points us into actual articles in the EB, but they’re articles you have to pay for (although the EB offers a “no risk” free trial).

So, how did Google’s special sauce float this especially unhelpful page to the surface? And why isn’t there a Wikipedia page on “horizontal hold”? And does this mean that if there’s no Wikipedia page for a topic, Google gets the vapors and just doesn’t know what to recommend? Nooooo………

[Tags: google wikipedia encyclopedia_britannica britannica search horizontal_hold ]

Categories: misc Tagged with: britannica • encyclopedia_britannica • everythingIsMiscellaneous • google • horizontal_hold • search • wikipedia Date: July 19th, 2009 dw

July 17, 2009