Joho the Blog
An Entry from the Archives


May 01, 2004

Latent semantic indexing explained

In response to my blogging about pages not saying what they're about, Hanan Cohen points us to an exceptionally well-written article by Clara Yu, John Cuadrado, Maciej Ceglowski and J. Scott Payne about latent semantic indexing (not to be confused with latex cement and indenting).

Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant... Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.

When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don't contain the keyword at all.

For example: "In an AP news wire database, a search for Saddam Hussein returns articles on the Gulf War, UN sanctions, the oil embargo, and documents on Iraq that do not contain the Iraqi president's name at all."
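
To make the idea concrete, here is a minimal sketch of an LSI-style search in Python with NumPy. It is not the article's code. The five-document corpus, the function names, and the choice of k=2 latent dimensions are all made up for illustration, and the raw word counts stand in for whatever term weighting a real system would use. It builds a term-document matrix, takes a truncated singular value decomposition, projects the query and the documents onto the top k left singular vectors, and compares them by cosine similarity.

```python
import numpy as np

# Toy corpus (hypothetical, not from the article): three Iraq-related
# documents and two unrelated ones. Only document 0 names Saddam Hussein.
DOCS = [
    "saddam hussein iraq gulf war",
    "iraq gulf war oil embargo",
    "un sanctions iraq oil embargo",
    "chicken recipe oven roast",
    "roast chicken dinner recipe",
]


def build_term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i, j] counts term i in document j."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.split():
            matrix[index[word], j] += 1.0
    return vocab, matrix


def to_term_vector(text, vocab):
    """Count the text's words over the corpus vocabulary; unknown words are dropped."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def keyword_search(query, docs):
    """Plain keyword match: indices of documents containing every query word."""
    words = query.split()
    return [j for j, doc in enumerate(docs) if all(w in doc.split() for w in words)]


def lsi_search(query, docs, k=2):
    """Rank documents by cosine similarity to the query in a k-dimensional latent space."""
    vocab, matrix = build_term_document_matrix(docs)
    u, _singular_values, _vt = np.linalg.svd(matrix, full_matrices=False)
    basis = u[:, :k]                      # top-k left singular vectors
    doc_coords = basis.T @ matrix         # shape (k, number of documents)
    query_coords = basis.T @ to_term_vector(query, vocab)
    sims = [cosine(query_coords, doc_coords[:, j]) for j in range(len(docs))]
    return sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print("keyword matches:", keyword_search("saddam hussein", DOCS))
    for j, sim in lsi_search("saddam hussein", DOCS):
        print(f"{sim:+.2f}  {DOCS[j]}")
```

On this toy corpus the keyword search matches only the first document, while the LSI search also scores the two Iraq documents that never mention the name well above the chicken recipes. Because the two topics share no words at all, k=2 collapses each topic onto a single axis, so the three Iraq documents come out essentially tied. A real collection would be far messier and would use many more dimensions.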

This is a very well-done article. And it even includes a link to an application of LSI: An automatic essay grader (which is temporarily down because a class is actually using it).

Posted by D. Weinberger at May 1, 2004 08:33 AM


TrackBack

Listed below are links to weblogs that reference Latent semantic indexing explained:

» Happy Birthday Many2Many from Korby Parnell's WebLog

Tracked on May 1, 2004 03:36 PM

» Latent Semantic Indexing from SunSITE @ Tennessee
Clara Yu, John Cuadrado, Maciej Ceglowski and J. Scott Payne. Patterns in Unstructured Data: Discovery, Aggregation, and Visualization. A Presentation to the Andrew W. Mellon Foundation. 2002. From Rajesh Jain's blog, who heard about it from David ...

Tracked on May 4, 2004 01:20 PM

Comments

Findory News uses a similar technique as part of its algorithms for personalizing the news. Based on articles you've read, Findory News is able to search thousands of sources and find news that matches your interests.

Posted by: Greg Linden | May 1, 2004 08:01 PM


Hi!!!!!

How's it going? I loved your blog. I'm from Brazil... and I hope you keep doing this shit...

Posted by: lita | May 1, 2004 08:06 PM


By the way: I did some experiments once and determined that Google completely ignores any words that you put in the HTML meta keywords markup.

I suppose they do this because these fields were getting packed with words to try to influence Google's page ranking. But it foils legitimate attempts to add context information to pages.

Posted by: mark | May 1, 2004 09:37 PM


Google (and others, e.g. Teoma) are known to be heavily interested in these ideas as search improvements. See, e.g.:

Google Acquires Applied Semantics

[Hmm, the anti-spam check didn't let me give that link with google dot com, but did accept it with google.co.uk - a lesson in itself?]

Posted by: Seth Finkelstein | May 2, 2004 12:00 AM


When you post something that isn't yours, you should cite the page where you found it.

Posted by: matsury | May 12, 2004 03:57 AM


One thing to consider is that Google is wedded to the concept of link popularity and the concept it calls PageRank. So any rollout of LSI may have to be balanced with link analysis.

Some people will argue that LSI has already been rolled out in bits. I'm not too sure about that. Google hasn't done a major update in around five months. Makes you wonder what they're up to...

Posted by: Search Engine | October 22, 2004 05:47 PM


My website, http://www.keyword1-keyword2.com/

has some reading about keywords in relation to search engines but it will have to be revised very soon.

Posted by: Search Engine | October 22, 2004 05:49 PM


I just came back from the WebMaster World Conference in LV. I spoke with a man who recently worked for Google. He explained to me that Google is restructuring their algorithm to align more closely with LSI. Take it as you will!

Posted by: Rob W | November 29, 2004 12:25 PM


You can see more information about LSI here: http://www.nadmedia.net. I will be posting updates from all the conferences I have attended, including the WebMaster World Conference in LV.

Posted by: Rob W | November 29, 2004 12:28 PM


Hello:
Is it true that you can create software, based on simple neural networks, that autogenerates pages according to LSI to feed to search engine spiders?

Thanks!

http://www.consejosparatodo.com

Posted by: resaca | December 20, 2005 11:21 AM


Autogenerating pages according to LSI. Interesting enough? Is it true?

Posted by: Tarida | April 4, 2006 05:37 AM


Yes, Latent Semantic Indexing is obviously having a profound effect on keyword research. In fact, we had to completely re-invent keyword research based on LSI changes to various algorithms. Themezoom.com is a result of this LSI knowledge.

Posted by: Russell Wright | May 9, 2006 03:38 PM


Thank you for answering my question. Some great info about LSI and keyword research!

http://www.how2tile.com/

Posted by: Daniel | May 22, 2006 04:19 PM

