Joho the Blog
An Entry from the Archives


October 10, 2004

Bayesian spellchecker?

In the '90s, IBM had a machine translation project that bested rule-based translators simply by using probabilities deduced from analyzing the word usage patterns in a large corpus of manually translated material. (They used the French and English versions of the proceedings of the Canadian Parliament.) Now Bayesian spam filters are all the rage, using word frequency analyses of known spam and non-spam to decide which folder to put a particular message in.

So why not use similar analyses to guide spellchecker alternatives? An analysis of my corpus of documents would reveal that the putative word "cheast" is more likely to be "cheats" if used near the words "game" and "stuck," but more likely to be "chaste" if used near "Britney" and "supposedly." Given how well Bayesian spam filters work - they work really well - I might even want to say that if the spellchecker is, say, 95% confident, it should make the change without asking me, while enabling me to review all the auto-changed words, of course.
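
Here is a rough sketch of how that might work, in Python. It is an illustration under assumptions, not a description of how any real spellchecker does it: the function names (train, posterior, suggest), the five-word context window, the add-one smoothing, and the toy corpus are all made up for the example. It assumes the candidate corrections ("cheats", "chaste") have already been produced by an ordinary spellchecker, and only decides between them from the nearby words.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep only runs of letters/apostrophes (a simplifying assumption).
    return re.findall(r"[a-z']+", text.lower())


def train(corpus, candidates, window=5):
    """Count the words that appear within `window` tokens of each candidate."""
    word_counts = Counter()                # candidate -> times seen in the corpus
    context_counts = defaultdict(Counter)  # candidate -> counts of nearby words
    for doc in corpus:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok in candidates:
                word_counts[tok] += 1
                nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                context_counts[tok].update(nearby)
    return word_counts, context_counts


def posterior(context, word_counts, context_counts):
    """Naive Bayes estimate of P(candidate | context words), add-one smoothed.

    Assumes every candidate was seen at least once during training.
    """
    vocab = {w for counts in context_counts.values() for w in counts}
    total = sum(word_counts.values())
    log_scores = {}
    for cand, n in word_counts.items():
        # +1 reserves a little probability for context words never seen in training.
        denom = sum(context_counts[cand].values()) + len(vocab) + 1
        score = math.log(n / total)  # prior: how common the candidate is overall
        for w in context:
            score += math.log((context_counts[cand][w] + 1) / denom)
        log_scores[cand] = score
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: v / z for c, v in weights.items()}


def suggest(context, word_counts, context_counts, threshold=0.95):
    """Return (best candidate, its probability, whether to auto-correct)."""
    probs = posterior(tokenize(" ".join(context)), word_counts, context_counts)
    best = max(probs, key=probs.get)
    return best, probs[best], probs[best] >= threshold


# Toy corpus standing in for "my corpus of documents" (invented for this sketch).
corpus = [
    "I got stuck in the game so I looked up cheats",
    "Game cheats helped when I was stuck",
    "Britney was supposedly chaste",
    "Supposedly chaste, said Britney",
]
counts = train(corpus, {"cheats", "chaste"})
print(suggest(["game", "stuck"], *counts))
print(suggest(["Britney", "supposedly"], *counts))
```

With a corpus this tiny the model picks the right word in both cases but stays under the 95% bar, so it would still ask before changing anything. The threshold only starts to matter once there is enough text to be confident about.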

I am a genuine admirer of Microsoft Word's spellchecker; in fact, it's one of the things keeping me from switching to Open Office. Not only does Word's UI let me correct errors the way I want, jumping from clicking on a list to editing in context, but its first suggestion is almost always the right one. So, I assume Microsoft has analyzed some generic corpus to get the probabilities right. Why not analyze my corpus, too? And do it folder by folder, across time, and by document type. Why not?

Posted by D. Weinberger at October 10, 2004 02:52 PM

