Joho the Blog
July 02, 2003
I was curious why an email with the subject line "Watch these girls flash their racks for each other" got through POPFile, my Bayesian spam filter. POPFile is remarkably accurate at sniffing out the spammers. In this case, though, the message consists of a small graphic ("sg-titties-graphic"), with my email address encoded in the link so that if I click on it they know I'm alive and horny, and some invisible text that says:
I'm assuming that those words are rare in pornospam and thus successfully fooled POPFile. It'd be nice if POPFile recognized invisible text as invisible...
Posted by D. Weinberger at July 2, 2003 09:45 AM
TrackBack
Listed below are links to weblogs that reference Fooling Bayes:
» Filtering No Silver Bullet from It's Just this Little Chromium Switch Here Tracked on July 2, 2003 12:39 PM
» Don't like spam!? from Blog de Halavais Tracked on July 2, 2003 07:06 PM
Comments
Especially since the invisible text technique has been around since forever as a fool-the-search-engine technique. Detecting and ignoring invisible text was one of the first anti-spam devices we put into the original Lycos indexer.
Posted by: Dennis Doughty | July 2, 2003 10:16 AM
Actually, I think that ignoring hidden text for Bayesian filtering is counterproductive. Those words are giveaways for recognizing spam, too. Do you get many non-spam messages with those words in them? And what are the counts in each bucket for those words? Sure, from time to time a spam gets through, but you just retrain your filter, and if the words are spread evenly enough across all buckets, they become irrelevant to the recognition process.
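To make the bucket-count argument concrete, here is a minimal sketch of a naive Bayes style word score. The bucket names, the counts, and the `log_ratio` helper are all invented for illustration; this is not POPFile's code or data, only an assumption about how such a score could be computed.

```python
import math

# Hypothetical per-bucket word counts, made up for illustration only.
counts = {
    "spam": {"mortgage": 40, "meeting": 1, "heron": 5},
    "inbox": {"mortgage": 1, "meeting": 40, "heron": 5},
}
totals = {bucket: sum(words.values()) for bucket, words in counts.items()}
vocabulary = {word for words in counts.values() for word in words}


def word_probability(word, bucket, smoothing=1.0):
    """P(word | bucket) with additive smoothing so unseen words never give zero."""
    seen = counts[bucket].get(word, 0)
    return (seen + smoothing) / (totals[bucket] + smoothing * len(vocabulary))


def log_ratio(word):
    """Positive means the word points toward spam, negative toward inbox."""
    return math.log(word_probability(word, "spam") / word_probability(word, "inbox"))


for word in ("mortgage", "meeting", "heron"):
    print(word, round(log_ratio(word), 2))
```

In this toy data "heron" appears equally often in both buckets, so its log ratio is zero and it has no influence on the classification, which is the "spread across all buckets" effect described above.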
One special problem is: what _is_ hidden text? Text in the HTML source that is invisible because of its colors? Or because of visibility="false" settings? This would require the filter to parse and understand (render) HTML.
Or is it text in some obscure place your mail program happens to not show? How should the filter recognize this?
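As a rough illustration of why this needs real HTML parsing, here is a minimal sketch that handles only one narrow case: text inside a `<font color=...>` tag whose color string exactly matches the `<body bgcolor=...>` value. The class name `HiddenTextFinder`, the helper `split_hidden`, and the assumed white default background are all hypothetical. It ignores CSS, near-identical colors, tiny fonts, and everything else a renderer would have to consider.

```python
from html.parser import HTMLParser


class HiddenTextFinder(HTMLParser):
    """Separates text whose <font color> equals the <body bgcolor> from the rest."""

    def __init__(self):
        super().__init__()
        self.background = "#ffffff"  # assumed default when no bgcolor is given
        self.color_stack = []        # effective color of each open <font> tag
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = attrs["bgcolor"].lower()
        elif tag == "font":
            # A <font> without a color keeps the color of the enclosing <font>.
            inherited = self.color_stack[-1] if self.color_stack else None
            own = (attrs.get("color") or "").lower()
            self.color_stack.append(own or inherited)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self.color_stack[-1] if self.color_stack else None
        target = self.hidden if current == self.background else self.visible
        target.append(text)


def split_hidden(html):
    """Return (visible_texts, hidden_texts) for the one case handled above."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.visible, finder.hidden


sample = ('<body bgcolor="#FFFFFF"><font color="#ffffff">heron lattice</font>'
          '<p>Click here</p></body>')
print(split_hidden(sample))
```

Even this small case shows the cost: the filter has to track tag nesting and attribute state, and a sender can sidestep it by using a color that is merely close to the background.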
Actually, the existence of hidden text is often itself a giveaway to Bayesian filters, because of the token combinations that construct the state of "hiddenness". What _is_ needed is more intelligent tokenizing in Bayesian filters, as the available tokens often go unrecognized because the tokenizer breaks them down into single characters or pulls them together into one meaningless string.
I have found that POPFile misses only very seldom, so I think it is already quite good at this stuff :-)
Posted by: Georg Bauer | July 2, 2003 10:56 AM
Go into POPFile's control center and look at the message in the History. Click on the Subject of that message and POPFile will show you which words it used and why. There may have been something about the header that carried more weight than those words. POPFile gives weight (in this case negative) to being invisible, I think -- you'll see that there, too, in the Scores list on the right.
Posted by: DanB | July 2, 2003 11:13 AM
POPFile has for some time detected Invisible Ink text and, more recently, the Camouflage technique.
John.
Posted by: John Graham-Cumming | July 21, 2003 04:15 PM