Looks like spammers are trying to fool Bayesian filters by diluting their text. I got a spam today with two lines at the top advertising “cash freedom” or something, and I noticed that the message scrolled quite a bit. After about a hundred line breaks, I found this:
I can now see that I have not been and opposed slavery as a denial of that unity, have also won; but with large and protuberant noses, very furry or very bristly hair,
I did a Google search, and it looks like it’s an excerpt from “The Island of Doctor Moreau.”
It would appear that spammers are trying to “hide” the text of their message amidst “legitimate” text to trick Bayesian filters into letting them through. It worked in this case — Thunderbird did not mark it as spam.
I forwarded it to an account I monitor with Outlook equipped with SpamBayes. It was tagged as “Unsure” with a score of 20%, perilously close to the 15% it needs to be categorized as legitimate (although, to be fair, it was sent from one of my own accounts, and SpamBayes may have incorporated that fact into the score).
I think the key here is the number of spammish tokens compared to the number of legitimate tokens. If they can load a message up with legitimate text — hidden with line breaks or some other cloaking scheme — Bayesian filters will get confused enough to let it through.
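To make the dilution idea concrete, here is a minimal sketch of how padding could drag a score down. It is not how Thunderbird or SpamBayes actually compute their scores; it assumes a simplified naive Bayes product rule, and the per-token probabilities, the `UNKNOWN_PROB` default, and the `spam_score` function are all made up for illustration.

```python
import math

# Hypothetical per-token spam probabilities, standing in for what a
# trained filter might have learned. These numbers are invented.
TOKEN_SPAM_PROB = {
    "cash": 0.95,
    "freedom": 0.80,
    "slavery": 0.10,
    "unity": 0.15,
    "noses": 0.20,
    "hair": 0.25,
}
UNKNOWN_PROB = 0.5  # assumption: tokens never seen before count as neutral


def spam_score(text):
    """Combine per-token probabilities with the naive Bayes product rule."""
    log_spam = 0.0
    log_ham = 0.0
    for token in text.lower().split():
        p = TOKEN_SPAM_PROB.get(token, UNKNOWN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Same as prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


pitch = "cash freedom"
padding = "slavery unity noses hair"

undiluted = spam_score(pitch)
diluted = spam_score(pitch + " " + padding)
print(f"pitch alone: {undiluted:.2f}, pitch plus padding: {diluted:.2f}")
```

With these invented numbers, the two-word pitch scores about 0.99 on its own, and appending just four innocent-looking words pulls it down to about 0.11. Real filters combine evidence more carefully than this, but the sketch shows why a pile of legitimate tokens can outweigh a couple of spammish ones.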
My only comment to the spammers about this is: send me a different chapter next time. In fact, I’ll make you a deal. If you put a couple lines of spam at the top of a message, then, under that, print a complete chapter of “The Island of Doctor Moreau,” I will read your message. Better yet, send me 23 messages, one for each chapter. After reading about penis enlargement 23 times, maybe I’ll consider it.