|
Hackers and Painters: Big Ideas from the Computer AgeI've found, again by trial and error, that .4 is a good number to use. If you've never seen a word before, it is probably fairly innocent; spam words tend to be all too familiar. I treat mail as spam if the algorithm above gives it a probability of more than .9 of being spam. But in practice it would not matter much where I put this threshold, because few probabilities end up in the middle of the range. One great advantage of the statistical approach is that you don't have to read so many spams. Over the past six months, I've read literally thousands of spams, and it is really kind of demoralizing. Norbert Wiener said if you compete with slaves you become a slave, and there is something similarly degrading about competing with spammers. To recognize individual spam features you have to try to get into the mind of the spammer, and frankly I want to spend as little time inside the minds of spammers as possible. But the real advantage of the Bayesian approach, of course, is that you know what you're measuring. Feature-recognizing filters like Spam Assassin assign a spam "score" to email ...» |
Код для вставки книги в блог HTML
phpBB
текст
|
|