spamassassin-users mailing list archives

From "" <>
Subject Re: Suggestion: OCR
Date Wed, 02 Mar 2005 19:12:35 GMT

 --- On Wed 03/02, Matt Kettler < > wrote:

That part is definitely NOT safe in the context of spamassassin... Nonsense looks a lot like
bugs in spam mailers, and very little like legitimate email to SA.

If nothing else, consider the tripwire rules, which look for letter 
combinations that don't exist in normal English...

Thanks! If so, then it's a bit more work to implement. For example, one simple idea is to
exempt text that was OCRed from image attachments from the rules that search for nonsense.

I meant 'safe' in the following sense: if the tool reports some meaningful word (e.g. one present
in the English wordlist up to a small misspelling), then this word is surely present in the image,
up to a small misspelling. So, if some spam rule sees 'viagra' or 'click here to get removed'
after OCRing, then it is 'safe' to give a hit for it.
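
To illustrate the 'safe' matching described above, here is a minimal Python sketch. It is not SpamAssassin code; the function names within_one_edit and phrase_hit are hypothetical, and "at most one insertion, deletion, or substitution per word" is my assumption for what a small misspelling means.

```python
import re

def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    # Make a the shorter (or equal-length) string.
    if len(a) > len(b):
        a, b = b, a
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: everything after position i must match.
        return a[i + 1:] == b[i + 1:]
    # One extra character in b: skip it and compare the rest.
    return a[i:] == b[i + 1:]

def phrase_hit(ocr_text: str, phrase: str) -> bool:
    """True if every word of phrase appears, in order and adjacent,
    in ocr_text with at most one error per word (assumed tolerance)."""
    words = re.findall(r"[a-z0-9]+", ocr_text.lower())
    target = phrase.lower().split()
    n = len(target)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if all(within_one_edit(w, t) for w, t in zip(window, target)):
            return True
    return False

print(phrase_hit("Buy V1agra now", "viagra"))
print(phrase_hit("Click herre to get removed today", "click here to get removed"))
```

Note that a one-error tolerance is loose for very short words, so a real rule would probably require exact matches below some word length.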

Another, more work-intensive method could be as follows (corrections are welcome):
1. OCR the image.
2. Throw out all words that are not in the English (German, Russian, etc.) dictionary
up to a misspelling, e.g. tolerate at most one error per word. Correct the misspelled words.
(This requires fast dictionary search, e.g. representing the wordlists as balanced binary trees.)
3. Run the other text-based rules on the result.
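
Steps 2 and 3 above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the names one_edit_variants and clean_ocr_text are hypothetical, the wordlist is a tiny stand-in, and I use a hash-based Python set instead of a balanced binary tree for the dictionary lookup. Each unknown token is expanded into all strings one edit away and intersected with the wordlist.

```python
import re
import string

def one_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings reachable from word by one deletion, substitution, or insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutions = {left + c + right[1:]
                     for left, right in splits if right
                     for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | substitutions | inserts

def clean_ocr_text(ocr_text, dictionary):
    """Keep dictionary words, correct words that are one edit away,
    and drop everything else. dictionary must be a set of lowercase words."""
    kept = []
    for token in re.findall(r"[a-z0-9]+", ocr_text.lower()):
        if token in dictionary:
            kept.append(token)
            continue
        # Arbitrary tie-break: take the alphabetically first candidate.
        candidates = sorted(one_edit_variants(token) & dictionary)
        if candidates:
            kept.append(candidates[0])
    return " ".join(kept)

# Stand-in wordlist; a real one would be loaded from a dictionary file.
WORDS = {"click", "here", "to", "get", "removed", "viagra"}
print(clean_ocr_text("xq7z Cl1ck herre to get remved zzkq", WORDS))
```

The cleaned string is what would then be handed to the ordinary text-based rules. The nonsense tokens are dropped before the rules see them, which is the point of step 2.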

Actually, I posted because I get too much image spam (which passes through SA undetected) and wanted
to find out whether it can be caught with the existing tools. Sometimes I get photos
and image smileys, so I'm very reluctant to stop all mails containing images without inspecting
them.

I strongly believe that tools such as gocr can really help. The remaining question is how to
integrate one into SA, and who would do it. I'm afraid I cannot dig into the SA code myself, so this is
a suggestion to the advanced users and developers.

