spamassassin-users mailing list archives

From JamesDR <rolai...@bellsouth.net>
Subject Re: Suggestion: OCR
Date Wed, 02 Mar 2005 22:22:11 GMT
I like the idea; however, I can see this adding quite a bit of time to
the scan on large images. (I've never used gocr, so I can only compare
it to other OCR products I've used, and they were all pretty slow.) I
had the problem you described, with image-only spam getting through.
What I did was hand-modify the rules that did hit: I raised their scores
bit by bit until those mails reached the threshold that marks them as
spam, while making sure ham messages stayed below it. (Why someone
embeds images in their mails is beyond me, but it does happen.)

My 2 cents....
Thanks,
JamesDR

sasha.mal@excite.com wrote:
>  --- On Wed 03/02, Matt Kettler < mkettler@evi-inc.com > wrote:
> 
> 
> 
> That part is definitely NOT safe in the context of SpamAssassin...
> Nonsense looks a lot like bugs in spam mailers, and very little like
> legitimate email to SA.
> 
> 
> 
> If nothing else, consider the tripwire rules, which look for letter
> combinations that don't exist in normal English...
> 
> -----------
> 
> 
> 
> Thanks! If so, then it's a bit more work to implement. For example, a
> trivial idea is not to let the text extracted from images go through
> the rules that search for nonsense.
> 
> 
> 
> I meant 'safe' in the following sense: if the tool reports some
> meaningful word (e.g. one present in the English wordlist, up to a
> small misspelling), then this word is surely present in the image, up
> to a small misspelling. So, if some spam rule sees 'viagra' or 'click
> here to get removed' after OCRing, then it is 'safe' to give a hit for
> it, for example.
> 
> 
> 
> Another, more work-intensive method could be as follows (corrections
> are welcome):
>
> 1. OCR.
>
> 2. Throw out all words that are not in the English (German, Russian,
>    etc.) dictionary, up to a misspelling, e.g. tolerate at most one
>    error per word. Correct the misspelled words. (Fast dictionary
>    search required, e.g. represent wordlists as balanced binary trees.)
>
> 3. Run other text-based rules.
> 
> 
> 
> Actually, I posted because I get too much image spam (which passes
> through SA unflagged) and wanted to find out whether it could be
> caught with the present tools. Sometimes I get photos and
> image-smileys, so I'm very reluctant to stop all mails containing
> images without inspecting the images.
> 
> 
> 
> My strong belief is that tools such as gocr can really help. The
> other question is how to integrate it into SA and who would do it.
> I'm afraid I cannot dig into the SA code myself, so it's a suggestion
> to the advanced users and developers.
> 
> 
> 
> Regards,
> 
> sasha.
> 
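
For anyone who wants to experiment, here is a minimal sketch of the
three-step method in the quoted message. It is not SpamAssassin code and
it does not call gocr: it assumes the OCR step has already produced a
plain string. The wordlist, the phrase list and every function name are
hypothetical, and a Python set stands in for the balanced binary tree
suggested above.

```python
import re
import string

# Hypothetical, tiny wordlist; a real setup would load a full dictionary.
WORDLIST = {"click", "here", "to", "get", "removed", "viagra", "cheap", "buy"}

# Hypothetical phrases that a text-based rule might look for.
SPAM_PHRASES = ["viagra", "click here to get removed"]


def edits1(word):
    """Return every string one deletion, substitution or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | replaces | inserts


def correct(word, wordlist=WORDLIST):
    """Return the dictionary word for `word`, or None if it is more than
    one error away from every entry in the wordlist."""
    if word in wordlist:
        return word
    candidates = sorted(edits1(word) & wordlist)
    # Ties are broken alphabetically; this choice is arbitrary.
    return candidates[0] if candidates else None


def clean_ocr_text(ocr_text, wordlist=WORDLIST):
    """Step 2: drop tokens that are not near-dictionary words and
    replace the rest with their corrected spelling."""
    tokens = re.findall(r"[a-z0-9]+", ocr_text.lower())
    kept = [w for w in (correct(t, wordlist) for t in tokens) if w is not None]
    return " ".join(kept)


def matched_phrases(text, phrases=SPAM_PHRASES):
    """Step 3: a stand-in for the text-based rules (plain substring match)."""
    return [p for p in phrases if p in text]


if __name__ == "__main__":
    # Step 1 is assumed: this string plays the role of the OCR output.
    ocr_output = "V1agra xqzt!! cl1ck here to get remowed"
    cleaned = clean_ocr_text(ocr_output)
    print(cleaned)
    print(matched_phrases(cleaned))
```

With this toy wordlist, the noise token "xqzt" is thrown out and the
three misread words are corrected, so both phrases match. Note that
one-error tolerance is generous for very short words, so a real wordlist
would probably need a minimum word length.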
