From polloxx <>
Subject Re: FuzzyOCR
Date Sat, 09 Jul 2011 13:40:01 GMT
On Thu, Jul 7, 2011 at 5:18 PM, John Hardin <> wrote:
> On Thu, 7 Jul 2011, polloxx wrote:
>> On Wed, Jul 6, 2011 at 6:33 PM, John Hardin <> wrote:
>>> OK. Just to be clear, you took a jpeg-format image file and used jpegtopnm
>>> to convert it to a pnm file, and got a correct .pnm image file out? Did you
>>> do this to verify the exit code from jpegtopnm:
>>>    echo $?
>> $ /usr/bin/jpegtopnm ./spam1.jpg > spam1.pnm
>> jpegtopnm: WRITING PPM FILE
>> spam1.pnm is created.
> Please run this:
>     /usr/bin/jpegtopnm ./spam1.jpg > spam1.pnm ; echo $?
> The return code is likely zero, but let's be _sure_.

Yes, zero.
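
For a repeatable check, the same test can be wrapped in a small function.
This is only a sketch: convert_and_check is a hypothetical helper name, not
part of netpbm or FuzzyOCR. It runs whatever command it is given, sends
stdout to the named file, and prints the exit status.

```shell
#!/bin/sh
# Hypothetical helper (not part of netpbm or FuzzyOCR).
# Usage: convert_and_check OUTFILE COMMAND [ARGS...]
convert_and_check() {
    out=$1
    shift
    # Run the command with its stdout redirected to the output file.
    "$@" > "$out"
    # Save the status right away; any later command would overwrite $?.
    status=$?
    echo "exit status: $status"
    return "$status"
}

# Example call, using the paths from this thread:
#   convert_and_check spam1.pnm /usr/bin/jpegtopnm ./spam1.jpg
```

A non-zero status here would point at the conversion step rather than at
FuzzyOCR itself.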

>>> It would be useful to see the debugging output of spamassassin where it's
>>> talking about fuzzyocr. Do you know how to run spamassassin in debug mode
>>> against a test message?
>> # spamassassin --debug FuzzyOCR < ./spam1.jpg > /dev/null
> Your input there needs to be a complete email message with the image as an
> attachment, not the image itself:
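
One way to get such a message from a bare image is to wrap it by hand. The
sketch below is only an illustration: make_test_eml, the example.com
addresses, and the boundary string are made up, and it assumes a base64
command that accepts a file name argument (as the GNU coreutils one does).

```shell
#!/bin/sh
# Hypothetical helper: write a minimal MIME message to stdout with the
# given image attached. Addresses and boundary are placeholders.
make_test_eml() {
    img=$1
    name=$(basename "$img")
    boundary="fuzzyocr-test-boundary"   # assumed not to occur in the data
    cat <<EOF
From: sender@example.com
To: recipient@example.com
Subject: FuzzyOCR test message
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="$boundary"

--$boundary
Content-Type: text/plain; charset=us-ascii

Test message with an image attachment.

--$boundary
Content-Type: image/jpeg; name="$name"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="$name"

EOF
    # Attachment body, base64-encoded, then the closing boundary line.
    base64 "$img"
    printf '\n--%s--\n' "$boundary"
}

# Example call, using the file name from this thread:
#   make_test_eml ./spam1.jpg > test.eml
#   spamassassin --debug FuzzyOCR < test.eml > /dev/null
```

A message built this way has no realistic Received: headers, so the
header-based rules will score it differently than real spam; it is only
meant to exercise the image handling.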

The example eml from SpamAssassin works fine:

# spamassassin --debug FuzzyOCR > output

# cat output

Spam detection software, running on the system "", has
identified this incoming email as possible spam.  The original message
has been attached to this so you can view it (if it isn't spam) or label
similar future email.  If you have any questions, see
the administrator of that system for details.

Content preview:  Langdon looked again at the fax an ancient myth confirmed
   in black and white. The implications were frightening. He gazed absently through
   the bay window. The first hint of dawn was sifting through the birch trees
   in his backyard, but the view looked somehow different this morning. As an
   odd combination of fear and exhilaration settled over him, Langdon knew he
   had no choice The man led Langdon the length of the hangar. They rounded
  the corner onto the runway. [...]

Content analysis details:   (24.6 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 3.6 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [ listed in]
                            [ listed in]
 0.8 DKIM_ADSP_NXDOMAIN     No valid author signature and domain not in DNS
 2.5 DATE_IN_FUTURE_12_24   Date: is 12 to 24 hours after Received: date
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.0 MIME_QP_LONG_LINE      RAW: Quoted-printable line longer than 76 chars
 0.2 SHORT_HELO_AND_INLINE_IMAGE Short HELO string, with inline image
 1.3 RDNS_NONE              Delivered to internal network by a host with no rDNS
 0.0 T_DOS_OUTLOOK_TO_MX_IMAGE Direct to MX with Outlook headers and an
 9.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            [Words found:]
                            ["levitra" in 1 lines]
                            ["cialis" in 1 lines]
                            ["viagra" in 2 lines]
                            [(6 word occurrences found)]
-2.3 AWL                    AWL: From: address is in the auto white-list

The original message was not completely plain text, and may be unsafe to
open with some email clients; in particular, it may contain a virus,
or confirm that your address can receive spam.  If you wish to view
it, it may be safer to save it to a file and open it with an editor.

