spamassassin-users mailing list archives

From Axb <axb.li...@gmail.com>
Subject Re: 23_bayes_ignore_header.cf
Date Tue, 14 Oct 2014 21:54:56 GMT
On 10/14/2014 05:07 PM, RW wrote:
> On Tue, 14 Oct 2014 13:58:27 +0200
> Axb wrote:
>
>> On 10/14/2014 01:51 PM, RW wrote:
>>> On Tue, 14 Oct 2014 10:44:51 +0200
>>> Axb wrote:
>>>
>>>>
>>>> have you verified that some of these are not included?
>>>>
>>>> X-Originating-IP will not be included as it can be used to help
>>>> detect ham or spam
>>>
>>> It's really no different to other headers you are ignoring.
>>
>> for example, if you get a flood of 419s from the same source, you may
>> want it to be tokenized...
>
>
> As I do with, for example:
>
>    X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
>
> in this spam Bayes found
>
>    0.999-4--HX-AntiAbuse:32007
>
> These numbers seem to be very good indicators for me.
>
>
> Most of the headers in the file have never appeared in my ham, so
> they'll be pure spam indicators if they are ever faked. In general
> it's difficult for a spammer to gain an overall advantage against
> an average per user database using faked headers.
>
> Whatever the merits of this on system-wide Bayes (if any beyond
> reducing token count), I think it would have a negative effect on
> per user Bayes.
>

oooooooooooook..
now here's a surprise (it's all in the code :)

the Bayes.pm plugin already includes:


# Which headers should we scan for tokens?  Don't use all of them, as it's easy
# to pick up spurious clues from some.  What we now do is use all of them
# *less* these well-known headers; that way we can pick up spammers' tracking
# headers (which are obviously not well-known in advance!).

# Received is handled specially
$IGNORED_HDRS = qr{(?: (?:X-)?Sender    # misc noise
   |Delivered-To |Delivery-Date
   |(?:X-)?Envelope-To
   |X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text

   |Subject      # not worth a tiny gain vs. to db size increase

   # Date: can provide invalid cues if your spam corpus is
   # older/newer than ham
   |Date

   # List headers: ignore. a spamfiltering mailing list will
   # become a nonspam sign.
   |X-List|(?:X-)?Mailing-List
   |(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
     |Unsubscribe|Host|Id|Manager|Admin|Comment
     |Name|Url)
   |X-Unsub(?:scribe)?
   |X-Mailman-Version |X-Been[Tt]here |X-Loop
   |Mail-Followup-To
   |X-eGroups-(?:Return|From)
   |X-MDMailing-List
   |X-XEmacs-List

   # gatewayed through mailing list (thanks to Allen Smith)
   |(?:X-)?Resent-(?:From|To|Date)
   |(?:X-)?Original-(?:From|To|Date)

   # Spamfilter/virus-scanner headers: too easy to chain from
   # these
   |X-MailScanner(?:-SpamCheck)?
   |X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
   |X-Antispam |X-RBL-Warning |X-Mailscanner
   |X-MDaemon-Deliver-To |X-Virus-Scanned
   |X-Mass-Check-Id
   |X-Pyzor |X-DCC-\S{2,25}-Metrics
   |X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
   |X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
   |X-SpamCop-[^:]+
   |X-SMTPD |(?:X-)?Spam-Apparently-To
   |SPAM |X-Perlmx-Spam
   |X-Bogosity

   # some noisy Outlook headers that add no good clues:
   |Content-Class |Thread-(?:Index|Topic)
   |X-Original[Aa]rrival[Tt]ime

   # Annotations from IMAP, POP, and MH:
   |(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
   |Lines |Content-Length
   |X-UIDL? |X-IMAPbase

   # Annotations from Bugzilla
   |X-Bugzilla-[^:]+

   # Annotations from VM: (thanks to Allen Smith)
   |X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
     |Summary-Format|VHeader|v\d-Data|Message-Order)

   # Annotations from Gnus:
   | X-Gnus-Mail-Source
   | Xref

)}x;

# Note only the presence of these headers, in order to reduce the
# hapaxen they generate.
$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
   |X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
   |D(?:KIM|omainKey)-Signature
)}ix;
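
A quick way to see what these two patterns mean for the headers discussed in this thread is to test a few header names against them. The following is only a minimal sketch: both patterns are abbreviated copies of the ones quoted above, `classify_header` is a hypothetical helper that is not part of Bayes.pm, and the anchored match on the bare header name is an assumption about how the patterns are meant to be applied.

```perl
use strict;
use warnings;

# Abbreviated copies of the two patterns quoted above
# (only a few alternatives are kept from each).
my $IGNORED_HDRS = qr{(?: (?:X-)?Sender
   |Subject
   |Date
   |X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
   |X-Virus-Scanned
)}x;

my $MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
   |D(?:KIM|omainKey)-Signature
)}ix;

# Hypothetical helper, not part of Bayes.pm: classify a header by its name.
# Assumption: the patterns are matched anchored against the whole name.
sub classify_header {
    my ($name) = @_;
    return 'ignored'       if $name =~ /^${IGNORED_HDRS}$/;
    return 'presence-only' if $name =~ /^${MARK_PRESENCE_ONLY_HDRS}$/;
    return 'tokenized';
}

for my $name (qw(Subject X-Spam-Status DKIM-Signature
                 X-Originating-IP X-AntiAbuse)) {
    printf "%-17s %s\n", $name, classify_header($name);
}
```

With these abbreviated patterns, Subject and X-Spam-Status come out as ignored, DKIM-Signature as presence-only, and X-Originating-IP and X-AntiAbuse as tokenized. Neither of those last two names matches anything in the full list quoted above either.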

funny...

