spamassassin-dev mailing list archives

From bugzilla-dae...@bugzilla.spamassassin.org
Subject [Bug 3749] message parser skips blank/invalid(?) parts
Date Mon, 06 Sep 2004 01:05:38 GMT
http://bugzilla.spamassassin.org/show_bug.cgi?id=3749





------- Additional Comments From felicity@kluge.net  2004-09-05 18:05 -------
Subject: Re:  message parser skips blank/invalid(?) parts

On Sun, Sep 05, 2004 at 11:13:37AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> aha!  the extra node thing is caused by malformed mime boundaries.
> 
> ----9382928791803058304---
> 
> note the extra dash at the end of the bottom boundary -- so our parser sees that as a
> new mime part which is empty (nothing after that boundary), defaults to text/plain as
> it should, etc.

Ok, the original/current version of the parser looked for different REs
at different times:

/^--$boundary$/		# find preamble
/^--$boundary/		# find end of any part
/^--$boundary--$/	# find end of multipart
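
To see why the quoted boundary line produces an extra node, here is a rough
Python sketch of the three expressions above (not the parser's actual code).
The boundary value is an assumption: the quoted line minus its leading "--"
and its three trailing dashes. classify() is a hypothetical helper.

```python
import re

# Assumption: the boundary parameter is the quoted line without its
# leading "--" and without its three trailing dashes.
boundary = "--9382928791803058304"
b = re.escape(boundary)

# Python equivalents of the three original expressions.
preamble_end = re.compile(rf"^--{b}$")
part_end = re.compile(rf"^--{b}")
multipart_end = re.compile(rf"^--{b}--$")

malformed_close = "----9382928791803058304---"


def classify(line):
    """Hypothetical helper: report how the original patterns see a line."""
    if multipart_end.search(line):
        return "end of multipart"
    if part_end.search(line):
        return "end of part, another part follows"
    return "body text"


print(classify(malformed_close))          # end of part, another part follows
print(classify("--" + boundary + "--"))   # end of multipart
```

The extra dash defeats the "--$" anchor, so the line only matches the
"end of any part" expression and an empty part gets opened after it.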

Re-reading RFC 1521, I see there is a specific requirement for
the "$" EOL part:

"The boundary must be followed immediately either by
another CRLF and the header fields for the next part, or by two
CRLFs, in which case there are no header fields for the next part
(and it is therefore assumed to be of Content-Type text/plain)."

and

"The encapsulation boundary following the last body part is a
distinguished delimiter that indicates that no further body parts
will follow.  Such a delimiter is identical to the previous
delimiters, with the addition of two more hyphens at the end of the
line [...]"

It's even clearer in the BNF in section 7.2.1, BTW:

   delimiter := "--" boundary CRLF ; taken from Content-Type field.
                                   ; There must be no space
                                   ; between "--" and boundary.

   close-delimiter := "--" boundary "--" CRLF ; Again, no space by "--",


Since MUAs (at least the ones I quickly tested) deal with the invalid MIME
boundary as already discussed, we should as well.

However, we can't just remove the /$/ part.  For instance, if you have
an outside boundary of XXXX and a multipart inside has a boundary of
XXXXAA, then without the EOL check the parser gets confused about which
part is which (it'll assume the inside multipart ended early when it
finds the first sub-multipart).  So I've settled on:

/^--$boundary\s*$/		# find preamble
/^--$boundary(?:--|\s*$)/	# find end of any part
/^--$boundary--/		# find end of multipart

This will allow us to deal with RFC compliant messages, and the malformed MIME
boundaries that I've seen in my corpus during testing.
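
As a quick sanity check of the revised expressions, here is a small Python
sketch (again not the parser's code) using the XXXX / XXXXAA example from
above.  new_patterns() is a hypothetical helper, and the sample lines are
made up for illustration.

```python
import re


def new_patterns(boundary):
    """Hypothetical helper: Python equivalents of the revised expressions."""
    b = re.escape(boundary)
    return {
        "preamble_end": re.compile(rf"^--{b}\s*$"),
        "part_end": re.compile(rf"^--{b}(?:--|\s*$)"),
        "multipart_end": re.compile(rf"^--{b}--"),
    }


outer = new_patterns("XXXX")
inner = new_patterns("XXXXAA")

# A bare prefix match on the outer boundary also fires on the inner one.
prefix_only = re.compile(r"^--XXXX")
print(bool(prefix_only.search("--XXXXAA")))              # True

# The revised expression keeps the two boundaries apart.
print(bool(outer["part_end"].search("--XXXXAA")))        # False
print(bool(inner["part_end"].search("--XXXXAA")))        # True

# Trailing whitespace and an extra trailing dash are both tolerated.
print(bool(outer["part_end"].search("--XXXX  ")))        # True
print(bool(outer["multipart_end"].search("--XXXX---")))  # True
```

The end-of-line check is kept for ordinary delimiters, while the
close-delimiter only needs to start with "--$boundary--".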


Ok, so I'll attach the new patch shortly.  Of the 34k messages I ran through,
there were 107 messages with different results.  The breakdown of those
changes is:

84 +MPART_ALT_DIFF
14 +MIME_HTML_MOSTLY -MIME_HTML_ONLY -MIME_HTML_ONLY_MULTI -FORGED_OUTLOOK_HTML
 7 +MIME_HTML_MOSTLY -MIME_HTML_ONLY -MIME_HTML_ONLY_MULTI
 1 +__MIME_BASE64
 1 +FORGED_OUTLOOK_HTML +FORGED_OUTLOOK_TAGS +HTML_MIME_NO_HTML_TAG +MIME_HTML_ONLY +__HIGHBITS +__MIME_HTML

I went through these by hand.  The messages fall into a small set of
categories:

1) a blank text/* part that used to be skipped (mostly text/plain; the forged
   Outlook one was a blank text/html part with no end boundary, and the
   __MIME_BASE64 one was a Nigerian scam with a malformed, blank base64
   attachment at the end and no end boundary), etc.

2) (via the patch from bug 3751) a multipart message we weren't parsing
   correctly before due to a malformed boundary

So at this point, I think the parser will be "doing the right thing"(tm).





