james-mime4j-dev mailing list archives

From Markus Wiederkehr <markus.wiederk...@gmail.com>
Subject Re: Round tripping (MIME4J-112)
Date Thu, 12 Feb 2009 19:51:48 GMT
On Tue, Feb 10, 2009 at 9:26 PM, Stefano Bagnara <io@bago.org> wrote:
> Oleg Kalnichevski ha scritto:
>> Markus Wiederkehr wrote:
>>> I've been investigating the current code a little bit more and I've
>>> come to think that something really goes wrong. Please have a look at
>>> the code.
>>> Class AbstractEntity has a ByteArrayBuffer "linebuf" and a
>>> CharArrayBuffer "fieldbuf". Method fillFieldBuffer() copies bytes from
>>> the underlying stream to "linebuf" (line 146:
>>> instream.readLine(linebuf)). Later on the input bytes are appended to
>>> "fieldbuf" (line 143: fieldbuf.append(linebuf, 0, len)). At this point
>>> bytes are decoded into characters. A closer look at CharArrayBuffer
>>> reveals how this is done:
>>>    int ch = b[i1];
>>>    if (ch < 0) {
>>>        ch = 256 + ch;
>>>    }
>>> This is equivalent to ISO-8859-1 conversion because Latin 1 is the
>>> only charset that directly maps byte codes 00 to ff to unicode code
>>> points 0000 to 00ff.
>>> All works well as long as the underlying stream only contains ASCII
>>> bytes.
>>> But assume the message contains non-ASCII bytes and a Content-Type
>>> field with a charset parameter is also present. In this case the input
>>> bytes should probably be decoded using that specified charset instead
>>> of Latin 1. This is the opposite situation to the LENIENT writing mode
>>> where we encode header fields using the charset from the Content-Type
>>> field.
>> To me, parsing of MIME headers using any charset other than US-ASCII never
>> made any sense whatsoever, but so be it.
>> So, in the lenient mode, effectively, we would have to do the following:
>> (1) parse headers (at least partially) in order to locate Content-Type
>> header and extract the charset attribute from it, if present; (2) parse all
>> headers again (probably, lazily) using the charset from the Content-Type.
>> That's quite a bit of extra work.
> If we want to parse real world messages then we have to expect also
> non-7bit-ASCII bytes in headers. They are malformed, but they are very
> common.
> IMHO this is a non-issue for "subsequent-roundtripping": mime4j should
> encode them properly in output and be able to roundtrip its own output.
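
To illustrate the conversion quoted above, here is a minimal sketch
that checks the "ch = 256 + ch" widening against the JDK's ISO-8859-1
decoder for all 256 byte values. The class and method names
(Latin1Equivalence, widen, widenAll) are made up for this example and
are not part of Mime4j.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Mime4j.
class Latin1Equivalence {

    // Same widening as in the CharArrayBuffer snippet quoted above.
    static char widen(byte b) {
        int ch = b;
        if (ch < 0) {
            ch = 256 + ch;
        }
        return (char) ch;
    }

    // Applies widen() to every byte of the input.
    static String widenAll(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append(widen(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build an array holding every possible byte value, 0x00 to 0xff.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String widened = widenAll(all);
        String latin1 = new String(all, StandardCharsets.ISO_8859_1);
        System.out.println(widened.equals(latin1)); // prints "true"
    }
}
```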

In my opinion roundtripping and "encoding a message properly" are
separate concerns that should be handled independently.

What I have in mind is some kind of visitor or transformer that can be
used to tidy up a message before writing it.

A similar transformer can be used to canonicalize a message for S/MIME
or OpenPGP/MIME.

Writing a message to an output stream should always write the message
as it is at that moment. And if the message contains non-ASCII bytes
when it gets written then so be it.

> If instead we want to be able to parse any stream (non valid MIME, or even
> NON MIME at all.. maybe any binary content???) then we'll have to deal with
> this and much more: IMHO this would be cool, but a real PITA. I'd be
> very happy with the "subsequent-roundtripping" (I'm not sure this is the
> same as Robert's "unlimited round tripping", to be sure I created a new term
> ;-) ).

IIUC what you mean by "subsequent-roundtripping" is that Mime4j should
be capable of creating identical output if it parses and writes a
message that has been created by Mime4j itself.

Robert's "unlimited round tripping" means Mime4j should be capable of
creating identical output for any kind of input message.

I would like to achieve "nearly unlimited round tripping" (to invent a
third term ;-) meaning Mime4j should preserve the exact bytes of
header fields and there should be a switch in MessageBuilder to also
preserve the transfer encodings (default value may be false).

"Nearly" because I don't think that the kinds of line endings (cr, lf
or crlf) need to be preserved. Also if AbstractEntity drops an invalid
header field it cannot be preserved because the ContentHandler never
gets to see it.
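
A rough sketch of what "preserve the exact bytes of header fields"
could look like: the field keeps its raw bytes, decodes them only on
request with a charset chosen by the caller, and writes the same bytes
back with a CRLF terminator (so line endings are normalized, which is
the "nearly" part). RawField and its methods are hypothetical names for
illustration only, not existing Mime4j API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class, not part of Mime4j.
final class RawField {
    // Raw bytes of the field as read from the stream, without line terminator.
    private final byte[] raw;

    RawField(byte[] raw) {
        this.raw = raw.clone();
    }

    // Decoding is deferred; the caller decides which charset applies.
    String decode(Charset charset) {
        return new String(raw, charset);
    }

    // The original bytes followed by CRLF; no charset is involved.
    byte[] serialized() {
        byte[] out = Arrays.copyOf(raw, raw.length + 2);
        out[raw.length] = '\r';
        out[raw.length + 1] = '\n';
        return out;
    }

    void writeTo(OutputStream out) throws IOException {
        out.write(serialized());
    }

    public static void main(String[] args) {
        // A malformed but common case: raw UTF-8 bytes in a header field.
        byte[] input = "Subject: caf\u00e9".getBytes(StandardCharsets.UTF_8);
        RawField field = new RawField(input);
        byte[] written = field.serialized();
        // The field bytes survive unchanged, whatever charset they were in.
        System.out.println(Arrays.equals(Arrays.copyOf(written, input.length), input)); // prints "true"
    }
}
```

The trade-off is the one raised later in the quoted text: a consumer
that needs a String has to pick a charset and decode the field itself.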

>>> Okay, so now assume we have parsed that message and use the LENIENT
>>> writing mode to write it out again. Clearly we have a serious round
>>> tripping issue now, because Latin 1 was used to decode the fields but
>>> the potentially different Content-Type charset is used to encode them
>>> again.
>>> I think the inherent problem is that AbstractEntity attempts to
>>> convert bytes into characters. This should not happen so early in the
>>> process.
>>> In my opinion it would be better if AbstractEntity treated a header
>>> field as a byte array. It would be better to pass a byte array to a
>>> ContentHandler or a BodyDescriptor. The ContentHandler /
>>> BodyDescriptor implementation can then decide how to decode the bytes.
>> This would push the responsibility of detecting the charset and correct
>> parsing of headers to individual ContentHandler implementations and would
>> make the task of implementing a ContentHandler more complex, but probably is
>> the most flexible solution to the problem.
> I'm not sure I understand the technical details, but IMHO the "smart" thing
> is to correctly decode 8bit bytes from headers using the encoding specified
> in the same header (maybe in a following header line!!!) while in output
> always use encoding (so no 8bit in output from mime4j, ever).
> This will fix broken messages, I'm not sure how many PGP/DKIM/SMIME like
> normalizations this would break...
>>> This could really help with the goal of complete round tripping.
>>> Class Field could store the original raw field value in a byte array
>>> instead of a String.
>>> One drawback would be that duplicate parsing of header fields is maybe
>>> inevitable.
>>> Opinions?
>> I am in favor of using ByteArrayBuffer at the ContentHandler level, even
>> though this would make the task of implementing it more difficult.
>> Oleg
>>> Markus
>>> PS: I don't intend to stop 0.6 but maybe we should keep the note
>>> regarding round trip issues in the release notes.
> +1
> Stefano
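
The round-trip issue described in the quoted text (decode with Latin 1,
encode with the Content-Type charset) can be reproduced with plain JDK
calls. This is only a sketch under the assumption that the
Content-Type charset is UTF-8; RoundTripMismatch and roundTrip are
names invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Mime4j.
class RoundTripMismatch {

    // Decode the bytes with one charset, then encode the result with another.
    static byte[] roundTrip(byte[] in, Charset decodeWith, Charset encodeWith) {
        return new String(in, decodeWith).getBytes(encodeWith);
    }

    public static void main(String[] args) {
        // Header field containing raw UTF-8 bytes (0xC3 0xA9 for the last character).
        byte[] original = "Subject: caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Parsed as Latin 1, written using the (assumed) Content-Type charset.
        byte[] mixed = roundTrip(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, mixed)); // prints "false"

        // Using Latin 1 in both directions restores the input bytes.
        byte[] same = roundTrip(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(original, same)); // prints "true"
    }
}
```

In the first case each of the two non-ASCII input bytes is turned into
a separate character and then encoded as two bytes, so the output is
two bytes longer than the input.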
