james-mime4j-dev mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: Round tripping (MIME4J-112)
Date Sun, 15 Feb 2009 15:05:25 GMT
Markus Wiederkehr wrote:
> On Tue, Feb 10, 2009 at 8:28 PM, Oleg Kalnichevski <olegk@apache.org> wrote:
>> Markus Wiederkehr wrote:
>>> I've been investigating the current code a little bit more and I've
>>> come to think that something really goes wrong. Please have a look at
>>> the code.
>>> Class AbstractEntity has a ByteArrayBuffer "linebuf" and a
>>> CharArrayBuffer "fieldbuf". Method fillFieldBuffer() copies bytes from
>>> the underlying stream to "linebuf" (line 146:
>>> instream.readLine(linebuf)). Later on the input bytes are appended to
>>> "fieldbuf" (line 143: fieldbuf.append(linebuf, 0, len)). At this point
>>> bytes are decoded into characters. A closer look at CharArrayBuffer
>>> reveals how this is done:
>>>    int ch = b[i1];
>>>    if (ch < 0) {
>>>        ch = 256 + ch;
>>>    }
>>> This is equivalent to ISO-8859-1 conversion because Latin 1 is the
>>> only charset that directly maps byte codes 00 to ff to unicode code
>>> points 0000 to 00ff.
>>> All works well as long as the underlying stream only contains ASCII bytes.
>>> But assume the message contains non-ASCII bytes and a Content-Type
>>> field with a charset parameter is also present. In this case the input
>>> bytes should probably be decoded using that specified charset instead
>>> of Latin 1. This is the opposite situation to the LENIENT writing mode
>>> where we encode header fields using the charset from the Content-Type
>>> field.
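(A standalone sketch to make the equivalence above concrete, not actual mime4j code: the sign-fix maps byte values 0x80-0xFF to code points U+0080-U+00FF, which is exactly what an ISO-8859-1 decode does.)

```java
import java.nio.charset.Charset;

public class Latin1Demo {
    public static void main(String[] args) {
        byte[] raw = { 0x41, (byte) 0xE9, (byte) 0xFC }; // 'A', 'é', 'ü' in Latin-1
        char[] chars = new char[raw.length];
        for (int i = 0; i < raw.length; i++) {
            int ch = raw[i];
            if (ch < 0) {
                ch = 256 + ch; // the same sign-fix CharArrayBuffer applies
            }
            chars[i] = (char) ch;
        }
        String viaSignFix = new String(chars);
        String viaLatin1 = new String(raw, Charset.forName("ISO-8859-1"));
        System.out.println(viaSignFix.equals(viaLatin1)); // prints "true"
    }
}
```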
>> To me, parsing of MIME headers using any charset other than US-ASCII never
>> made any sense whatsoever, but so be it.
> Interesting.. I always thought that the lenient writing mode is a
> requirement coming from the http world or HC more specifically. 

If my memory does not fail me it was the other way around. Prior to 0.4 
mime4j only supported what is now called lenient mode.

> And if
> someone wants to write non-ascii chars using the content-type charset
> it seems only natural that a corresponding parsing mode is also
> required.

True. This is very reasonable.

>> So, in the lenient mode, effectively, we would have to do the following: (1)
>> parse headers (at least partially) in order to locate Content-Type header
>> and extract the charset attribute from it, if present; (2) parse all headers
>> again (probably, lazily) using the charset from the Content-Type.
> Exactly.
>> That's quite a bit of extra work.
> Well, actually not very much, at least when compared to the current
> code. Currently header fields are already parsed twice, by
> AbstractEntity and a second time by MessageBuilder. MessageBuilder
> would only have to keep the raw fields in a list and defer the Field
> creation until endHeader() is invoked..
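A rough sketch of that idea (class and method names here are hypothetical, not the actual MessageBuilder API): buffer the raw field bytes during the first pass, locate the Content-Type charset, then decode everything when endHeader() is invoked.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeferredHeaderParser {
    private final List<byte[]> rawFields = new ArrayList<byte[]>();

    // pass 1: keep the undecoded bytes; create no Field objects yet
    public void rawField(byte[] field) {
        rawFields.add(field);
    }

    // pass 2 (on endHeader): find the charset, then decode all fields with it
    public List<String> endHeader() {
        Charset charset = Charset.forName("ISO-8859-1"); // fallback
        Pattern p = Pattern.compile("(?i)^content-type:.*charset=\"?([\\w-]+)");
        for (byte[] raw : rawFields) {
            // the header structure itself is ASCII, so Latin-1 is safe for scanning
            Matcher m = p.matcher(new String(raw, Charset.forName("ISO-8859-1")));
            if (m.find()) {
                charset = Charset.forName(m.group(1));
                break;
            }
        }
        List<String> decoded = new ArrayList<String>();
        for (byte[] raw : rawFields) {
            decoded.add(new String(raw, charset));
        }
        return decoded;
    }
}
```

The cost is one extra decode per header plus holding the raw bytes until the header block ends, which matches the "not very much extra work" estimate above.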
>>> Okay, so now assume we have parsed that message and use the LENIENT
>>> writing mode to write it out again. Clearly we have a serious round
>>> tripping issue now, because Latin 1 was used to decode the fields but
>>> the potentially different Content-Type charset is used to encode them
>>> again.
>>> I think the inherent problem is that AbstractEntity attempts to
>>> convert bytes into characters. This should not happen so early in the
>>> process.
>>> In my opinion it would be better if AbstractEntity treated a header
>>> field as a byte array. It would be better to pass a byte array to a
>>> ContentHandler or a BodyDescriptor. The ContentHandler /
>>> BodyDescriptor implementation can then decide how to decode the bytes.
>> This would push the responsibility of detecting the charset and correct
>> parsing of headers to individual ContentHandler implementations and would
>> make the task of implementing a ContentHandler more complex, but probably is
>> the most flexible solution to the problem.
> A ContentHandler receives everything else as a sequence of bytes; the
> message body as well as preamble/epilogue of a multipart. Only the
> headers come as strings.
>>> This could really help with the goal of complete round tripping..
>>> Class Field could store the original raw field value in a byte array
>>> instead of a String.
>>> One drawback would be that duplicate parsing of header fields is maybe
>>> inevitable..
>>> Opinions?
>> I am in favor of using ByteArrayBuffer at the ContentHandler level, even
>> though this would make the task of implementing it more difficult.
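A minimal sketch of that Field change (hypothetical names, not the current mime4j class): keep the original bytes next to the decoded value, so the write-out is byte-identical no matter which charset was used for decoding.

```java
import java.nio.charset.Charset;

public class RawField {
    private final byte[] raw;   // the exact bytes read from the stream
    private final String value; // decoded once, with whatever charset applies

    public RawField(byte[] raw, Charset charset) {
        this.raw = raw.clone();
        this.value = new String(raw, charset);
    }

    public String getValue() {
        return value;
    }

    // round trip: emit the original bytes untouched
    public byte[] writeOut() {
        return raw.clone();
    }
}
```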

All right. If there are no objections, I can put together a patch along 
the lines outlined above.


>> Oleg
>>> Markus
>>> PS: I don't intend to stop 0.6 but maybe we should keep the note
>>> regarding round-trip issues in the release notes.
