james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Wiederkehr <markus.wiederk...@gmail.com>
Subject Re: Round tripping (MIME4J-112)
Date Thu, 12 Feb 2009 19:31:03 GMT
On Tue, Feb 10, 2009 at 8:28 PM, Oleg Kalnichevski <olegk@apache.org> wrote:
> Markus Wiederkehr wrote:
>>
>> I've been investigating the current code a little bit more and I've
>> come to think that something really goes wrong. Please have a look at
>> the code.
>>
>> Class AbstractEntity has a ByteArrayBuffer "linebuf" and a
>> CharArrayBuffer "fieldbuf". Method fillFieldBuffer() copies bytes from
>> the underlying stream to "linebuf" (line 146:
>> instream.readLine(linebuf)). Later on the input bytes are appended to
>> "fieldbuf" (line 143: fieldbuf.append(linebuf, 0, len)). At this point
>> bytes are decoded into characters. A closer look at CharArrayBuffer
>> reveals how this is done:
>>
>>    int ch = b[i1];
>>    if (ch < 0) {
>>        ch = 256 + ch;
>>    }
>>
>> This is equivalent to ISO-8859-1 conversion because Latin 1 is the
>> only charset that directly maps byte codes 00 to ff to unicode code
>> points 0000 to 00ff.
>>
>> All works well as long as the underlying stream only contains ASCII bytes.
>>
>> But assume the message contains non-ASCII bytes and a Content-Type
>> field with a charset parameter is also present. In this case the input
>> bytes should probably be decoded using that specified charset instead
>> of Latin 1. This is the opposite situation to the LENIENT writing mode
>> where we encode header fields using the charset from the Content-Type
>> field.
>>
>
> To me, parsing of MIME headers using any charset other that US-ASCII never
> made any sense of what so ever, but so be it.

Interesting.. I always thought that the lenient writing mode is a
requirement coming from the http world or HC more specifically. And if
someone wants to write non-ascii chars using the content-type charset
it seems only natural that a corresponding parsing mode is also
required.

> So, in the lenient mode, effectively, we would have to do the following: (1)
> parse headers (at least partially) in order to locate Content-Type header
> and extract the charset attribute from it, if present; (2) parse all headers
> again (probably, lazily) using the charset from the Content-Type.

Exactly.

> That's quite a bit of extra work.

Well, actually not very much, at least when compared to the current
code. Currently header fields are already parsed twice, by
AbstractEntity and a second time by MessageBuilder. MessageBuilder
would only have to keep the raw fields in a list and defer the Field
creation until endHeader() is invoked..

>> Okay, so now assume we have parsed that message and use the LENIENT
>> writing mode to write it out again. Clearly we have a serious round
>> tripping issue now, because Latin 1 was used to decode the fields but
>> the potentially different Content-Type charset is used to encode them
>> again.
>>
>> I think the inherent problem is that AbstractEntity attempts to
>> convert bytes into characters. This should not happen so early in the
>> process.
>>
>> In my opinion it would be better if AbstractEntity treated a header
>> field as a byte array. It would be better to pass a byte array to a
>> ContentHandler or a BodyDescriptor. The ContentHandler /
>> BodyDescriptor implementation can then decide how to decode the bytes.
>>
>
> This would push the responsibility of detecting the charset and correct
> parsing of headers to individual ContentHandler implementations and would
> make the task of implementing a ContentHandler more complex, but probably is
> the most flexible solution to the problem.

A ContentHandler receives everything else as a sequence of bytes; the
message body as well as preamble/epilogue of a multipart. Only the
headers come as string.

>> This could really help with the goal of complete round tripping..
>> Class Field could store the original raw field value in a byte array
>> instead a String.
>>
>> One drawback would be that duplicate parsing of header fields is maybe
>> inevitable..
>>
>> Opinions?
>>
>
> I am in favor of using ByteArrayBuffer at the ContentHandler level, even
> though this would make the task of implementing it more difficult.
>
> Oleg
>
>> Markus
>>
>> PS: I don't indent to stop 0.6 but maybe we should keep the note
>> regarding round trip issues in the release notes.

Mime
View raw message