james-mime4j-dev mailing list archives

From: Oleg Kalnichevski <ol...@apache.org>
Subject: Re: Round tripping (MIME4J-112)
Date: Tue, 10 Feb 2009 19:28:46 GMT
Markus Wiederkehr wrote:
> I've been investigating the current code a little bit more and I've
> come to think that something is really going wrong. Please have a look
> at the code.
> 
> Class AbstractEntity has a ByteArrayBuffer "linebuf" and a
> CharArrayBuffer "fieldbuf". Method fillFieldBuffer() copies bytes from
> the underlying stream to "linebuf" (line 146:
> instream.readLine(linebuf)). Later on the input bytes are appended to
> "fieldbuf" (line 143: fieldbuf.append(linebuf, 0, len)). At this point
> bytes are decoded into characters. A closer look at CharArrayBuffer
> reveals how this is done:
> 
>     int ch = b[i1];
>     if (ch < 0) {
>         ch = 256 + ch;
>     }
> 
> This is equivalent to ISO-8859-1 conversion because Latin 1 is the
> only charset that directly maps byte values 00 to ff to Unicode code
> points 0000 to 00ff.
> 
> All works well as long as the underlying stream only contains ASCII bytes.
> 
> But assume the message contains non-ASCII bytes and a Content-Type
> field with a charset parameter is also present. In this case the input
> bytes should probably be decoded using that specified charset instead
> of Latin 1. This is the opposite situation to the LENIENT writing mode
> where we encode header fields using the charset from the Content-Type
> field.
> 

To me, parsing of MIME headers using any charset other than US-ASCII 
never made any sense whatsoever, but so be it.
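
For reference, the conversion Markus quoted is indeed byte-for-byte 
identical to decoding with ISO-8859-1. A minimal sketch (standalone 
Java, not mime4j code) that demonstrates the equivalence:

    import java.nio.charset.Charset;

    public class Latin1Equivalence {
        public static void main(String[] args) {
            byte[] raw = new byte[256];
            for (int i = 0; i < 256; i++) {
                raw[i] = (byte) i;
            }

            // Manual conversion, as in CharArrayBuffer: map the signed
            // byte back into the 0..255 range and cast to char.
            char[] manual = new char[raw.length];
            for (int i = 0; i < raw.length; i++) {
                int ch = raw[i];
                if (ch < 0) {
                    ch = 256 + ch;
                }
                manual[i] = (char) ch;
            }

            // ISO-8859-1 decoding yields exactly the same characters,
            // since Latin 1 maps bytes 00..ff straight to code points
            // 0000..00ff.
            String decoded = new String(raw, Charset.forName("ISO-8859-1"));
            System.out.println(decoded.equals(new String(manual))); // true
        }
    }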

So, in the lenient mode, effectively, we would have to do the following: 
(1) parse the headers (at least partially) in order to locate the 
Content-Type header and extract the charset attribute from it, if 
present; (2) parse all headers again (probably lazily) using the charset 
from the Content-Type.

That's quite a bit of extra work.
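
Roughly along these lines; all names below are hypothetical, not 
existing mime4j API, and the charset extraction is deliberately naive 
(no header folding, comments or quoting edge cases):

    import java.nio.charset.Charset;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical two-pass decoder: pass 1 scans the raw header lines
    // as US-ASCII to find the charset parameter of Content-Type;
    // pass 2 decodes all fields again with that charset.
    class TwoPassHeaderDecoder {

        List<String> decode(List<byte[]> rawFields) {
            Charset charset = Charset.forName("US-ASCII");

            // Pass 1: locate Content-Type and its charset parameter.
            for (byte[] raw : rawFields) {
                String ascii = new String(raw, Charset.forName("US-ASCII"));
                if (!ascii.regionMatches(true, 0, "Content-Type:", 0, 13)) {
                    continue;
                }
                int idx = ascii.toLowerCase().indexOf("charset=");
                if (idx < 0) {
                    continue;
                }
                String name = ascii.substring(idx + "charset=".length()).trim();
                int end = name.indexOf(';');
                if (end >= 0) {
                    name = name.substring(0, end).trim();
                }
                if (name.startsWith("\"") && name.endsWith("\"")
                        && name.length() > 1) {
                    name = name.substring(1, name.length() - 1);
                }
                charset = Charset.forName(name);
            }

            // Pass 2: decode every field with the charset found above.
            List<String> decoded = new ArrayList<String>();
            for (byte[] raw : rawFields) {
                decoded.add(new String(raw, charset));
            }
            return decoded;
        }
    }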


> Okay, so now assume we have parsed that message and use the LENIENT
> writing mode to write it out again. Clearly we have a serious round
> tripping issue now, because Latin 1 was used to decode the fields but
> the potentially different Content-Type charset is used to encode them
> again.
> 
> I think the inherent problem is that AbstractEntity attempts to
> convert bytes into characters. This should not happen so early in the
> process.
> 
> In my opinion it would be better if AbstractEntity treated a header
> field as a byte array and passed that byte array to a ContentHandler
> or a BodyDescriptor. The ContentHandler / BodyDescriptor
> implementation can then decide how to decode the bytes.
> 

This would push the responsibility of detecting the charset and 
correctly parsing the headers to individual ContentHandler 
implementations and would make the task of implementing a ContentHandler 
more complex, but it is probably the most flexible solution to the 
problem.
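
Pushed down to the handler, it could look roughly like this (a 
hypothetical callback, not the current ContentHandler interface):

    import java.nio.charset.Charset;

    // Hypothetical callback that receives the raw, undecoded field
    // bytes instead of a pre-decoded String.
    interface RawFieldHandler {
        void field(byte[] rawField);
    }

    // Example implementation: starts with a Latin 1 fallback and is
    // free to switch charsets once it has seen the Content-Type field.
    class CharsetAwareHandler implements RawFieldHandler {

        private Charset charset = Charset.forName("ISO-8859-1");

        public void field(byte[] rawField) {
            // Decode lazily; the original bytes stay available for
            // re-decoding or for writing out verbatim.
            String decoded = new String(rawField, charset);
            System.out.println(decoded);
        }
    }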


> This could really help with the goal of complete round tripping.
> Class Field could store the original raw field value in a byte array
> instead of a String.
> 
> One drawback would be that duplicate parsing of header fields may be
> inevitable.
> 
> Opinions?
> 

I am in favor of using ByteArrayBuffer at the ContentHandler level, even 
though this would make the task of implementing it more difficult.
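
Along the same lines, a Field that keeps the original bytes (as Markus 
suggested) could decode lazily and still write the input back out 
verbatim; a sketch only, not the existing Field class:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.charset.Charset;

    // Hypothetical field that keeps the original raw bytes around.
    class RawField {

        private final byte[] raw;

        RawField(byte[] raw) {
            this.raw = raw;
        }

        // Decode on demand with whatever charset the caller determined.
        String getValue(Charset charset) {
            return new String(raw, charset);
        }

        // Round tripping: write the original bytes untouched.
        void writeTo(OutputStream out) throws IOException {
            out.write(raw);
        }
    }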

Oleg

> Markus
> 
> PS: I don't intend to stop 0.6, but maybe we should keep the note
> regarding round trip issues in the release notes.

