james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Wiederkehr <markus.wiederk...@gmail.com>
Subject Re: [jira] Assigned: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly
Date Mon, 16 Feb 2009 14:12:52 GMT
On Mon, Feb 16, 2009 at 2:49 PM, Stefano Bagnara <apache@bago.org> wrote:
> Markus Wiederkehr ha scritto:
>> In my opinion this issue is closely related to MIME4J-112 and MIME4J-116.
>>
>> I think that in the course of MIME4J-116 we should (maybe) create
>> Field instances in AbstractEntity instead of later on in
>> MessageBuilder. A Field object could store the raw data in a byte[]
>> instead of a String which would greatly help with MIME4J-112.
>>
>> The only problem is that the charset for a lenient parsing mode is not
>> known at this early point. But considering your clarification about
>> the lenient writing mode I wonder if anybody really needs a lenient
>> parsing mode. (I wonder if anyone really needs a lenient writing mode
>> for that matter.)
>
> Lenient Writing IMO is only needed if you need roundtrip. For
> standard/most MIME4J usages I don't see why we should write malformed
> data in output.

In my opinion Field should preserve the original bytes in a byte
array. Writing a message could simply use these original bytes and
there would be no roundtrip issues. Essentially there would be only
one writing mode.

In additional I would like to have a "visitor" or whatever that can be
used to tidy up a message.

> Lenient reading instead is part of  being a generic parsing library:
> most email clients correctly handle 8bit chars in the Subject header
> because it happens than some email client writes them unencoded. If you
> think mime4j could be used as the library for an email client it
> probably still worth handling 8bit chars in the headers.
> Of course there is no need to implement such a feature until someone
> really ask/need it.

My approach would still allow for that with a little overhead. If a
ContentHandler receives a Field and that field contains the original
raw bytes then nothing prevents the ContentHandler from parsing the
fields again; using any charset determined by whatever means. Also
structured fields are parsed lazily so the overhead would not be
tremendous.

> I don't really know nowadays how many email messages contains unencoded
> headers. 10 years ago, when I checked this stuff deeply almost 40% of
> international emails included unencoded headers. I expect this
> percentage to be much less today, but I don't know if it is 10% or 0.1%.
>
> Stefano
>
>> So maybe AbstractEntity should simply use US-ASCII to decode the
>> header fields without direct support for a lenient parsing mode that
>> nobody needs. Then AbstractEntity can build Field instances and a
>> ContentHandler receives those Field instances without having to parse
>> them again.
>>
>> All in all I'm not sure if #118 should be addressed independently of
>> 112 and 116 and whether 118 should be targeted for 0.6..
>>
>> But those are just my 2 cents,
>>
>> Markus
>>
>>
>> On Mon, Feb 16, 2009 at 1:27 PM, Oleg Kalnichevski (JIRA)
>> <mime4j-dev@james.apache.org> wrote:
>>>     [ https://issues.apache.org/jira/browse/MIME4J-118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
>>>
>>> Oleg Kalnichevski reassigned MIME4J-118:
>>> ----------------------------------------
>>>
>>>    Assignee: oleg.kalnichevski
>>>
>>> Working on a patch
>>>
>>> Oleg
>>>
>>>> MIME stream parser handles non-ASCII fields incorrectly
>>>> -------------------------------------------------------
>>>>
>>>>                 Key: MIME4J-118
>>>>                 URL: https://issues.apache.org/jira/browse/MIME4J-118
>>>>             Project: JAMES Mime4j
>>>>          Issue Type: Bug
>>>>            Reporter: Oleg Kalnichevski
>>>>            Assignee: oleg.kalnichevski
>>>>             Fix For: 0.6
>>>>
>>>>
>>>> Presently MIME stream parser handles non-ASCII fields incorrectly. Binary
field content gets converted to its textual representation too early in the parsing process
using simple byte to char cast. The decision about appropriate char encoding should be left
up to individual ContentHandler implementations.
>>>> Oleg
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>

Mime
View raw message