james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Burrell Donkin <robertburrelldon...@gmail.com>
Subject Re: Duplicate parsing of some header fields
Date Tue, 03 Feb 2009 21:08:37 GMT
On 2/3/09, Markus Wiederkehr <markus.wiederkehr@gmail.com> wrote:
> On Tue, Feb 3, 2009 at 9:27 PM, Robert Burrell Donkin
> <robertburrelldonkin@gmail.com> wrote:
>> On 2/3/09, Markus Wiederkehr <markus.wiederkehr@gmail.com> wrote:
>>> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
>>> <robertburrelldonkin@gmail.com> wrote:
>>>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>>>> <markus.wiederkehr@gmail.com> wrote:
>>>>> When Mime4j is used to build a DOM some header fields are parsed more
>>>>> than once..
>>>> ...i suspected that this might be the case...
>>>>> This is what happens:
>>>>>  * AbstractEntity.parseField parses the raw string into a name-value
>>>>> pair
>>>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>>>> valid fields
>>>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>>>  * MaximalBodyDescriptor parses additional header fields such as
>>>>> Mime-Version
>>>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>>>> AbstractEntity and notifies a ContentHandler
>>>>>  * the ContentHandler (again) has to parse the raw string into name and
>>>>> value
>>>>>  * the ContentHandler (again) has to parse certain structural fields
>>>>> already parsed by a body descriptor
>>>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>>>> and pull parsing downstream (in particular IMAP which uses a lot of
>>>> MIME meta-data)
>>>>> In case of building a DOM the latter two items are done by
>>>>> MessageBuilder by calling Field.parse().
>>>>> There are several issues here:
>>>>>  * the ContentHandler has to do a lot of work that has already been
>>>>> done before by AbstractEntity.
>>>> well, probably done before (dependent of parser settings)
>>>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>>>> descriptors have their own parsing code.
>>>> yes - the body descriptors support a wider variety of fields than the
>>>> Field stuff
>>>>>  * we have unnecessary duplicate code and the two parsers are very
>>>>> likely inconsistent with each other.
>>>> yes - i trust the code in AbstractEntity more highly then that in Field
>>> I'm not so sure.. For example AbstractEntity uses
>>> MimeUtil.getHeaderParams() to parse Content-Type and
>>> Content-Disposition. It looks like getHeaderParams does not correctly
>>> handle comments that are allowed in quoted-strings.. Consider this
>>> example:
>>>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>>>         Map<String, String> headerParams =
>>> MimeUtil.getHeaderParams(test);
>>>         System.out.println(headerParams);
>>> The result is {=text/plain, foo=(test), key=value} which is wrong.
>> Comments are a real PITA - users of high level APIs don't want to have
>> to reparse values to strip comments. Most of this stuff was intended
>> to be called after full normalization. If we switch to fields and
>> JavaCC then it might be possible to support comment extraction which
>> might be useful one day.
> Unfortunately this full normalization does not happen anywhere so I
> think this is actually a bug. I'm not sure whether comment extraction
> would make much sense but comments definitely have to be ignored for
> now.
> Check it out for yourself: try parsing a message that has a a content
> type like "multipart/mixed; boundary=(comment)boundary" and watch
> Mime4j fail..

I'm pretty sure that there a number of other ways that the failure to
fully normalize causes issues - for example, I doubt unescaping
international characters works correctly.

>>> The Field equivalent..
>>>         ContentTypeField f = Fields.contentType(test);
>>>         System.out.println(f.getParameters());
>>> ..gives the correct result: {foo=bar, key=value}.
>>> But of course we could fix this and use MimeUtil.getHeaderParams() for
>>> parsing Fields, too. In fact I would not mind if most of the javacc
>>> parsers were replaced by handcrafted equivalents..
>> I'm happy to use the JavaCC parsers provided that they are available
>> and debugged for all the required types.
>>>>>  * parsing issues have to be addressed twice.
>>>> to support the wider range of fields in the DOM, then yes
>>>>> To resolve these problems I would like to propose a drastic change to
>>>>> the
>>>>> API:
>>>>>  * AbstractEntity should use Field.parse to parse the raw field
>>>>>  * The body descriptor can extract information from the Field
>>>>> instances without need of their own parsing code
>>>>>  * A ContentHandler would be notified of a Field instead of a string
>>>>>  * MessageBuilder could simply store the field and would not have to
>>>>> parse it again.
>>>>> Please note that with the recent changes to the API all concrete Field
>>>>> classes parse the field value lazily so there should be no significant
>>>>> performance impact.
>>>> i have no essential objections to the strategy though i think maybe
>>>> some discussion on tactics might be worthwhile
>>>> the pull parsing stuff only stores a limited number of headers. for
>>>> any header which isn't of interest to the user, the field is only
>>>> parsed into header and value. this is about the minimum work required
>>>> to discover the header name and so whether it's of interest.
>>>> the typical use case for not being interested in headers is
>>>> performance. so, though creating a field is a small hit, it would be
>>>> in keeping with the spirit of the API to add something like
>>>> isFieldSkippable (but with a better name) as well as modifying
>>>> addField to accept a field. then only when the descriptor is
>>>> interested in the result would the field need to be created. the SAX
>>>> handler would then take every field in full which is fine since it's a
>>>> higher level API.
>>> Sounds reasonable..
>>>> if this is the approach adopted then there are some improvements which
>>>> could be made to the structure of the descriptor classes. i should be
>>>> able to find time to help out with them.
>>> +1
>>> All in all I think this is something for 0.7..
>> I'm getting a little worried that 0.7 is getting pushed further and
>> further, not closer. Mime4J is not mature enough to worry about API
>> compatibility so I wonder whether we'd be better to just look to cut
>> 0.7 soon and then follow up soon with a 0.8 featuring this redesign.
> I actually want to postpone this. Our next release is 0.6 ;-)

Good let's write up a JIRA and get 0.6 cut :-)


> Markus

View raw message