james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Wiederkehr <markus.wiederk...@gmail.com>
Subject Re: Duplicate parsing of some header fields
Date Tue, 03 Feb 2009 20:45:37 GMT
On Tue, Feb 3, 2009 at 9:27 PM, Robert Burrell Donkin
<robertburrelldonkin@gmail.com> wrote:
> On 2/3/09, Markus Wiederkehr <markus.wiederkehr@gmail.com> wrote:
>> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
>> <robertburrelldonkin@gmail.com> wrote:
>>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>>> <markus.wiederkehr@gmail.com> wrote:
>>>> When Mime4j is used to build a DOM some header fields are parsed more
>>>> than once..
>>> ...i suspected that this might be the case...
>>>> This is what happens:
>>>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>>> valid fields
>>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>>  * MaximalBodyDescriptor parses additional header fields such as
>>>> Mime-Version
>>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>>> AbstractEntity and notifies a ContentHandler
>>>>  * the ContentHandler (again) has to parse the raw string into name and
>>>> value
>>>>  * the ContentHandler (again) has to parse certain structural fields
>>>> already parsed by a body descriptor
>>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>>> and pull parsing downstream (in particular IMAP which uses a lot of
>>> MIME meta-data)
>>>> In case of building a DOM the latter two items are done by
>>>> MessageBuilder by calling Field.parse().
>>>> There are several issues here:
>>>>  * the ContentHandler has to do a lot of work that has already been
>>>> done before by AbstractEntity.
>>> well, probably done before (dependent of parser settings)
>>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>>> descriptors have their own parsing code.
>>> yes - the body descriptors support a wider variety of fields than the
>>> Field stuff
>>>>  * we have unnecessary duplicate code and the two parsers are very
>>>> likely inconsistent with each other.
>>> yes - i trust the code in AbstractEntity more highly then that in Field
>> I'm not so sure.. For example AbstractEntity uses
>> MimeUtil.getHeaderParams() to parse Content-Type and
>> Content-Disposition. It looks like getHeaderParams does not correctly
>> handle comments that are allowed in quoted-strings.. Consider this
>> example:
>>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>>         Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
>>         System.out.println(headerParams);
>> The result is {=text/plain, foo=(test), key=value} which is wrong.
> Comments are a real PITA - users of high level APIs don't want to have
> to reparse values to strip comments. Most of this stuff was intended
> to be called after full normalization. If we switch to fields and
> JavaCC then it might be possible to support comment extraction which
> might be useful one day.

Unfortunately this full normalization does not happen anywhere so I
think this is actually a bug. I'm not sure whether comment extraction
would make much sense but comments definitely have to be ignored for

Check it out for yourself: try parsing a message that has a a content
type like "multipart/mixed; boundary=(comment)boundary" and watch
Mime4j fail..

>> The Field equivalent..
>>         ContentTypeField f = Fields.contentType(test);
>>         System.out.println(f.getParameters());
>> ..gives the correct result: {foo=bar, key=value}.
>> But of course we could fix this and use MimeUtil.getHeaderParams() for
>> parsing Fields, too. In fact I would not mind if most of the javacc
>> parsers were replaced by handcrafted equivalents..
> I'm happy to use the JavaCC parsers provided that they are available
> and debugged for all the required types.
>>>>  * parsing issues have to be addressed twice.
>>> to support the wider range of fields in the DOM, then yes
>>>> To resolve these problems I would like to propose a drastic change to the
>>>> API:
>>>>  * AbstractEntity should use Field.parse to parse the raw field
>>>>  * The body descriptor can extract information from the Field
>>>> instances without need of their own parsing code
>>>>  * A ContentHandler would be notified of a Field instead of a string
>>>>  * MessageBuilder could simply store the field and would not have to
>>>> parse it again.
>>>> Please note that with the recent changes to the API all concrete Field
>>>> classes parse the field value lazily so there should be no significant
>>>> performance impact.
>>> i have no essential objections to the strategy though i think maybe
>>> some discussion on tactics might be worthwhile
>>> the pull parsing stuff only stores a limited number of headers. for
>>> any header which isn't of interest to the user, the field is only
>>> parsed into header and value. this is about the minimum work required
>>> to discover the header name and so whether it's of interest.
>>> the typical use case for not being interested in headers is
>>> performance. so, though creating a field is a small hit, it would be
>>> in keeping with the spirit of the API to add something like
>>> isFieldSkippable (but with a better name) as well as modifying
>>> addField to accept a field. then only when the descriptor is
>>> interested in the result would the field need to be created. the SAX
>>> handler would then take every field in full which is fine since it's a
>>> higher level API.
>> Sounds reasonable..
>>> if this is the approach adopted then there are some improvements which
>>> could be made to the structure of the descriptor classes. i should be
>>> able to find time to help out with them.
>> +1
>> All in all I think this is something for 0.7..
> I'm getting a little worried that 0.7 is getting pushed further and
> further, not closer. Mime4J is not mature enough to worry about API
> compatibility so I wonder whether we'd be better to just look to cut
> 0.7 soon and then follow up soon with a 0.8 featuring this redesign.

I actually want to postpone this. Our next release is 0.6 ;-)


View raw message