james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Burrell Donkin <robertburrelldon...@gmail.com>
Subject Re: Duplicate parsing of some header fields
Date Tue, 03 Feb 2009 20:27:58 GMT
On 2/3/09, Markus Wiederkehr <markus.wiederkehr@gmail.com> wrote:
> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
> <robertburrelldonkin@gmail.com> wrote:
>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>> <markus.wiederkehr@gmail.com> wrote:
>>> When Mime4j is used to build a DOM some header fields are parsed more
>>> than once..
>>
>> ...i suspected that this might be the case...
>>
>>> This is what happens:
>>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>> valid fields
>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>  * MaximalBodyDescriptor parses additional header fields such as
>>> Mime-Version
>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>> AbstractEntity and notifies a ContentHandler
>>>  * the ContentHandler (again) has to parse the raw string into name and
>>> value
>>>  * the ContentHandler (again) has to parse certain structural fields
>>> already parsed by a body descriptor
>>
>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>> and pull parsing downstream (in particular IMAP which uses a lot of
>> MIME meta-data)
>>
>>> In case of building a DOM the latter two items are done by
>>> MessageBuilder by calling Field.parse().
>>>
>>> There are several issues here:
>>>  * the ContentHandler has to do a lot of work that has already been
>>> done before by AbstractEntity.
>>
>> well, probably done before (dependent of parser settings)
>>
>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>> descriptors have their own parsing code.
>>
>> yes - the body descriptors support a wider variety of fields than the
>> Field stuff
>>
>>>  * we have unnecessary duplicate code and the two parsers are very
>>> likely inconsistent with each other.
>>
>> yes - i trust the code in AbstractEntity more highly then that in Field
>
> I'm not so sure.. For example AbstractEntity uses
> MimeUtil.getHeaderParams() to parse Content-Type and
> Content-Disposition. It looks like getHeaderParams does not correctly
> handle comments that are allowed in quoted-strings.. Consider this
> example:
>
>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>         Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
>         System.out.println(headerParams);
>
> The result is {=text/plain, foo=(test), key=value} which is wrong.

Comments are a real PITA - users of high level APIs don't want to have
to reparse values to strip comments. Most of this stuff was intended
to be called after full normalization. If we switch to fields and
JavaCC then it might be possible to support comment extraction which
might be useful one day.

> The Field equivalent..
>
>         ContentTypeField f = Fields.contentType(test);
>         System.out.println(f.getParameters());
>
> ..gives the correct result: {foo=bar, key=value}.
>
> But of course we could fix this and use MimeUtil.getHeaderParams() for
> parsing Fields, too. In fact I would not mind if most of the javacc
> parsers were replaced by handcrafted equivalents..

I'm happy to use the JavaCC parsers provided that they are available
and debugged for all the required types.

>
>>>  * parsing issues have to be addressed twice.
>>
>> to support the wider range of fields in the DOM, then yes
>>
>>> To resolve these problems I would like to propose a drastic change to the
>>> API:
>>>  * AbstractEntity should use Field.parse to parse the raw field
>>>  * The body descriptor can extract information from the Field
>>> instances without need of their own parsing code
>>>  * A ContentHandler would be notified of a Field instead of a string
>>>  * MessageBuilder could simply store the field and would not have to
>>> parse it again.
>>>
>>> Please note that with the recent changes to the API all concrete Field
>>> classes parse the field value lazily so there should be no significant
>>> performance impact.
>>
>> i have no essential objections to the strategy though i think maybe
>> some discussion on tactics might be worthwhile
>>
>> the pull parsing stuff only stores a limited number of headers. for
>> any header which isn't of interest to the user, the field is only
>> parsed into header and value. this is about the minimum work required
>> to discover the header name and so whether it's of interest.
>>
>> the typical use case for not being interested in headers is
>> performance. so, though creating a field is a small hit, it would be
>> in keeping with the spirit of the API to add something like
>> isFieldSkippable (but with a better name) as well as modifying
>> addField to accept a field. then only when the descriptor is
>> interested in the result would the field need to be created. the SAX
>> handler would then take every field in full which is fine since it's a
>> higher level API.
>
> Sounds reasonable..
>
>> if this is the approach adopted then there are some improvements which
>> could be made to the structure of the descriptor classes. i should be
>> able to find time to help out with them.
>
> +1
>
> All in all I think this is something for 0.7..

I'm getting a little worried that 0.7 is getting pushed further and
further, not closer. Mime4J is not mature enough to worry about API
compatibility so I wonder whether we'd be better to just look to cut
0.7 soon and then follow up soon with a 0.8 featuring this redesign.

- Robert

Mime
View raw message