james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Wiederkehr <markus.wiederk...@gmail.com>
Subject Re: Duplicate parsing of some header fields
Date Tue, 03 Feb 2009 16:06:58 GMT
On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
<robertburrelldonkin@gmail.com> wrote:
> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
> <markus.wiederkehr@gmail.com> wrote:
>> When Mime4j is used to build a DOM some header fields are parsed more
>> than once..
>
> ...i suspected that this might be the case...
>
>> This is what happens:
>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>> valid fields
>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>  * MaximalBodyDescriptor parses additional header fields such as Mime-Version
>>  * eventually MimeStreamParser.parse retrieves the raw field from
>> AbstractEntity and notifies a ContentHandler
>>  * the ContentHandler (again) has to parse the raw string into name and value
>>  * the ContentHandler (again) has to parse certain structural fields
>> already parsed by a body descriptor
>
> IIRC the descriptor stuff was added to satisfy requirements for SAX
> and pull parsing downstream (in particular IMAP which uses a lot of
> MIME meta-data)
>
>> In case of building a DOM the latter two items are done by
>> MessageBuilder by calling Field.parse().
>>
>> There are several issues here:
>>  * the ContentHandler has to do a lot of work that has already been
>> done before by AbstractEntity.
>
> well, probably done before (dependent of parser settings)
>
>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>> descriptors have their own parsing code.
>
> yes - the body descriptors support a wider variety of fields than the
> Field stuff
>
>>  * we have unnecessary duplicate code and the two parsers are very
>> likely inconsistent with each other.
>
> yes - i trust the code in AbstractEntity more highly then that in Field

I'm not so sure.. For example AbstractEntity uses
MimeUtil.getHeaderParams() to parse Content-Type and
Content-Disposition. It looks like getHeaderParams does not correctly
handle comments that are allowed in quoted-strings.. Consider this
example:

        String test = "text/plain; key=value; foo= (test) \"bar\"";
        Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
        System.out.println(headerParams);

The result is {=text/plain, foo=(test), key=value} which is wrong. The
Field equivalent..

        ContentTypeField f = Fields.contentType(test);
        System.out.println(f.getParameters());

..gives the correct result: {foo=bar, key=value}.

But of course we could fix this and use MimeUtil.getHeaderParams() for
parsing Fields, too. In fact I would not mind if most of the javacc
parsers were replaced by handcrafted equivalents..

>>  * parsing issues have to be addressed twice.
>
> to support the wider range of fields in the DOM, then yes
>
>> To resolve these problems I would like to propose a drastic change to the API:
>>  * AbstractEntity should use Field.parse to parse the raw field
>>  * The body descriptor can extract information from the Field
>> instances without need of their own parsing code
>>  * A ContentHandler would be notified of a Field instead of a string
>>  * MessageBuilder could simply store the field and would not have to
>> parse it again.
>>
>> Please note that with the recent changes to the API all concrete Field
>> classes parse the field value lazily so there should be no significant
>> performance impact.
>
> i have no essential objections to the strategy though i think maybe
> some discussion on tactics might be worthwhile
>
> the pull parsing stuff only stores a limited number of headers. for
> any header which isn't of interest to the user, the field is only
> parsed into header and value. this is about the minimum work required
> to discover the header name and so whether it's of interest.
>
> the typical use case for not being interested in headers is
> performance. so, though creating a field is a small hit, it would be
> in keeping with the spirit of the API to add something like
> isFieldSkippable (but with a better name) as well as modifying
> addField to accept a field. then only when the descriptor is
> interested in the result would the field need to be created. the SAX
> handler would then take every field in full which is fine since it's a
> higher level API.

Sounds reasonable..

> if this is the approach adopted then there are some improvements which
> could be made to the structure of the descriptor classes. i should be
> able to find time to help out with them.

+1

All in all I think this is something for 0.7..

Markus

Mime
View raw message