james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Burrell Donkin <robertburrelldon...@gmail.com>
Subject Re: Duplicate parsing of some header fields
Date Sun, 01 Feb 2009 21:34:26 GMT
On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
<markus.wiederkehr@gmail.com> wrote:
> When Mime4j is used to build a DOM some header fields are parsed more
> than once..

...i suspected that this might be the case...

> This is what happens:
>  * AbstractEntity.parseField parses the raw string into a name-value pair
>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
> valid fields
>  * DefaultBodyDescriptor parses header fields such as Content-Type
>  * MaximalBodyDescriptor parses additional header fields such as Mime-Version
>  * eventually MimeStreamParser.parse retrieves the raw field from
> AbstractEntity and notifies a ContentHandler
>  * the ContentHandler (again) has to parse the raw string into name and value
>  * the ContentHandler (again) has to parse certain structural fields
> already parsed by a body descriptor

IIRC the descriptor stuff was added to satisfy requirements for SAX
and pull parsing downstream (in particular IMAP which uses a lot of
MIME meta-data)

> In case of building a DOM the latter two items are done by
> MessageBuilder by calling Field.parse().
> There are several issues here:
>  * the ContentHandler has to do a lot of work that has already been
> done before by AbstractEntity.

well, probably done before (dependent of parser settings)

>  * Field.parse() uses a JavaCC-generated parser whereas the body
> descriptors have their own parsing code.

yes - the body descriptors support a wider variety of fields than the
Field stuff

>  * we have unnecessary duplicate code and the two parsers are very
> likely inconsistent with each other.

yes - i trust the code in AbstractEntity more highly then that in Field

>  * parsing issues have to be addressed twice.

to support the wider range of fields in the DOM, then yes

> To resolve these problems I would like to propose a drastic change to the API:
>  * AbstractEntity should use Field.parse to parse the raw field
>  * The body descriptor can extract information from the Field
> instances without need of their own parsing code
>  * A ContentHandler would be notified of a Field instead of a string
>  * MessageBuilder could simply store the field and would not have to
> parse it again.
> Please note that with the recent changes to the API all concrete Field
> classes parse the field value lazily so there should be no significant
> performance impact.

i have no essential objections to the strategy though i think maybe
some discussion on tactics might be worthwhile

the pull parsing stuff only stores a limited number of headers. for
any header which isn't of interest to the user, the field is only
parsed into header and value. this is about the minimum work required
to discover the header name and so whether it's of interest.

the typical use case for not being interested in headers is
performance. so, though creating a field is a small hit, it would be
in keeping with the spirit of the API to add something like
isFieldSkippable (but with a better name) as well as modifying
addField to accept a field. then only when the descriptor is
interested in the result would the field need to be created. the SAX
handler would then take every field in full which is fine since it's a
higher level API.

if this is the approach adopted then there are some improvements which
could be made to the structure of the descriptor classes. i should be
able to find time to help out with them.

- robert

View raw message