james-mime4j-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oleg Kalnichevski <ol...@apache.org>
Subject Re: Duplicate parsing of some header fields
Date Wed, 04 Feb 2009 13:26:00 GMT
Markus Wiederkehr wrote:
> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
> <robertburrelldonkin@gmail.com> wrote:
>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>> <markus.wiederkehr@gmail.com> wrote:
>>> When Mime4j is used to build a DOM some header fields are parsed more
>>> than once..
>> ...i suspected that this might be the case...
>>> This is what happens:
>>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>> valid fields
>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>  * MaximalBodyDescriptor parses additional header fields such as Mime-Version
>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>> AbstractEntity and notifies a ContentHandler
>>>  * the ContentHandler (again) has to parse the raw string into name and value
>>>  * the ContentHandler (again) has to parse certain structural fields
>>> already parsed by a body descriptor
>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>> and pull parsing downstream (in particular IMAP which uses a lot of
>> MIME meta-data)
>>> In case of building a DOM the latter two items are done by
>>> MessageBuilder by calling Field.parse().
>>> There are several issues here:
>>>  * the ContentHandler has to do a lot of work that has already been
>>> done before by AbstractEntity.
>> well, probably done before (dependent of parser settings)
>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>> descriptors have their own parsing code.
>> yes - the body descriptors support a wider variety of fields than the
>> Field stuff
>>>  * we have unnecessary duplicate code and the two parsers are very
>>> likely inconsistent with each other.
>> yes - i trust the code in AbstractEntity more highly then that in Field
> I'm not so sure.. For example AbstractEntity uses
> MimeUtil.getHeaderParams() to parse Content-Type and
> Content-Disposition. It looks like getHeaderParams does not correctly
> handle comments that are allowed in quoted-strings.. Consider this
> example:
>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>         Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
>         System.out.println(headerParams);
> The result is {=text/plain, foo=(test), key=value} which is wrong. The
> Field equivalent..
>         ContentTypeField f = Fields.contentType(test);
>         System.out.println(f.getParameters());
> ..gives the correct result: {foo=bar, key=value}.
> But of course we could fix this and use MimeUtil.getHeaderParams() for
> parsing Fields, too. In fact I would not mind if most of the javacc
> parsers were replaced by handcrafted equivalents..


For what is it worth to you. HttpComponents Core provides an extensive 
support for parsing and formatting of MIME headers. The code has been 
around for quite some time, has been relatively well tested, has been 
extensively optimized for performance and memory footprint and is able 
to operate with virtually zero intermediate garbage and no intermediate 
memory copying.

The parsing/formatting framework is based on the CharArrayBuffer class, 
which is used by MimeTokenStream for low level parsing operations.


Feel free to take a look at the code and see if it might make sense 
replacing MimeUtil.getHeaderParams(value) with some bits from HC.

I fully admit my personal bias, so feel free to ignore me.


View raw message