On 2/3/09, Markus Wiederkehr wrote: > On Tue, Feb 3, 2009 at 9:27 PM, Robert Burrell Donkin > wrote: >> On 2/3/09, Markus Wiederkehr wrote: >>> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin >>> wrote: >>>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr >>>> wrote: >>>>> When Mime4j is used to build a DOM some header fields are parsed more >>>>> than once.. >>>> >>>> ...i suspected that this might be the case... >>>> >>>>> This is what happens: >>>>> * AbstractEntity.parseField parses the raw string into a name-value >>>>> pair >>>>> * AbstractEntity.parseField calls MutableBodyDescriptor.addField for >>>>> valid fields >>>>> * DefaultBodyDescriptor parses header fields such as Content-Type >>>>> * MaximalBodyDescriptor parses additional header fields such as >>>>> Mime-Version >>>>> * eventually MimeStreamParser.parse retrieves the raw field from >>>>> AbstractEntity and notifies a ContentHandler >>>>> * the ContentHandler (again) has to parse the raw string into name and >>>>> value >>>>> * the ContentHandler (again) has to parse certain structural fields >>>>> already parsed by a body descriptor >>>> >>>> IIRC the descriptor stuff was added to satisfy requirements for SAX >>>> and pull parsing downstream (in particular IMAP which uses a lot of >>>> MIME meta-data) >>>> >>>>> In case of building a DOM the latter two items are done by >>>>> MessageBuilder by calling Field.parse(). >>>>> >>>>> There are several issues here: >>>>> * the ContentHandler has to do a lot of work that has already been >>>>> done before by AbstractEntity. >>>> >>>> well, probably done before (dependent of parser settings) >>>> >>>>> * Field.parse() uses a JavaCC-generated parser whereas the body >>>>> descriptors have their own parsing code. >>>> >>>> yes - the body descriptors support a wider variety of fields than the >>>> Field stuff >>>> >>>>> * we have unnecessary duplicate code and the two parsers are very >>>>> likely inconsistent with each other. >>>> >>>> yes - i trust the code in AbstractEntity more highly then that in Field >>> >>> I'm not so sure.. For example AbstractEntity uses >>> MimeUtil.getHeaderParams() to parse Content-Type and >>> Content-Disposition. It looks like getHeaderParams does not correctly >>> handle comments that are allowed in quoted-strings.. Consider this >>> example: >>> >>> String test = "text/plain; key=value; foo= (test) \"bar\""; >>> Map headerParams = >>> MimeUtil.getHeaderParams(test); >>> System.out.println(headerParams); >>> >>> The result is {=text/plain, foo=(test), key=value} which is wrong. >> >> Comments are a real PITA - users of high level APIs don't want to have >> to reparse values to strip comments. Most of this stuff was intended >> to be called after full normalization. If we switch to fields and >> JavaCC then it might be possible to support comment extraction which >> might be useful one day. > > Unfortunately this full normalization does not happen anywhere so I > think this is actually a bug. I'm not sure whether comment extraction > would make much sense but comments definitely have to be ignored for > now. > > Check it out for yourself: try parsing a message that has a a content > type like "multipart/mixed; boundary=(comment)boundary" and watch > Mime4j fail.. I'm pretty sure that there a number of other ways that the failure to fully normalize causes issues - for example, I doubt unescaping international characters works correctly. >>> The Field equivalent.. >>> >>> ContentTypeField f = Fields.contentType(test); >>> System.out.println(f.getParameters()); >>> >>> ..gives the correct result: {foo=bar, key=value}. >>> >>> But of course we could fix this and use MimeUtil.getHeaderParams() for >>> parsing Fields, too. In fact I would not mind if most of the javacc >>> parsers were replaced by handcrafted equivalents.. >> >> I'm happy to use the JavaCC parsers provided that they are available >> and debugged for all the required types. >> >>> >>>>> * parsing issues have to be addressed twice. >>>> >>>> to support the wider range of fields in the DOM, then yes >>>> >>>>> To resolve these problems I would like to propose a drastic change to >>>>> the >>>>> API: >>>>> * AbstractEntity should use Field.parse to parse the raw field >>>>> * The body descriptor can extract information from the Field >>>>> instances without need of their own parsing code >>>>> * A ContentHandler would be notified of a Field instead of a string >>>>> * MessageBuilder could simply store the field and would not have to >>>>> parse it again. >>>>> >>>>> Please note that with the recent changes to the API all concrete Field >>>>> classes parse the field value lazily so there should be no significant >>>>> performance impact. >>>> >>>> i have no essential objections to the strategy though i think maybe >>>> some discussion on tactics might be worthwhile >>>> >>>> the pull parsing stuff only stores a limited number of headers. for >>>> any header which isn't of interest to the user, the field is only >>>> parsed into header and value. this is about the minimum work required >>>> to discover the header name and so whether it's of interest. >>>> >>>> the typical use case for not being interested in headers is >>>> performance. so, though creating a field is a small hit, it would be >>>> in keeping with the spirit of the API to add something like >>>> isFieldSkippable (but with a better name) as well as modifying >>>> addField to accept a field. then only when the descriptor is >>>> interested in the result would the field need to be created. the SAX >>>> handler would then take every field in full which is fine since it's a >>>> higher level API. >>> >>> Sounds reasonable.. >>> >>>> if this is the approach adopted then there are some improvements which >>>> could be made to the structure of the descriptor classes. i should be >>>> able to find time to help out with them. >>> >>> +1 >>> >>> All in all I think this is something for 0.7.. >> >> I'm getting a little worried that 0.7 is getting pushed further and >> further, not closer. Mime4J is not mature enough to worry about API >> compatibility so I wonder whether we'd be better to just look to cut >> 0.7 soon and then follow up soon with a 0.8 featuring this redesign. > > I actually want to postpone this. Our next release is 0.6 ;-) Good let's write up a JIRA and get 0.6 cut :-) Robert > > Markus >