On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin wrote: > On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr > wrote: >> When Mime4j is used to build a DOM some header fields are parsed more >> than once.. > > ...i suspected that this might be the case... > >> This is what happens: >> * AbstractEntity.parseField parses the raw string into a name-value pair >> * AbstractEntity.parseField calls MutableBodyDescriptor.addField for >> valid fields >> * DefaultBodyDescriptor parses header fields such as Content-Type >> * MaximalBodyDescriptor parses additional header fields such as Mime-Version >> * eventually MimeStreamParser.parse retrieves the raw field from >> AbstractEntity and notifies a ContentHandler >> * the ContentHandler (again) has to parse the raw string into name and value >> * the ContentHandler (again) has to parse certain structural fields >> already parsed by a body descriptor > > IIRC the descriptor stuff was added to satisfy requirements for SAX > and pull parsing downstream (in particular IMAP which uses a lot of > MIME meta-data) > >> In case of building a DOM the latter two items are done by >> MessageBuilder by calling Field.parse(). >> >> There are several issues here: >> * the ContentHandler has to do a lot of work that has already been >> done before by AbstractEntity. > > well, probably done before (dependent of parser settings) > >> * Field.parse() uses a JavaCC-generated parser whereas the body >> descriptors have their own parsing code. > > yes - the body descriptors support a wider variety of fields than the > Field stuff > >> * we have unnecessary duplicate code and the two parsers are very >> likely inconsistent with each other. > > yes - i trust the code in AbstractEntity more highly then that in Field I'm not so sure.. For example AbstractEntity uses MimeUtil.getHeaderParams() to parse Content-Type and Content-Disposition. It looks like getHeaderParams does not correctly handle comments that are allowed in quoted-strings.. Consider this example: String test = "text/plain; key=value; foo= (test) \"bar\""; Map headerParams = MimeUtil.getHeaderParams(test); System.out.println(headerParams); The result is {=text/plain, foo=(test), key=value} which is wrong. The Field equivalent.. ContentTypeField f = Fields.contentType(test); System.out.println(f.getParameters()); ..gives the correct result: {foo=bar, key=value}. But of course we could fix this and use MimeUtil.getHeaderParams() for parsing Fields, too. In fact I would not mind if most of the javacc parsers were replaced by handcrafted equivalents.. >> * parsing issues have to be addressed twice. > > to support the wider range of fields in the DOM, then yes > >> To resolve these problems I would like to propose a drastic change to the API: >> * AbstractEntity should use Field.parse to parse the raw field >> * The body descriptor can extract information from the Field >> instances without need of their own parsing code >> * A ContentHandler would be notified of a Field instead of a string >> * MessageBuilder could simply store the field and would not have to >> parse it again. >> >> Please note that with the recent changes to the API all concrete Field >> classes parse the field value lazily so there should be no significant >> performance impact. > > i have no essential objections to the strategy though i think maybe > some discussion on tactics might be worthwhile > > the pull parsing stuff only stores a limited number of headers. for > any header which isn't of interest to the user, the field is only > parsed into header and value. this is about the minimum work required > to discover the header name and so whether it's of interest. > > the typical use case for not being interested in headers is > performance. so, though creating a field is a small hit, it would be > in keeping with the spirit of the API to add something like > isFieldSkippable (but with a better name) as well as modifying > addField to accept a field. then only when the descriptor is > interested in the result would the field need to be created. the SAX > handler would then take every field in full which is fine since it's a > higher level API. Sounds reasonable.. > if this is the approach adopted then there are some improvements which > could be made to the structure of the descriptor classes. i should be > able to find time to help out with them. +1 All in all I think this is something for 0.7.. Markus