commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Kitching <>
Subject RE: [digester] mixed content update
Date Thu, 01 Apr 2004 06:23:44 GMT
On Wed, 2004-03-31 at 10:55, Edelson, Justin wrote:
> Simon - Sorry I haven't had a chance to respond to your email. I was
> actually more concentrating on answering your question about use-cases,
> but it sounds like I don't need to sell this need as much as I thought I
> did (at least for now).

Well, not to me anyway. 

It would be nice to know what some of the other Digester regulars think
though [well, that really means Robert Donkin and Craig McClanahan].
They've been involved with Digester much longer than myself.

> > If someone (eg Justin) is keen to work on this now, we could
> potentially get it in the next release. Otherwise I suggest this could
> go on the to-do list for post-1.6.
> I don't know if "keen" is the right word, but I'm committed to
> "digesting" mixed content XML for an application now, i.e. I need to
> solve this problem one way or another (or use some other XML ingesting
> mechanism, which I don't want to do for purely selfish reasons). At this
> point, I'm planning on an implementation as a subclass. Whether this
> subclass is accepted and put into the Digester release is dependent upon
> a variety of factors, but my intention is to develop a solution either
> way. I have sign-off on contributing modifications back to Apache, so
> that's not an issue.

That sounds great. Welcome to Digester development!

By the way, I looked at your bugzilla entry, but there seems to have
been some trouble with the attachments. It looks like the "core"
implementation was not attached, just some "auxillary files".

> Just to be clear, is there a timeframe for 1.6?

I am *very* keen to get a release out the door within 4 weeks.
Unfortunately, people keep coming up with really cool features to
include :-). We've been talking about a release for a while now, but
no-one else has suggested a goal date. One potential holdup is that it
would be nice for BeanUtils to release first - though it's not critical.

I'm currently shipping a commercial product with a digester library
built from CVS, and would love to get a release out to fix that - as
well as let people use all the nice features that have been added

> Let's separate the two issues in my original design - Using a special
> text designator and the specific designator used. To be honest, @text
> was really just a placeholder on my end. The only requirement I have for
> this designator is that it be an illegal XML element name so as to
> ensure that there's no conflict (i.e. if the designator was just text,
> that would pose an issue if you had an element named text). My core
> argument in favor of using a specific designator is that it explicitly
> indicates that the pattern (i.e. /element/@text) uses different
> functionality then the traditional Digester method. This is a pretty
> weak argument, I'll admit. I also feel (and can't prove yet) that this
> method is better performance-wise because the extra iterations over the
> list of rules is over a smaller list.

Well, "@" does have a special meaning in xpath. Digester's patterns
aren't xpath-compliant anyway, but it would be nice to avoid gratuitous
confusion. In xpath, "foo/@bar" means "the value of the bar attribute on
the foo element".

> As indicated above, I'm very willing to try alternate implementations,
> including the interface solution you suggested.

> My only concern about the interface solution (for lack of a better name)
> is when you wrote 'for each rule matched by the last call to
> startElement' - In my original subclassed-version, I had a call to
> super.startElement(), but in order to do what you've described, I think
> you'd need to replicate all the code in Digester.startElement() in the
> subclasses startElement() method. Otherwise, the overridden
> startElement() in the subclass would have to make an extra call to
> Rules.match(). I was originally worried about having to maintain the
> subclass's startElement method to reflect changes in the Digester
> implementation, thus the call to super.startElement(). Is this too
> dogmatic? I'm not looking to rehash the cut-and-paste vs. "eating our
> own dog food" discussion recently seen in the context of [lang].

Attached is a working solution based on my original proposal. It is
integrated into Digester rather than being a subclass. I think the
changes are simple enough, and performance impact low enough, that
integrating this into the mainline is ok. Obviously, other developers
are expected to chip in with their opinion on that....

Note that this patch assumes the one I posted earlier today (to keep a
stack of matched rules) is also accepted.

With this initial patch, there is only trivial increase in memory usage.

Regarding CPU overhead, each time an element starts or ends, it is
necessary to iterate over the list of Rules matched by its parent (which
was saved on the stack). Assuming that each tag matches 3 rules, that
means iterating over a list of 3 rules each time, and doing an
instanceof operation against the rule. 

That overhead would apply even if the user isn't interested in text
segments, but it is really pretty low compared to all the other work
that digester is doing.

But this could be optimised further by having a stack of Boolean objects
which parallels the stack of matched rules. On element start, we iterate
over the list of matched rules anyway, so could simply apply an
instanceof operation to each rule as we go, to determine if this set of
matches contains a rule implementing TextSegmentHandler, and push the
true/false state in parallel with the matched rules. Later, we can skip
the iteration over the list looking for TextSegmentHandler rules if the
corresponding flag tells us there aren't any.

Or Digester could have some kind of "enable text segment processing"
method that needs to be called if the user wants this feature, and if
not enabled then this processing is simply skipped.

Anyway, here's the code for review. All comments welcome.

And by the way, the Digester patch file includes both the mods for this
*and* the mods I posted earlier for implementing a "stack of matches"
which is really separate, and has value on its own. So the changes for
mixed-content support are less than they appear.



View raw message