commons-dev mailing list archives

From Christopher Lenz <cml...@gmx.de>
Subject [digester] [PROPOSAL] More pattern matching flexibility
Date Mon, 02 Sep 2002 13:06:41 GMT
Hello Digester developers (and everyone else),

this is a proposal to extend the matching functionality that Digester 
currently provides. Specifically, the extensions shall provide:

   - matching of elements in different namespaces
   - matching of elements depending on the existence of an attribute
   - matching of elements depending on the value of an attribute
   - matching of elements depending on their body content

To achieve this, Digester would need to undergo some changes that would 
break backwards compatibility. This proposal outlines ways to minimize 
the effects on API clients, and also outlines a 2-step process that 
includes a deprecation-oriented release in the 1.x branch, and an 
implementation of the new matching features in a 2.0 release.

But enough of the overview, let me provide some details...


                      ----- [MOTIVATION] -----

For background: I'm one of the committers on the Slide project. 
Lately I've been experimenting with using Digester in the context of 
parsing WebDAV requests (which - as you might know - contain XML request 
bodies in many cases). The big advantage of Digester in this context is 
the pluggability of rules via rulesets - which is important because 
there are many extensions to WebDAV that I want to keep cleanly 
separated and optional.

Anyway, there are quite a few use-cases in this scenario that hinder 
me in using Digester for those purposes. Most of them can be fixed 
without making big changes to Digester, but some cannot - in particular 
the matching flexibility that is central to this proposal.

As a first example consider the body of a WebDAV PROPFIND request. It 
contains both well-known elements in the WebDAV namespace ("DAV:"), as 
well as elements that are unknown to the server but need to be 
processed. In short, the qualified names of those elements are used to 
uniquely identify resource properties/meta-data.

Example from the spec (RFC 2518):

   <?xml version="1.0" encoding="utf-8" ?>
   <D:propfind xmlns:D="DAV:">
     <D:prop xmlns:R="http://www.foo.bar/boxschema/">
       <R:bigbox/>
       <R:author/>
       <R:DingALing/>
       <R:Random/>
     </D:prop>
   </D:propfind>

To match those elements, I'd need to be able to specify a pattern like 
"D:propfind/D:prop/?:?" (where "?:?" means any element in any namespace, 
as opposed to any element in the rules namespace).

Another example in a different context:

   <?xml version="1.0" encoding="utf-8" ?>
   <D:propertybehavior xmlns:D="DAV:">
     <D:keepalive>*</D:keepalive>
   </D:propertybehavior>

I've got a typesafe-enum-type singleton Propertybehavior.KEEPALIVE_ALL 
that is semantically equal to the entire request body. If I could 
associate a corresponding FactoryCreateRule with the pattern 
"D:propertybehavior/D:keepalive['*']", I'd have a very elegant solution.


                      ----- [PATTERN SYNTAX] -----

I've thought about what kind of constructs the matcher would need to 
support for the above use-cases, and some that I can imagine being 
useful to others.

Here's a somewhat simplified EBNF, divided into BasicPattern and 
ExtendedPattern, similar to RulesBase and ExtendedBaseRules currently in 
Digester:

BasicPattern            :=  TailMatchPattern | ExactMatchPattern
TailMatchPattern        :=  "*/" ExactMatchPattern
ExactMatchPattern       :=  { QName "/" } QName
QName                   :=  [ Prefix ":" ] LocalPart
Prefix                  :=  /* An XML Name, minus the ":" */
LocalPart               :=  /* An XML Name, minus the ":" */

ExtendedPattern         := [ "!" ] BasicPattern [ "/" WildCardQName]
                            [ "[" (AttrPredicate | TextPredicate) "]" ]
WildCardQName           := [ ( "?:" | Prefix ":" ) ] ( "?" | LocalPart )
AttrPredicate           := "@" QName [ "='" AnyText "'" ]
TextPredicate           := "'" AnyText "'"
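
To make the WildCardQName production more concrete, here is a minimal 
sketch in Java of how a single step such as "?:?", "D:?" or "D:prop" 
could be parsed and tested against an element. The class name, the 
prefix map, and the rule that an unprefixed name falls back to "no 
namespace" are assumptions made for illustration only; none of this 
exists in Digester.

```java
import java.util.Map;

// Hypothetical helper, not part of Digester: one WildCardQName step.
final class WildCardQName {
    private final String prefix;    // null = no prefix given, "?" = any namespace
    private final String localPart; // "?" = any local name

    private WildCardQName(String prefix, String localPart) {
        this.prefix = prefix;
        this.localPart = localPart;
    }

    // Splits "prefix:local" at the first colon; the prefix is optional.
    static WildCardQName parse(String text) {
        int colon = text.indexOf(':');
        String prefix = colon < 0 ? null : text.substring(0, colon);
        String localPart = text.substring(colon + 1);
        if (localPart.isEmpty() || (prefix != null && prefix.isEmpty())) {
            throw new IllegalArgumentException("Not a WildCardQName: " + text);
        }
        return new WildCardQName(prefix, localPart);
    }

    // prefixToUri holds the registered prefix / namespace URI pairs.
    boolean matches(Map<String, String> prefixToUri,
                    String namespaceURI, String localName) {
        if (!"?".equals(localPart) && !localPart.equals(localName)) {
            return false;
        }
        if ("?".equals(prefix)) {
            return true; // "?:" accepts any namespace
        }
        String expectedUri = prefixToUri.get(prefix == null ? "" : prefix);
        if (expectedUri == null) {
            if (prefix != null) {
                return false; // an unregistered prefix never matches
            }
            expectedUri = ""; // assumption: unprefixed means no namespace
        }
        return expectedUri.equals(namespaceURI);
    }
}
```

With "D" registered for "DAV:", the step "?:?" would accept the R:bigbox 
element from the PROPFIND example, while "D:?" would not.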

One thing that could be further discussed is whether Digester should 
move to a syntax more similar to XPath. But that's another story...

When compared to XPath, the syntax here has serious restrictions that 
stem from the fact that Digester works on top of SAX and not a 
full-blown in memory object model. So you can only have predicates for 
the attributes or body-text of the current element. There are no 
functions, although custom Rules implementations could probably provide 
some. Etc, etc.


                        ----- [THE PROBLEM] -----

Now on to the discussion of why the above can't be implemented as a 
simple extension to the current Digester (say in ExtendedBaseRules). The 
Rules interface has the method

   Rules.match(String namespaceURI, String pattern) - List

which is used to assemble the list of rules that match the pattern. The 
core problem is that the pattern String is the only information the 
Rules implementation has about the input document. So it can neither 
match by namespace, nor by attributes or body text.

The namespaceURI argument is of no help - it just limits the list of 
returned rules to those that match the specified namespace.
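
The loss of information can be illustrated with a small Java sketch. It 
only mimics, roughly, how a match path is built from element names; the 
class LegacyMatchPath and its methods are hypothetical and are not 
Digester code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration: a match path built from local names only.
final class LegacyMatchPath {
    private final Deque<String> names = new ArrayDeque<>();

    // The namespace URI is available here but never reaches the pattern,
    // and attributes or body text are not recorded at all.
    void startElement(String namespaceURI, String localName) {
        names.addLast(localName);
    }

    void endElement() {
        names.removeLast();
    }

    // The only thing a Rules implementation gets to see.
    String pattern() {
        return String.join("/", names);
    }
}
```

Two elements with the same local names but different namespaces produce 
the same pattern string here, so no Rules implementation could tell them 
apart from the pattern alone.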


                    ----- [PROPOSED SOLUTION] -----

Forgetting about backwards compatibility for a moment (we'll get back to 
that later), what would be a good solution to the limitations outlined above?

My idea is to add a new class named (for example) DigesterContext to the 
game, which would replace the pattern-String (and some more). In more 
detail, it would:

   - keep a stack of fully qualified names of the "open" elements,
   - take over the bodyText stack that is currently directly in Digester,
   - hold a reference to the Attributes of the current element.

The Rules interface would have a method "match(DigesterContext context) 
- List". Then, Rules implementations would have all the above 
information (full stack of qualified names, current body text, current 
attributes) available when performing the matching. There'd also need to 
be a way of registering namespace URI / prefix pairs with the Rules 
implementation, say "addNamespace(String prefix, String namespaceURI) - void".
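
As a rough illustration of what such a context object might look like, 
here is a minimal Java sketch. Only the class name DigesterContext comes 
from the proposal; every method name, the "{uri}local" notation for 
qualified names, and the decision to copy the SAX Attributes are 
assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch of the proposed class; not an existing Digester API.
final class DigesterContext {
    private final Deque<String> openElements = new ArrayDeque<>();
    private final Deque<StringBuilder> bodyTexts = new ArrayDeque<>();
    private Attributes attributes = new AttributesImpl();

    void startElement(String namespaceURI, String localName, Attributes attrs) {
        openElements.push("{" + namespaceURI + "}" + localName);
        bodyTexts.push(new StringBuilder());
        // Copied because a parser may reuse its Attributes object.
        attributes = new AttributesImpl(attrs);
    }

    void characters(char[] ch, int start, int length) {
        bodyTexts.peek().append(ch, start, length);
    }

    void endElement() {
        openElements.pop();
        bodyTexts.pop();
        attributes = new AttributesImpl(); // parent attributes are not kept
    }

    // Qualified names of the open elements, outermost first.
    List<String> path() {
        List<String> result = new ArrayList<>();
        Iterator<String> it = openElements.descendingIterator();
        while (it.hasNext()) {
            result.add(it.next());
        }
        return result;
    }

    // Body text collected so far for the current element.
    String bodyText() {
        return bodyTexts.isEmpty() ? "" : bodyTexts.peek().toString();
    }

    Attributes attributes() {
        return attributes;
    }
}
```

Note that with SAX the body text of an element is only complete once its 
end tag is reached, so a body-text match could presumably not be decided 
before then.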

I'd also like to have access to the DigesterContext from the actual 
Rule. That could of course be solved by adding an accessor to Digester 
itself, although it would be cleaner to change the rule methods for a 
2.0 release.

That was the easier part ;o)


                 ----- [BACKWARDS COMPATIBILITY] -----

My biggest concern in this area is that all of the instance variables of 
Digester are protected. That makes it close to impossible to make any 
changes that are not simple additions without breaking API clients. IMHO 
there would need to be a 1.x release that deprecated direct use of at 
least some of those instance variables.

In the 2.0 release the instance variables would be either made private 
or removed (e.g. the match member, which would have been replaced by the 
DigesterContext).

As I've explained above, there would also be the need to modify the 
Rules interface. There's not much that can be done in that area, of 
course. However, providing a simple AbstractRules class (that doesn't 
implement any matching or caching like RulesBase currently does) in the 
mentioned 1.x release would have the nice effect that anyone who wanted 
to implement the Rules interface would extend AbstractRules. Then, 
modifications of the Rules interface would not be as critical as they 
seem now (but maybe I'm taking this too seriously ;-)).

In the 2.0 release, the method "match(String namespaceURI, 
DigesterContext context) - List" would be added to the Rules interface. 
AbstractRules would implement this method and delegate to the legacy 
"match(String namespaceURI, String pattern) - List" method, by 
converting the DigesterContext to a String that matches what legacy 
Rules implementations would expect.
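
The delegation could look roughly like the following Java sketch. It is 
deliberately simplified: ContextStub stands in for DigesterContext and 
carries only element names, rules are represented by plain strings, and 
the "/"-joined path is an assumption about what legacy implementations 
expect. All class names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for DigesterContext: open element names only.
final class ContextStub {
    private final List<String> names;

    ContextStub(String... names) {
        this.names = Arrays.asList(names);
    }

    List<String> names() {
        return names;
    }
}

// Hypothetical base class; rules are plain strings to keep this short.
abstract class AbstractRulesSketch {
    // Legacy method that existing implementations already provide.
    abstract List<String> match(String namespaceURI, String pattern);

    // New method: flatten the context into the legacy pattern string.
    List<String> match(String namespaceURI, ContextStub context) {
        return match(namespaceURI, String.join("/", context.names()));
    }
}

// A legacy-style implementation that knows nothing about the context.
final class LegacyExactRules extends AbstractRulesSketch {
    private final Map<String, List<String>> byPattern = new HashMap<>();

    void add(String pattern, String ruleName) {
        byPattern.computeIfAbsent(pattern, k -> new ArrayList<>()).add(ruleName);
    }

    @Override
    List<String> match(String namespaceURI, String pattern) {
        return byPattern.getOrDefault(pattern, Collections.emptyList());
    }
}
```

A legacy implementation such as LegacyExactRules keeps working unchanged 
when called through the new method, which is the point of routing the 
change through the abstract base class.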

A branch after the 1.x release (or sooner) would probably be a good idea.

                        ----- [CONCLUSION] -----

I hope I've presented my ideas understandably enough, though I'm afraid 
this thing's gotten a bit long. I volunteer to implement the changes and 
additional functionality discussed above by sending patches and 
test-cases and the usual stuff.

I'd love to hear your comments...

-- 
Christopher Lenz
/=/ cmlenz at gmx.de

