db-derby-dev mailing list archives

From Army <qoz...@sbcglobal.net>
Subject [PATCH] Initial XML Support
Date Wed, 25 May 2005 19:16:26 GMT
Please find attached the patch for adding initial XML support to Derby.  While 
the patch _is_ over 10k lines, note that most of that comes from two XML files 
that are used in testing.

Comments/details of the patch are included below.  Quoted text is pasted from my 
initial description of the XML support I added, which can be found here:

http://article.gmane.org/gmane.comp.apache.db.derby.devel/3602

----------------------
-- Feature Description.
----------------------

 > When creating the XML datatype, I have done so in such a way as to make
 > it possible to re-work the XML store to something smarter in the
 > future--this textual representation is just an easy "first step" to get
 > things rolling.

I've organized the code so that there's a separation between the XML datatype 
and its "type implementation", where a "type implementation" defines how a 
particular XML value is read/written/processed.  Right now, the only type 
implementation I've written is a UTF8-based one that stores/reads XML just like 
other Derby string types.  More on that below.

There are three primary classes that make up the full XML datatype picture:

1) org.apache.derby.iapi.types.XMLDataValue

An interface defining the minimal methods that every XML data value should 
support.  The methods on this interface correlate to the XML operations that 
I've added--namely, XMLPARSE, XMLSERIALIZE, and XMLEXISTS.

2) org.apache.derby.iapi.types.XML

The XML datatype.  This class implements both the XMLDataValue and the DataType 
interfaces.  For all DataType operations that are common to every XML 
implementation, this XML class does the work.  For DataType operations that 
depend on the particular "XML type implementation" (see below) being used, this 
XML class simply wraps another class that handles implementation-specific 
operations.

3) org.apache.derby.impl.sql.xml.XMLImpl

This is the base class for what I call "XML type implementations" (let's call it 
"XTI" in this email, to save me the effort of typing it).  An XML type 
implementation (XTI) determines how an XML data value is to be written/read 
to/from disk, queried, and stored in memory.  The XMLImpl class defines the 
methods that every XTI (whether UTF8-based or something smarter) must implement.
This class is wrapped by the XML class (#2) above and is used to handle any 
DataType calls that depend directly on the XTI in use.
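
As a rough illustration of the separation described above, here is a minimal Java sketch of a datatype class that does the common work itself and hands implementation-specific work to a wrapped XTI.  Every class and method name in it (XtiSketch, Utf8XtiSketch, XmlValueSketch, and their methods) is a hypothetical stand-in chosen for this example; none of them is taken from the patch, and the real interfaces carry many more methods.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for XMLImpl: what every XTI must provide. */
abstract class XtiSketch {
    abstract void parse(String text);   // accept XML text
    abstract String serialize();        // return the value as a string
    abstract byte[] toDiskFormat();     // implementation-specific storage form
}

/** Hypothetical stand-in for a UTF8-based XTI: keeps the document as text. */
class Utf8XtiSketch extends XtiSketch {
    private String text;

    @Override void parse(String text) { this.text = text; }
    @Override String serialize() { return text; }
    @Override byte[] toDiskFormat() { return text.getBytes(StandardCharsets.UTF_8); }
}

/** Hypothetical stand-in for the XML datatype: common logic plus delegation. */
class XmlValueSketch {
    private final XtiSketch impl;   // the wrapped type implementation
    private boolean isNull = true;  // null handling is common to every XTI

    XmlValueSketch(XtiSketch impl) { this.impl = impl; }

    boolean isNull() { return isNull; }

    void setValue(String text) {
        impl.parse(text);           // implementation-specific
        isNull = false;
    }

    String getString() { return isNull ? null : impl.serialize(); }

    int storedLength() { return isNull ? 0 : impl.toDiskFormat().length; }
}
```

Swapping in a smarter store would then mean passing a different XtiSketch subclass to the constructor, with no change to the wrapper.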

 > The on-disk format that I'm using is a simple textual representation of
 > XML.  In other words, an XML document on disk is really just stored as a
 > UTF-8 character string (similar to other JDBC string types).

I have created a UTF8-based XTI with the class

org.apache.derby.impl.sql.xml.XML_UTF8Impl

which extends XMLImpl.  This class takes the "easy way out" and just wraps XML 
data as an instance of SQLChar.  It reads/writes data in UTF-8, just like other 
Derby string types.  It uses the Xerces parser to parse XML data and to check 
well-formedness, and it uses the XSLT processor from Xalan to query.

This UTF8-based implementation is, of course, far from ideal.  The fact that we 
store XML data on disk as a string means that we have to re-parse it every time 
we want to query it, which has obvious performance issues.  But it was an easy 
"first step" for XML and I hope that future development can replace this with 
something smarter and faster.

In order to add a new XTI, one simply needs to create a class that extends 
"XMLImpl", implement all of the abstract methods, and then add some logic to two 
methods defined on the XMLImpl class.  The comments in that file describe what 
those methods are and what the logic should be.

Note that the APIs used for XML processing ship with the J2SE 1.4 platform (the 
same release that includes JDBC 3.0), and thus are inherently available from the 
1.4.1 JVMs.  In addition, the Xerces parser 
that we use is loaded dynamically at run time, which means that the codeline 
WILL build even if Xerces doesn't exist in the classpath.  That said, though, 
since I use the Xerces parser, anyone who wishes to _use_ XML in Derby will have 
to put Xerces in his/her classpath--this is something we may want to revisit at 
a later date.  Nonetheless, if a user does NOT want to use XML, s/he does NOT 
have to have Xerces in his/her classpath--that's another benefit of loading 
Xerces dynamically: a user who uses Derby for "normal", non-XML reasons is not 
required to have any additional jars in his/her classpath.
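
For readers unfamiliar with the technique, the following is a minimal sketch of loading an optional dependency by name at run time, so that code compiles without the jar and only fails when the feature is actually used.  The class OptionalParserSketch and its methods are hypothetical, and the parser class name passed in main is an assumption used for illustration; the patch may load a different Xerces class or go about it differently.

```java
class OptionalParserSketch {

    /** Returns true if the named class can be found on the classpath. */
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static init
            Class.forName(className, false,
                    OptionalParserSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Fails only when an XML feature is requested and the parser is missing. */
    static void requireParser(String className) {
        if (!isAvailable(className)) {
            throw new IllegalStateException(
                    "XML support needs " + className + " in the classpath");
        }
    }

    public static void main(String[] args) {
        // Assumed class name, for illustration only.
        String parser = "org.apache.xerces.parsers.DOMParser";
        System.out.println(parser + " available: " + isAvailable(parser));
    }
}
```

Because nothing refers to the parser class at compile time, non-XML code paths never touch it.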

 > All of the XML functionality that I've written for Derby is based on the
 > first (ISO approved) and second (still in development) editions of the
 > SQL/XML specification.

This is still true, and as mentioned in some earlier posts, this means that the 
*** XML syntax we use is apt to change *** (especially for the XMLEXISTS operator). 
Anyone using XML in Derby should be aware of this fact.

 > A. Created an XML type that can be both transient (SQL/XML[2003] X010)
 > and persistent (SQL/XML[2003] X016).

Completed as described in my initial email.  Ex:

ij> CREATE TABLE xTable (i INT PRIMARY KEY, x XML);
0 rows inserted/updated/deleted

 > B. Created an XMLPARSE function to parse XML (SQL/XML Feature X061).

Completed as described in my initial email, with one exception.  In my initial 
email, I mentioned that it was up to Xerces to do schema validation at parse 
time.  Since then, I realized that the SQL/XML[2003] spec explicitly states that 
XMLPARSE should NOT validate a document.  Thus, while XMLPARSE _will_ check the 
well-formedness of the document and _will_ parse any associated DTDs to load 
defaults and/or other DTD-related info, it will _not_ perform validation against 
the DTD, nor will it validate against an XML Schema Document.
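
To make the difference between well-formedness checking and validation concrete, here is a minimal sketch using the standard JAXP API with validation switched off.  It is not the code in the patch (which loads Xerces dynamically); the class WellFormedSketch and the method isWellFormed are hypothetical names for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {

    /** Parses the text without validating; false means it is not well-formed. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);   // the DTD is read but not enforced
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal (well-formedness) errors and keeps
            // the parser from printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<simp> doc </simp>"));
        System.out.println(isWellFormed("<simp> doc </oops>"));
    }
}
```

A document whose content breaks its own DTD still passes this check, since only well-formedness is tested.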

Syntax is as follows:

XMLPARSE( DOCUMENT <string-value-expression> PRESERVE WHITESPACE )

Ex:

ij> INSERT INTO xTable VALUES (1, XMLPARSE(DOCUMENT '<simp> doc </simp>' 
PRESERVE WHITESPACE));
1 row inserted/updated/deleted

 > C. Created an XMLSERIALIZE function to serialize an XML value into a
 > string (SQL/XML[2003] Feature X071).

Completed as described in my initial email.  The syntax is:

XMLSERIALIZE( <xml-value-expression> AS <string-data-type> )

Ex:

ij> SELECT i, XMLSERIALIZE(x AS CHAR(20)) FROM xTable;
I          |2
--------------------------------
1          |<simp> doc </simp>

1 row selected

 > D. Created an XMLEXISTS function for simple querying of XML values
 > (SQL/XML[2004] Feature X096).

Completed as described in my initial email.  The syntax is:

XMLEXISTS( <xpath-expression> PASSING BY VALUE <xml-value-expression> )

Note, though, that this is based on the 2004 working draft of the spec, and thus 
** is susceptible to change ** in the future.

Ex:

ij> SELECT i FROM xTable where XMLEXISTS('/simp' PASSING BY VALUE x);
I
-----------
1

1 row selected
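
Conceptually, the check amounts to "does this XPath expression select at least one node in the document?".  The sketch below shows that idea with the standard javax.xml.xpath API (added to Java after this patch was written); it is an assumption-laden illustration, not the patch's approach, which queries through Xalan.  The names XmlExistsSketch and exists are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlExistsSketch {

    /** Returns true if the XPath expression selects at least one node. */
    static boolean exists(String xpathExpr, String xmlText) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The text is re-parsed on every call, much like a string-based
            // XML store has to re-parse each value before querying it.
            NodeList hits = (NodeList) xpath.evaluate(
                    xpathExpr,
                    new InputSource(new StringReader(xmlText)),
                    XPathConstants.NODESET);
            return hits.getLength() > 0;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath or XML input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(exists("/simp", "<simp> doc </simp>"));
    }
}
```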

The details of all of these changes are included in the comments for the files.
I think I've done a pretty thorough job of commenting, but people should let 
me know if they'd like more in any particular area.

----------------------
-- Known issue.
----------------------

In my initial email, I mentioned that I was going to disallow binding to/from an 
XML parameter.  While I have this working for embedded mode, I still need to 
figure out how to enforce this in server mode.  Since the setXXX methods are 
implemented by the client, we need to look for XML parameters at statement 
preparation time and throw compile-time errors.  I was looking at this for a 
while yesterday and, oddly enough, couldn't nail it down--but hopefully I'm just 
missing something small.  Since that's the only issue that I know of with this 
patch, I thought I'd send it out and let people start reviewing it while I look 
at the binding problem.  As a result, anyone who uses the attached patch and 
then tries to bind a parameter to an XML value over the server is going to have 
problems.  But since the goal is to disallow that behavior altogether (in a 
graceful manner, of course), hopefully people can just avoid doing that until I 
have a fix...

----------------------
-- Patch details.
----------------------

Since a built-in datatype tends to affect many areas, the patch modifies a good 
number of files--but note that the changes to most of those files are pretty minor.

The total patch is over 10,000 lines, but more than half of that is the result 
of two 40k XML documents that I've added for the sake of testing.  And most of 
the rest is from new files--so no, that's not 10,000 lines of code changes ;)

I created two new directories.  This means that, since the "patch" command can't 
create directories on its own (at least, not the patch command I use), you may 
need to create the directories manually BEFORE applying the patch.  The new 
directories are:

java/engine/org/apache/derby/impl/sql/xml
java/testing/org/apache/derbyTesting/functionTests/tests/lang/xmlTestFiles

The first directory holds the "XML Type implementation" classes mentioned above, 
along with a build.xml file that is needed so that the XTIs are only built using 
JDK 1.4.  The required XML APIs aren't in JDK 1.3 or prior, so Derby will not 
support XML for 1.3.

The second directory holds a bunch of files used for XML testing.

The results from an "svn stat" are attached to this email along with the patch.

I ran the "derbylang" suite with Sun JDK 1.4.2 on Windows and all of the tests 
passed.  I haven't had a chance to run the full "derbyall" suite yet, but plan 
to do that tonight. Yes, I realize that's very important, and I certainly plan 
to do it ASAP--but I thought it'd be good to get the patch out and have people 
start looking at it.  If there are any failures in "derbyall" when I run it 
locally tonight, I will address them tomorrow.

Feedback is appreciated,
Army
