db-derby-dev mailing list archives

From Army <qoz...@sbcglobal.net>
Subject Initial (simple) XML support in Derby.
Date Wed, 04 May 2005 23:28:41 GMT

I recently spent some time looking at how to best incorporate a basic level of 
XML support into the Derby engine--something to serve as a starting point for 
what can hopefully grow into a larger area of development.

While I still have a good amount of code clean-up/finalization/organization to 
do, I've outlined below the XML-related functionality that I have working, and 
am wondering if anyone out there has any comments/feedback/suggestions to make 
regarding what I've done.

Here's a rough outline of what the rest of this email covers:

I)   Initial XML Support--at a high level.
II)  What I've done.
III) What I have NOT done.
IV)  Other things I plan to submit.

------------------------------------------
I)   Initial XML Support--at a high level.
------------------------------------------

In short, what I've done is add an XML datatype to the Derby engine along with 
a few key functions that allow the use of XML in some very basic ways.

The on-disk format that I'm using is a simple textual representation of XML.  In 
other words, an XML document on disk is really just stored as a UTF-8 character 
string (similar to other JDBC string types).  That said, though, the XML 
datatype does NOT extend any of the existing character types, and is neither 
comparable to, castable to, nor storable as any other Derby built-in type.

When creating the XML datatype, I have done so in such a way as to make it 
possible to re-work the XML store to something smarter in the future--this 
textual representation is just an easy "first step" to get things rolling.

For any XML-specific operations (such as parsing), I am using Xerces and/or 
Xalan from the Apache XML Project.  For the basic features that I'm writing, 
these two projects provide us with all the functionality we need, and since they 
are part of the Apache family, we hopefully won't have to worry about licensing 
issues.

All of the XML functionality that I've written for Derby is based on the first 
(ISO approved) and second (still in development) editions of the SQL/XML 
specification.  The first edition (July 2003) is available (for purchase) here:

http://www.iso.org/iso/en/CombinedQueryResult.CombinedQueryResult?queryString=SQL%2FXML

The second edition (2004) can be found both at the above-listed URL and also here:

http://sqlx.org/SQL-XML-documents/5FCD-14-XML-2004-07.pdf

Note that the second edition is still under active development; thus, it's quite 
possible that some of the work I've done will require syntactic changes once the 
text of the second edition is more firmly established.  That's one of 
the risks of doing an early implementation.  That said, I don't think 
this should be a problem, since the main development trunk is explicitly declared 
as "NOT suitable for production" and thus syntax changes in the trunk for 
standards-compliance shouldn't conflict with any "real-life" production code. 
See here for that declaration:

http://incubator.apache.org/derby/derby_downloads.html#Development+trunk

--------------------
II)  What I've done.
--------------------

A. Created an XML type that can be both transient (SQL/XML[2003] X010) and 
persistent (SQL/XML[2003] X016).

I've added a builtin XML datatype for both transient and persistent uses.  As I 
have written it, the transient XML type (SQL/XML[2003] Feature X010) can only be 
created through use of the XMLPARSE() function, which is discussed further 
below.  The persistent datatype (SQL/XML[2003] Feature X016) can be declared as 
part of a column definition in a CREATE TABLE statement, like any other Derby 
built-in datatype.

For now, columns of type XML can only hold XML documents--XML "content" is not 
allowed, as the Xerces parser (so far as I know) cannot parse CONTENT.  Note: in 
simplified terms, "content" can have multiple elements at the root level, 
whereas a "document" must have exactly one root element.  See the following 
links for more precise definitions of what is a DOCUMENT and what is CONTENT:

http://www.w3.org/TR/REC-xml/#sec-well-formed
http://www.w3.org/TR/REC-xml/#dt-content
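
To make the DOCUMENT/CONTENT distinction concrete, here is a minimal sketch that uses the JAXP parser bundled with the JDK rather than Derby itself. The class XmlDocumentCheck and its isDocument() method are hypothetical names made up for illustration; nothing here is taken from the Derby patch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper (not part of Derby): reports whether a string
// parses as a well-formed XML *document*, i.e. has exactly one root element.
final class XmlDocumentCheck {

    static boolean isDocument(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so errors are thrown, not printed.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            // Not well-formed as a document (for example, two root elements).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isDocument("<a><b/></a>")); // one root: a document
        System.out.println(isDocument("<a/><b/>"));    // two roots: content only
    }
}
```

A string such as "<a/><b/>" is legitimate XML content but is rejected here because a document parser requires a single root element.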

Once created, an XML value is neither castable to, comparable to, nor storable 
as any other Derby datatype.  In addition, comparison between two XML values is 
not permitted.  This is per section 4.2.2 of the SQL/XML (2003) specification.

B. Created an XMLPARSE function to parse XML (SQL/XML[2003] Feature X061).

A Derby XML value can be created via the XMLPARSE operator, which I've based on 
SQL/XML[2003] feature X061.

This operator takes as input an XML string and returns a Derby XML value.  The 
input string must constitute an XML document--other XML content is not 
supported.  Any and all whitespace that is included in the input string will be 
preserved by the XMLPARSE operator.

At parse time, XMLPARSE relies entirely on Xerces to check the well-formedness 
of the XML document.  Similarly, if the XML document references any external 
DTDs or XML Schema, those DTDs/schema will have to be available to the 
application that is using Derby, and it will be 100% up to Xerces to locate 
those documents for validation.  I do NOT provide any kind of internal XML 
schema repository for Derby.

If the input string expression is not a valid XML document (either because it's 
not well-formed or because it violates its schema), XMLPARSE will throw a 
SQLException with a parsing error.
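
The difference between "not well-formed" and "violates its schema" can be sketched with the JDK's JAXP parser and an inline DTD. This is only an illustration under assumptions: the class XmlValidityCheck is hypothetical, and whether the Derby code turns on validation in this particular way is not stated above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper (not part of Derby): parses with DTD validation on
// and treats both well-formedness and validity errors as failures.
final class XmlValidityCheck {

    static boolean isValid(String text) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // enable DTD validation
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors arrive via error(); by default they are only
            // printed, so rethrow them to make the parse fail.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE note [<!ELEMENT note (to)> <!ELEMENT to (#PCDATA)>]>";
        System.out.println(isValid(dtd + "<note><to>Derby</to></note>"));    // matches the DTD
        System.out.println(isValid(dtd + "<note><from>Army</from></note>")); // well-formed, but invalid
    }
}
```

The second input is well-formed XML, yet it fails because the DTD does not declare a "from" element. An external DTD would additionally have to be reachable by the parser at parse time.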

C. Created an XMLSERIALIZE function to serialize an XML value into a string 
(SQL/XML[2003] Feature X071).

A user can view the contents of an XML value via the XMLSERIALIZE operator, 
which I've based on SQL/XML[2003] feature X071.

The input to this operator is a single Derby XML value, and the result is a 
string value having a type specified by the user.  If the type specified is not 
one of the existing Derby string types, XMLSERIALIZE will throw a SQLException 
with a serialization error.

If the user tries to select an XML value from a table without using the 
XMLSERIALIZE function, the result will be a SQLException; XML values will NOT be 
implicitly serialized.

D. Created an XMLEXISTS function for simple querying of XML values 
(SQL/XML[2004] Feature X096).

A user can query an XML value by using the XMLEXISTS operator, which I've based 
on SQL/XML[2004] feature X096.

The input to this operator is an XPath expression and a single Derby XML value, 
and the result is a true/false/unknown value: true if at least one node in the 
target XML value matches the given XPath expression; false if the target XML 
value is non-null but there are zero matching nodes; and unknown if the target 
XML value is null.

Right now, the XMLEXISTS operator only works with XPath expressions--XQuery is 
not supported.  The set of XPath expressions that are supported by XMLEXISTS is 
the same set that is supported by Apache Xalan, as Xalan is the piece that does 
the actual querying.
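
The true/false/unknown rule described above can be sketched outside of Derby with the JDK's javax.xml.xpath API. The class XmlExistsSketch and the method xmlExists() are hypothetical names, and a null Boolean is used here as a stand-in for SQL UNKNOWN; this is not the Xalan-based code from the patch.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Derby): mimics the three-valued
// result of XMLEXISTS for an XPath expression and an XML string.
final class XmlExistsSketch {

    /** Returns TRUE, FALSE, or null (standing in for SQL UNKNOWN). */
    static Boolean xmlExists(String xpathExpr, String xmlText) {
        if (xmlText == null) {
            return null; // null XML value: result is unknown
        }
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(
                xpathExpr,
                new InputSource(new StringReader(xmlText)),
                XPathConstants.NODESET);
            // true if at least one node matches, false if none do
            return matches.getLength() > 0;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath or XML input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(xmlExists("//b", "<a><b/></a>")); // at least one match
        System.out.println(xmlExists("//c", "<a><b/></a>")); // no matches
        System.out.println(xmlExists("//b", null));          // null input
    }
}
```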

--------------------------
III) What I have NOT done.
--------------------------

The following is a list of the kinds of things I have NOT done (and do not plan 
to do) for my initial submission:

1. I am NOT handling any XML-specific functionality from the JDBC side of 
things.  In other words, I'm creating an XML datatype and the corresponding SQL 
syntax/functions to use that datatype, but I am NOT offering any new JDBC 
functionality (such as XML binding) to process the datatype.  Users who wish to 
use XML features from JDBC will have to do so by executing SQL statements via 
the normal java.sql.Statement and java.sql.PreparedStatement classes.

Users are not allowed to bind to/from an XML value.  If they want to create an 
XML value, they have to prepare an XMLPARSE statement and then call setString() 
or setCharacterStream() to bind the string value.  If they want to retrieve the 
contents of an XML document, they will have to SELECT using the XMLSERIALIZE 
operator and then do a getString() on the result.

2. I am NOT creating any kind of schema repository for Derby.  Any schema 
lookup/validation is enforced entirely by Xerces.  Thus, if an XML document that 
is being inserted into Derby requires a specific DTD or XML Schema, the document 
must contain the name/URI of that schema, and it is then up to Xerces to find 
the schema and perform validation.
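
As a rough illustration of item 1, the JDBC round trip might look like the sketch below. The SQL text is an assumption modeled on the SQL/XML drafts (this email does not spell out the exact syntax, and the CAST around the parameter may or may not be needed), and the table "docs" with an INT "id" column and an XML "body" column is hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch: XML values go in and out as strings, via
// XMLPARSE on the way in and XMLSERIALIZE on the way out.
final class XmlJdbcSketch {

    // Assumed syntax; the table and column names are made up.
    static final String INSERT_SQL =
        "INSERT INTO docs (id, body) VALUES "
        + "(?, XMLPARSE(DOCUMENT CAST(? AS CLOB) PRESERVE WHITESPACE))";

    static final String SELECT_SQL =
        "SELECT XMLSERIALIZE(body AS CLOB) FROM docs WHERE id = ?";

    /** Binds the XML text as an ordinary string; XMLPARSE converts it. */
    static void insert(Connection conn, int id, String xmlText) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setInt(1, id);
            ps.setString(2, xmlText);
            ps.executeUpdate();
        }
    }

    /** Reads the document back as a string; returns null if no row matches. */
    static String fetch(Connection conn, int id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SELECT_SQL)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```

Selecting the "body" column directly, without XMLSERIALIZE, would be expected to fail with a SQLException, since XML values are not implicitly serialized.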

------------------------------------
IV)  Other things I plan to submit.
------------------------------------

After I finalize my initial patch and post it to derby-dev (which is still a 
couple of weeks out, and could be longer depending on the feedback I get from 
this email), I hope to submit a document that explains, based on my experience 
with XML, the general steps required to 1) create a new built-in datatype for 
the Derby engine, and 2) create a new built-in function for the Derby engine. 
Hopefully such a document will be useful for other Derby developers, should they 
desire to do either of these tasks themselves.

Comments/feedback would be much appreciated,
Army

