Message-ID: <42795AA9.10408@sbcglobal.net>
Date: Wed, 04 May 2005 16:28:41 -0700
From: Army
To: Derby Development
Subject: Initial (simple) XML support in Derby.

I recently spent some time looking at how best to incorporate a basic
level of XML support into the Derby engine--something to serve as a
starting point for what can hopefully grow into a larger area of
development. While I still have a good amount of code
clean-up/finalization/organization to do, I've outlined below the
XML-related functionality that I have working, and am wondering if
anyone out there has any comments/feedback/suggestions to make
regarding what I've done.

Here's a rough outline of what the rest of this email covers:

I) Initial XML Support--at a high level.
II) What I've done.
III) What I have NOT done.
IV) Other things I plan to submit.

------------------------------------------
I) Initial XML Support--at a high level.
------------------------------------------

In short, what I've done is add an XML datatype to the Derby engine
along with a few key functions that allow the use of XML in some very
basic ways.

The on-disk format that I'm using is a simple textual representation of
XML. In other words, an XML document on disk is really just stored as a
UTF-8 character string (similar to other JDBC string types). That said,
though, the XML datatype does NOT extend any of the existing character
types, and is neither comparable to, castable to, nor storable as any
other Derby built-in type. When creating the XML datatype, I have done
so in such a way as to make it possible to re-work the XML store to
something smarter in the future--this textual representation is just an
easy "first step" to get things rolling.

For any XML-specific operations (such as parsing), I am using Xerces
and/or Xalan from the Apache XML Project.
For the basic features that I'm writing, these two projects provide us
with all the functionality we need, and since they are part of the
Apache family, we hopefully won't have to worry about licensing issues.

All of the XML functionality that I've written for Derby is based on
the first (ISO approved) and second (still in development) editions of
the SQL/XML specification. The first edition (July 2003) is available
(for purchase) here:

http://www.iso.org/iso/en/CombinedQueryResult.CombinedQueryResult?queryString=SQL%2FXML

The second edition (2004) can be found both at the above-listed URL and
also here:

http://sqlx.org/SQL-XML-documents/5FCD-14-XML-2004-07.pdf

Note that the second edition is still largely in development; thus,
it's quite possible that some of the work I've done will require
syntactic changes when the second edition is more firmly established.
That's one of the risks of doing an early implementation. That said, I
don't think this should be a problem, since the main development trunk
is explicitly declared as "NOT suitable for production" and thus syntax
changes in the trunk for standards-compliance shouldn't conflict with
any "real-life" production code. See here for that declaration:

http://incubator.apache.org/derby/derby_downloads.html#Development+trunk

--------------------
II) What I've done.
--------------------

A. Created an XML type that can be both transient (SQL/XML[2003] X010)
and persistent (SQL/XML[2003] X016).

I've added a built-in XML datatype for both transient and persistent
uses. As I have written it, the transient XML type (SQL/XML[2003]
Feature X010) can only be created through use of the XMLPARSE()
function, which is discussed further below. The persistent datatype
(SQL/XML[2003] Feature X016) can be declared as part of a column
definition in a CREATE TABLE statement, like any other Derby built-in
datatype.
For now, columns of type XML can only hold XML documents--XML "content"
is not allowed, as the Xerces parser (so far as I know) cannot parse
CONTENT.

Note: in simplified terms, "content" can have multiple elements at the
root level, whereas a "document" can only have a single root element.
See the following links for truer definitions of what is a DOCUMENT and
what is CONTENT:

http://www.w3.org/TR/REC-xml/#sec-well-formed
http://www.w3.org/TR/REC-xml/#dt-content

Once created, an XML value is neither castable to, comparable to, nor
storable as any other Derby datatype. In addition, comparison between
two XML values is not permitted. This is per section 4.2.2 of the
SQL/XML (2003) specification.

B. Created an XMLPARSE function to parse XML (SQL/XML[2003] Feature
X061).

A Derby XML value can be created via the XMLPARSE operator, which I've
based on SQL/XML[2003] feature X061. This operator takes as input an
XML string and returns a Derby XML value. The input string must
constitute an XML document--other XML content is not supported. Any and
all whitespace that is included in the input string will be preserved
by the XMLPARSE operator.

At parse time, XMLPARSE relies entirely on Xerces to check the
well-formedness of the XML document. Similarly, if the XML document
references any external DTDs or XML Schema, those DTDs/schema will have
to be available to the application that is using Derby, and it will be
100% up to Xerces to locate those documents for validation. I do NOT
provide any kind of internal XML schema repository for Derby.

If the input string expression is not a valid XML document (either
because it's not well-formed or because it violates its schema),
XMLPARSE will throw a SQLException with a parsing error.

C. Created an XMLSERIALIZE function to serialize an XML value into a
string (SQL/XML[2003] Feature X071).

A user can view the contents of an XML value via the XMLSERIALIZE
operator, which I've based on SQL/XML[2003] feature X071.
The input to this operator is a single Derby XML value, and the result
is a string value having a type specified by the user. If the type
specified is not one of the existing Derby string types, XMLSERIALIZE
will throw a SQLException with a serialization error.

If the user tries to select an XML value from a table without using the
XMLSERIALIZE function, the result will be a SQLException; XML values
will NOT be implicitly serialized.

D. Created an XMLEXISTS function for simple querying of XML values
(SQL/XML[2004] Feature X096).

A user can query an XML value by using the XMLEXISTS operator, which
I've based on SQL/XML[2004] feature X096. The input to this operator is
an XPath expression and a single Derby XML value, and the result is a
true/false/unknown value: true if at least one node in the target XML
value matches the given XPath expression; false if the target XML value
is non-null but there are zero matching nodes; and unknown if the
target XML value is null.

Right now, the XMLEXISTS operator only works with XPath
expressions--XQuery is not supported. The set of XPath expressions that
are supported by XMLEXISTS is the same set that is supported by Apache
Xalan, as Xalan is the piece that does the actual querying.

--------------------------
III) What I have NOT done.
--------------------------

The following is a list of the kinds of things I have NOT done (and do
not plan to do) for my initial submission:

1. I am NOT handling any XML-specific functionality from the JDBC side
of things.

In other words, I'm creating an XML datatype and the corresponding SQL
syntax/functions to use that datatype, but I am NOT offering any new
JDBC functionality (such as XML binding) to process the datatype. Users
who wish to use XML features from JDBC will have to do so by executing
SQL statements via the normal java.sql.Statement and
java.sql.PreparedStatement classes. Users are not allowed to bind
to/from an XML value.
If they want to create an XML value, they have to prepare an XMLPARSE
statement and then call setString() or setCharacterStream() to bind the
string value. If they want to retrieve the contents of an XML document,
they will have to SELECT using the XMLSERIALIZE operator and then do a
getString() on the result.

2. I am NOT creating any kind of schema repository for Derby.

Any schema lookup/validation is enforced entirely by Xerces. Thus, if
an XML document that is being inserted into Derby requires a specific
DTD or XML Schema, the document must contain the name/URI of that
schema, and it is then up to Xerces to find the schema and perform
validation.

------------------------------------
IV) Other things I plan to submit.
------------------------------------

After I finalize my initial patch and post it to derby-dev (which is
still a couple of weeks out, and could be longer depending on the
feedback I get from this email), I hope to submit a document that
explains, based on my experience with XML, the general steps required
to 1) create a new built-in datatype for the Derby engine, and 2)
create a new built-in function for the Derby engine. Hopefully such a
document will be useful for other Derby developers, should they desire
to do either of these tasks themselves.

Comments/feedback would be much appreciated,
Army
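
Appendix (illustration only, not part of the patch): the document-only
rule from II.B and the true/false/unknown result described for
XMLEXISTS in II.D can be mimicked outside Derby with the JAXP classes
bundled in the JDK. This is a minimal sketch under stated assumptions:
the class and method names (XmlExistsSketch, parseDocument, isDocument,
xmlExists) are made up for illustration, a Java null stands in for SQL
NULL/unknown, and nothing here claims to reflect how the Derby code is
actually organized.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; the names are illustrative and not taken from Derby.
final class XmlExistsSketch {

    // Parses the text as a single XML document. Input that is not a
    // well-formed document (for example, two root elements) makes the
    // parser throw a SAXException.
    static Document parseDocument(String text) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(text)));
    }

    // True if the text is a well-formed XML document, false otherwise.
    static boolean isDocument(String text) throws Exception {
        try {
            parseDocument(text);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    // Three-valued existence test: null (unknown) for a null XML value,
    // TRUE if at least one node matches the XPath expression, else FALSE.
    static Boolean xmlExists(String xpathExpr, String xmlText) throws Exception {
        if (xmlText == null) {
            return null;
        }
        Document doc = parseDocument(xmlText);
        NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(xpathExpr, doc, XPathConstants.NODESET);
        return matches.getLength() > 0;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isDocument("<a><b/></a>"));       // true
        System.out.println(isDocument("<a/><b/>"));          // false
        System.out.println(xmlExists("//b", "<a><b/></a>")); // true
        System.out.println(xmlExists("//c", "<a><b/></a>")); // false
        System.out.println(xmlExists("//b", null));          // null
    }
}
```

Returning a boxed Boolean rather than a primitive is what lets the
sketch keep "unknown" distinct from "false", which is the distinction
II.D draws between a null XML value and a non-null value with zero
matching nodes.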