Return-Path: Delivered-To: apmail-db-derby-dev-archive@www.apache.org Received: (qmail 51465 invoked from network); 26 Mar 2007 20:38:19 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 26 Mar 2007 20:38:19 -0000 Received: (qmail 76749 invoked by uid 500); 26 Mar 2007 20:38:26 -0000 Delivered-To: apmail-db-derby-dev-archive@db.apache.org Received: (qmail 76723 invoked by uid 500); 26 Mar 2007 20:38:26 -0000 Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: Delivered-To: mailing list derby-dev@db.apache.org Received: (qmail 76714 invoked by uid 99); 26 Mar 2007 20:38:26 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 26 Mar 2007 13:38:26 -0700 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests=UNPARSEABLE_RELAY X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: local policy) Received: from [192.18.1.36] (HELO gmp-ea-fw-1.sun.com) (192.18.1.36) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 26 Mar 2007 13:38:17 -0700 Received: from d1-emea-09.sun.com ([192.18.2.119]) by gmp-ea-fw-1.sun.com (8.13.6+Sun/8.12.9) with ESMTP id l2QKbtGK000417 for ; Mon, 26 Mar 2007 20:37:55 GMT Received: from conversion-daemon.d1-emea-09.sun.com by d1-emea-09.sun.com (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) id <0JFJ001012KA7W00@d1-emea-09.sun.com> (original mail from Bernt.Johnsen@Sun.COM) for derby-dev@db.apache.org; Mon, 26 Mar 2007 21:37:55 +0100 (BST) Received: from localhost (76.84-48-199.nextgentel.com [84.48.199.76]) by d1-emea-09.sun.com (Sun Java System Messaging Server 6.2-6.01 (built Apr 3 2006)) with ESMTPSA id <0JFJ007NL2N66150@d1-emea-09.sun.com> for derby-dev@db.apache.org; Mon, 26 Mar 2007 21:37:55 +0100 (BST) Date: Mon, 26 Mar 2007 22:37:53 +0200 From: "Bernt M. Johnsen" Subject: Re: Collation feature discussion In-reply-to: <46081DB3.9070506@sun.com> Sender: Bernt.Johnsen@Sun.COM To: derby-dev@db.apache.org Message-id: <20070326203753.GA14445@localhost.localdomain> Organization: Sun Microsystems MIME-version: 1.0 Content-type: multipart/signed; boundary=AqsLC8rIMeq19msA; protocol="application/pgp-signature"; micalg=pgp-sha1 Content-disposition: inline References: <46035E1E.9020701@apache.org> <4603DAD0.5010208@sun.com> <4607ED0E.8060805@sun.com> <4607F76D.1090200@apache.org> <46081DB3.9070506@sun.com> User-Agent: Mutt/1.5.11 X-Virus-Checked: Checked by ClamAV on apache.org --AqsLC8rIMeq19msA Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable I think that the notion of character sets is a historical issue from the time of 7 and 8-bit bytes in character strings. In that world, the notion of character sets were important for users outside the US ascii sphere. But, Derby is implemented in Java and the interface (the JDBC-drivers) to Derby is written in Java, and in Java strings and characters are Unicode, so we have no use for different character sets. The conversion of the various character sets in the world to Unicode happens in the outer interfaces of the applications using Derby and is of no concert to us as Derby developers. (Note that Java is defined to be UTF-16 which covers Unicode codepoints outside the 16-bit range too (useful for database wit hieroglyphs or cuneiform texts.... ;-)). Other Character sets than Unicode should be considered as pre-Java relics (together with NCHAR/NVARCHAR/NCLOB which in Derby should be synonyms to CHAR/VARCHAR/CLOB). We should stick to Unicode only. To do otherwise will complicate code and documentation without giving the users a better tool. Bernt >>>>>>>>>>>> Roy Lyseng wrote (2007-03-26 21:23:31): >=20 >=20 > Daniel John Debrunner wrote: > >Rick Hillegas wrote: > >>Hi Mamta, > >> > >>Thanks for this extensive write-up. This helps me puzzle through the=20 > >>issues although I'm afraid I'm still muddled. Some comments follow=20 > >>inline. > > > >Same here. > > > >>Mamta Satoor wrote: > >>>lots of good stuff ... > > > >+1 > > > >>I think we should avoid violating the SQL standard if we can. > > > >+1 > > > >In looking at this I was wondering if it would be useful if it could be= =20 > >described in terms of SQL Standard constructs how the system columns=20 > >come to have a different collation. E.g. is it as though they are=20 > >declared at the column level with a , or is it as though= =20 > >the system schemas are declared with a different character set to the=20 > >user schemas? > > > >Would this make a difference in how a string literal or other character= =20 > >expressions are collated? > > > >For string literals > > > >5.3 SR15) says the collation is the character set collation, and then=20 > >5.3 SR14b) says the character set is the the character set of the=20 > >SQL-client module that contains the . > > > >So here's it's unclear to me what 'SQL-client module' means in a derby= =20 > >context. > > > >For a function (or any other declared character type except column=20 > >definitions) the collation will come from its data type, which goes to= =20 > >6.1 SR3b) and 6.1 SR16), which says implementation defined character set. > > > >For a column definition then 11.4 SR10b) specifies the character set as= =20 > >being the schema's character set. > > > >Thus Derby could have two character sets: > > - USER - UCS repertoire with default collation of UCS_BASIC or UNICODE= =20 > >depending on value of collation JDBC attribute at create database time > > - SYSTEM - UCS repertoire with default collation of UCS_BASIC >=20 > > > >When a schema is created it is implementation defined as to its=20 > >character set (if one is not defined) 11.1 SR5) > > > >So Derby's implementation could be: > > > >user schemas have a character set of USER > >system schemas have a character set of SYSTEM > > > >Then ... > > > > - columns would have the required collation (11.4 SR10b)) > > - functions (& others) would have the required collation (Derby's=20 > >implementation could be to pick the schema character set) > > > >which leaves string literals as the issue, what is a 'SQL-client module'? > > > >Dan. > > > I think that you should carefully consider the implications of using two= =20 > character sets. Among other things, it means that two strings with=20 > different character sets are not immediately comparable. And as far as I= =20 > know, this applies to literals as well. What this means (I think) is=20 > that if columns in system tables are defined with character set SYSTEM,= =20 > columns in user-defined tables are defined with character set USER,=20 > and literals are of type USER, then you cannot immediately compare=20 > literals with the character columns in the system tables. >=20 > Another option is to use one character set, but use different collations= =20 > for different types of tables. You may define that character columns in= =20 > system tables are created using collation UCS_BASIC, while all user=20 > tables are created with a user-defined collation. Because all columns=20 > are defined using the same character set, all columns and literals will= =20 > be comparable. >=20 > Just remember that when comparing two strings with different defined=20 > collations, you need to consider the collation rules defined by the SQL= =20 > standard. >=20 > Roy --=20 Bernt Marius Johnsen, Database Technology Group,=20 Staff Engineer, Technical Lead Derby/Java DB Sun Microsystems, Trondheim, Norway --AqsLC8rIMeq19msA Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Digital signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2.2 (GNU/Linux) iD8DBQFGCC8hlFBD9TXBAPARApbmAJ9gGZu0w22QjUch1yuxoV5SgwOMAwCgyh8s SDBbRe0DcRugyHWRuYXlk+8= =SxpF -----END PGP SIGNATURE----- --AqsLC8rIMeq19msA--