db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mamta Satoor" <msat...@gmail.com>
Subject Re: Collation feature discussion
Date Thu, 29 Mar 2007 07:21:34 GMT
I think I have been able to interpret SQL spec for collation type of string
literal. Let me know if that interpretation sounds correct.

I am copying some of the relevant infromation about collation from earlier
mail, so everything related to collation type determination is in one
central place and can be understood as a stand alone mail.

In Derby, there will be two character sets which will have identical
character repertoire (UCS) but they may have different collation associated
with them depending on the value of JDBC url attribute COLLATION. The 2
character sets will be
1)USER character set - collation of UCS_BASIC/TERRITORY_BASED depending on
the value of jdbc url attribute COLLATION specified at create database time.
2)SQL_IDENTIFIER character set - collation of UCS_BASIC.

As per SQL spec, Section 11.1<schema definition>, there is an optional
syntax to associate a character set for a schema at create schema time.
Syntax Rule 5 says that if a character set is not specified by the user,
then the character set associated with schema is implementation defined. In
Derby 10.3, system schemas will be associated with SQL_IDENTIFER character
set and all the user schemas will be associated with USER character set.
Futher on, General Rule 3 specifies that the character set associated with
schema is used as the default character set for all <column definitions>.
Based on this, all the user character columns will pick up the collation
associated with USER character set and all the system character columns will
pick the up the collation associated with SQL_IDENTIFIER character set.

The character set specification for string literals is not as well defined
as for <column definitions> but my proposal here will work within SQL spec
boundaries. SQL spec Section 5.3<literal>, Syntax Rule 14b says that if the
character set is not specified for character string literal, then character
string literal's character set will be the character set of the SQL-client
module. Derby does not implement SQL-client module, but definition of
SQL-client module in Section 13.1 says that SQL-client module definition has
mandatory <module name clause> which is defined in Section 13.2 <module name
clause>. The Syntax Rule 4 in this section says that if a character set is
not specified for the SQL-client module, then it's character set is
implementation-defined. I think we can use this implementation-defined
character set for a SQL-client module to our advantage. We can define
Derby's implementation-defined character set for SQL-client module as
current schema's character set and hence the current schema's character set
will become string literal's character set. Hence, if the string literal
reference is made while the current schema is user schema, then the
character set associated with string literal will be USER and if the string
literal reference is made while the current schema is system schema, then
the character set associated with the string literal will be SQL_IDENTIFER.
Going this path will also make sure that our metadata queries don't break.

Some other background information on collation derivation and collation type
from SQL spec : SQL spec talks about character string types having
collationtype and
collation derivation associated with them (SQL spec Section 4.2.2 Comparison
of character strings). If collation derivation says explicit or implicit,
then it means that there is a valid collation type associated with the
charcter string type. If the collation derivation is none, then it means
that collation type can't be established for the character string type. 1)
Collation derivation will be explicit if COLLATE clause has been used for
character string type (this is not a possibility for Derby 10.3, because we
are not planning to support SQL COLLATE clause in this release).
2)Collation derivation will be implicit if the collation can be determined
w/o the COLLATE clause eg CREATE TABLE t1(c11 char(4)) then c11 will have
collation of USER character set. Another eg, TRIM(c11) then the result
character string of TRIM operation will have collation of the operand, c11.
3)Collation derivation will be none if the aggregate methods are dealing
with character strings with different collations (Section 9.3 Data types of
results of aggregations Syntax Rule 3aii).

Following lists the collation type association for various occurences of
character string types and for operations involving character string types
in Derby 10.3
 1)String literal - collation type will be current schema's character set's
collation. The collation derivation will be implicit.
2)<column definition> (of character string type) SQL spec Section
11.1<schema definition>, General Rule 3 specifies that the character
set
associated with schema is used as the default character set for all <column
definitions>. Based on this, character columns in user tables will have USER
character set associated with them and hence collation type of
UCS_BASIC/TERRITORY_BASED. The collation derivation will be implicit.
3)<character string type> (SQL spec section 6.1 <data type> Syntax Rule 3b
and 16) - Rule 3b says that collation type of character string type is the
character set's collation AND rule 16 says that if <character string type>
is not contained in a <column definition>, then an implementation-defined
character set is associated with the <character string type>. We can define
Derby's implementation-defined character set  for such <character string
type> to be current schema's character set. The collation derivation will be
implicit.
4)<cast specification> Section 6.12 Syntax Rule 10 says that if the target
data type of CAST is character string type, then collation type of result
character string of CAST specification is the collation of the character set
of the target data type. Based on 3) above, that will be the collation of
the current schema's character set. The collation derivation will be
implicit.
 5)Result character string types from UPPER, LOWER, TRIM(LTRIM, RTRIM),
SUBSTR will have the same collation as their operand. This comes from SQL
spec Section 6.29 <string value function> Syntax Rules 8, 8, 11d, 4
respectively). The collation derivation will be implicit.
6)CHAR, VARCHAR functions do not look like they are defined in the SQL spec.
But based on 5) above, the result character string type's collation can be
considered same as the first argument's collation type if the first argument
to CHAR/VARCHAR function is a character string type. If the first argument
is not character string type, then the result character string of
CHAR/VARCHAR will have the same collation as current schema's character set.
The collation derivation will be implicit.
7)For user defined functions' that return character string type, the return
type's collation will have the same collation as current schema's character
set. The collation derivation will be implicit.
 8)For CURRENT_USER, SESSION_USER, SYSTEM_USER, SQL spec Section 6.4 Syntax
Rule 4 says that their collation type is the collation of character
set SQL_IDENTIFIER. In Derby's case, that will mean, the collation of these
functions will be UCS_BASIC. The collation derivation will be implicit.
 9)CURRENT SCHEMA and CURRENT SQLID seem to be Derby specific functions, I
didn't find them in the SQL spec. But in order to match their behavior with
the other functions covered in 8) above, their return character string
type's collation will be the collation of character set SQL_IDENTIFIER. The
collation derivation will be implicit.
 10)Aggregate operators involving all character string type
operands(Concatenation, CASE, NULLIF, COALESCE) will follow SQL spec Section
9.3 Data types of results of aggregations. In other words, if all the
operands have the same collation associated with them, then the collation of
result character string type will be same and the collation derivation will
be implicit. But if operands of different collation types are involved, then
the result character string type will have collation derivation of none.
11)And last but not least, the collation methods will follow the rules
defined by SQL spec in Section 9.13 Collation determination Syntax Rules 2
and 3e. In other words, at least one operand shall have a declared type
collation (that means if the comparison is sysChar1|userChar1 >
sysChar2|userChar2, then such a comparison will fail because both sides of >
operator have collation derivation of none after applying the rules of 10
above) AND every operand whose collation derivation is implicit shall have
the same declared type collation (that means if the comparison is sysChar1 >
userChar1WithTerritoryBasedCollation, then such a comparison will fail
because left operand has collation derivation as implicit and collation type
as UCS_BASIC and the right operand has collation derivation implicit and
collation type as TERRITORY_BASED)

 Feedback welcomed. In the mean time, I am going to start working on other
well defined aspects of this project, like creating collation aware classes
for varchar, longvarchar and clob, move the template DataValueDescriptor
creation code using format id out of the Monitor class etc.
Mamta


On 3/26/07, Rick Hillegas <Richard.Hillegas@sun.com> wrote:
>
> Daniel John Debrunner wrote:
> > Rick Hillegas wrote:
> >> Hi Mamta,
> >>
> >> Thanks for this extensive write-up. This helps me puzzle through the
> >> issues although I'm afraid I'm still muddled. Some comments follow
> >> inline.
> >
> > Same here.
> >
> >> Mamta Satoor wrote:
> >>> lots of good stuff ...
> >
> > +1
> >
> >> I think we should avoid violating the SQL standard if we can.
> >
> > +1
> >
> > In looking at this I was wondering if it would be useful if it could
> > be described in terms of SQL Standard constructs how the system
> > columns come to have a different collation. E.g. is it as though they
> > are declared at the column level with a <collate clause>, or is it as
> > though the system schemas are declared with a different character set
> > to the user schemas?
> >
> > Would this make a difference in how a string literal or other
> > character expressions are collated?
> >
> > For string literals
> >
> > 5.3 SR15) says the collation is the character set collation, and then
> > 5.3 SR14b) says the character set is the the character set of the
> > SQL-client module that contains the <character string literal>.
> Here's my interpretation of this. The standard allows you to override
> the default collation bound to a character set. You can override the
> default collation for your whole SQL session or for all of the code in
> a  SQL module. I think that SR14b is trying to describe the behavior
> you'll see if you have explicitly overridden the collation for a
> character set. My hunch would be that this clause does not apply to us
> because we don't allow collations to be overridden at these higher
> levels. For more context, search the spec for feature F693--it's the
> optional feature for allowing these higher level collation overrides.
>
>
> >
> > So here's it's unclear to me what 'SQL-client module' means in a derby
> > context.
> >
> > For a function (or any other declared character type except column
> > definitions) the collation will come from its data type, which goes to
> > 6.1 SR3b) and 6.1 SR16), which says implementation defined character
> set.
> >
> > For a column definition then 11.4 SR10b) specifies the character set
> > as being the schema's character set.
> >
> > Thus Derby could have two character sets:
> >   - USER - UCS repertoire with default collation of UCS_BASIC or
> > UNICODE depending on value of collation JDBC attribute at create
> > database time
> >  - SYSTEM - UCS repertoire with default collation of UCS_BASIC
> >
> > When a schema is created it is implementation defined as to its
> > character set (if one is not defined) 11.1 SR5)
> >
> > So Derby's implementation could be:
> >
> > user schemas have a character set of USER
> > system schemas have a character set of SYSTEM
> >
> > Then ...
> >
> >   - columns would have the required collation ( 11.4 SR10b))
> >   - functions (& others) would have the required collation (Derby's
> > implementation could be to pick the schema character set)
> >
> > which leaves string literals as the issue, what is a 'SQL-client
> module'?
> >
> > Dan.
> >
>
>

Mime
View raw message