db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel John Debrunner <...@apache.org>
Subject Re: Collation feature discussion
Date Thu, 29 Mar 2007 21:54:55 GMT
Mamta Satoor wrote:

> In Derby, there will be two character sets which will have identical 
> character repertoire (UCS) but they may have different collation 
> associated with them depending on the value of JDBC url attribute 
> COLLATION. The 2 character sets will be
> 1)USER character set - collation of UCS_BASIC/TERRITORY_BASED depending 
> on the value of jdbc url attribute COLLATION specified at create 
> database time.
> 2)SQL_IDENTIFIER character set - collation of UCS_BASIC.

In general I think this looks good, thanks for the work on this Mamta 
and a nice summary. I think there are some finer points to be decided 
but the basic design is good. This finer points are on how a collation 
is derived for certain expressions, that doesn't change the overall 
design, just the input to the collation decision making. E.g. all of 
this could be coded and then changing the collation for string literals 
would not affect the rest of the design.
> As per SQL spec, Section 11.1<schema definition>, there is an optional 
> syntax to associate a character set for a schema at create schema time. 
> Syntax Rule 5 says that if a character set is not specified by the user, 
> then the character set associated with schema is implementation defined. 
> In Derby 10.3, system schemas will be associated with SQL_IDENTIFER 
> character set and all the user schemas will be associated with USER 
> character set. Futher on, General Rule 3 specifies that the character 
> set associated with schema is used as the default character set for all 
> <column definitions>. Based on this, all the user character columns will 
> pick up the collation associated with USER character set and all the 
> system character columns will pick the up the collation associated with 
> SQL_IDENTIFIER character set.
> The character set specification for string literals is not as well 
> defined as for <column definitions> but my proposal here will work 
> within SQL spec boundaries. SQL spec Section 5.3<literal>, Syntax Rule 
> 14b says that if the character set is not specified for character string 
> literal, then character string literal's character set will be the 
> character set of the SQL-client module. Derby does not implement 
> SQL-client module, but definition of SQL-client module in Section 13.1 
> says that SQL-client module definition has mandatory <module name 
> clause> which is defined in Section 13.2 <module name clause>. The 
> Syntax Rule 4 in this section says that if a character set is not 
> specified for the SQL-client module, then it's character set is 
> implementation-defined. I think we can use this implementation-defined 
> character set for a SQL-client module to our advantage. We can define 
> Derby's implementation-defined character set for SQL-client module as 
> current schema's character set and hence the current schema's character 
> set will become string literal's character set. 

Interesting. The thing that jumped out at me is that this effectively 
means that the character set for the SQL-client module depends on the 
session's state (its current schema). That just seems strange.

I came across a couple more pieces of information from the SQL Spec that 
may or may not help. :-)

   "An SQL-session has a default character set name" that is 
implementation defined.

  "Set the default character set name for <character string literal>s in 
<preparable statement>s that are prepared
in the current SQL-session ..."

Not sure that helps, but does allow us to pick a single default 
character set for a session (though it couldn't change based upon 
current schema).

Then the SQL session also has a collation, 4.37.3 again:
    "For each character set known to the SQL-implementation, an 
SQL-session has at most one SQL-session collation
for that character set, to be used when the rules of Subclause 9.13, 
‚ÄúCollation determination‚ÄĚ, are applied. There
are no SQL-session collations at the start of an SQL-session."

but that there are "no SQL-session collations at the start" means that 
it's pointless without a <set session collation statement>, i.e. 9.13 
SR3c) does not apply to Derby.

> 3)<character string type> (SQL spec section 6.1 <data type> Syntax Rule 
> 3b and 16) - Rule 3b says that collation type of character string type 
> is the character set's collation AND rule 16 says that if <character 
> string type> is not contained in a <column definition>, then an 
> implementation-defined character set is associated with the <character 
> string type>. We can define Derby's implementation-defined character 
> set  for such <character string type> to be current schema's character 
> set. The collation derivation will be implicit.

Does 3) cover JDBC parameters as well, (ie. ?) where the type of the 
parameter is a character type?

> 6)CHAR, VARCHAR functions do not look like they are defined in the SQL 
> spec. But based on 5) [TRIM etc.] above, the result character string type's 
> collation can be considered same as the first argument's collation type 
> if the first argument to CHAR/VARCHAR function is a character string 
> type. If the first argument is not character string type, then the 
> result character string of CHAR/VARCHAR will have the same collation as 
> current schema's character set. The collation derivation will be implicit.

This approach means that CHAR(varchar_col, 20) behaves differently to
CAST (varchar_col AS CHAR(20)). Not sure if that's good or bad, but they 
might be implemented today using the same code path.


View raw message