db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: Another collation question - Derby-1478 and Derby-2377
Date Wed, 16 May 2007 20:36:02 GMT


Laura Stewart wrote:
> As part of adding the new attribute collation=TERRITORY_BASED, I think
> that we need to describe how Derby handles collation.
> 
> I am trying to get my head around the best way to describe collation
> in Derby... for 10.3
> 
> In general terms, a collating sequence is a defined ordering for
> character data that determines whether a particular character sorts
> higher, lower, or the same as another character.  Each character set
> will also have a default collation.
I would also not use "character set".  I would approach documenting it
based on the behavior of datatypes rather than talk about character
sets.  So comparison, ordering, and LIKE processing on CHAR, VARCHAR,
LONG VARCHAR and CLOB are affected.

> 
> In Derby, it is my understanding that all of our string data types are
> represented as Unicode sequences.  Is that correct?
I believe the documentation should speak only to the datatypes rather
than the underlying storage structure.  As for the current
implementation: all operations on character types use either String or
Java char in memory.  JDBC defines how one inputs data into the
datatypes and retrieves data from them.
> 
> We should have a complete list of the data types that are impacted by
> collation.
> CHAR
> VARCHAR
> CLOB ?
I believe it is
CHAR
VARCHAR
LONG VARCHAR
CLOB
> 
> Does Derby support the national character datatypes such as 
> NCHAR/NVARCHAR2?
No.
> 
> FYI - there is a feeling among some in the Internet community that the
> term "character set" is not appropriate.  They tout character code,
> character encoding, or character repertoire.
> 
> Does Derby support specifying codes?  Is that what the attribute
> territory=l_CCI (example territory=es_MX) does?
> 
> Is there a complete listing of the territories that are supported...
> maybe in a Java spec?
Hopefully Mamta can expand here.  I hope that we can define our support
in terms of the standard interfaces we are using from Java to perform
the ordering if a database has been defined to order based on its
territory.

I don't believe 10.3 will change the territories supported; it is the
same set as 10.2 (basically we support what Java supports).  10.3 just
allows collation to be based on territory; all other territory support
is unchanged.
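
To make "we support what Java supports" concrete, here is a minimal
sketch using the standard java.text.Collator class.  It is only an
illustration of locale-based ordering in Java; I am assuming, not
asserting, that this mirrors what Derby does internally.  The class and
method names are made up for the example.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; not part of Derby.
class TerritoryCollationSketch {

    /** Sorts a copy of the input by UTF-16 code unit order (String.compareTo). */
    public static String[] sortByCodeUnit(String[] values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    /** Sorts a copy of the input using the Collator that Java supplies for the locale. */
    public static String[] sortByLocale(String[] values, Locale locale) {
        String[] copy = values.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"b", "a", "B", "A"};
        Locale territory = Locale.forLanguageTag("es-MX"); // same territory as es_MX

        // Code unit order puts all upper case letters before lower case ones.
        System.out.println(Arrays.toString(sortByCodeUnit(words)));

        // Locale order keeps the case variants of a letter next to each other.
        System.out.println(Arrays.toString(sortByLocale(words, territory)));

        // The locales for which this JVM can provide a Collator.
        System.out.println(Collator.getAvailableLocales().length
                + " locales with collation support");
    }
}
```

Collator.getAvailableLocales() is probably the closest thing to a
"complete listing" for a given JVM, but the result can differ between
Java versions and vendors.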
> 
> When you create a database, can you specify that the
> default character set for CHAR columns be ASCII, and the character set
> used for NCHAR be UTF8?
No, there is no such thing.  We are not specifying a character set.  You
specify a territory; this is existing functionality in 10.2.  In 10.3 you
specify at database creation time whether you want collation of all user
character data to be determined by the territory or not.  In the current
implementation it does not change the storage format, but I don't think
that should be part of the documentation.
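
As a rough sketch of what "specify at database creation time" looks
like, the following builds a connection URL from the two attributes
discussed in this thread (territory and collation=TERRITORY_BASED).
The database name and the helper class are hypothetical, and the final
attribute syntax should be checked against the 10.3 documentation.

```java
// Hypothetical helper; only assembles the URL string, it does not connect.
class DerbyCollationUrlSketch {

    /**
     * Builds a create-database URL.  When territoryBased is false the
     * collation attribute is left out and the database default applies.
     */
    public static String createUrl(String dbName, String territory, boolean territoryBased) {
        StringBuilder url = new StringBuilder("jdbc:derby:")
                .append(dbName)
                .append(";create=true")
                .append(";territory=").append(territory);
        if (territoryBased) {
            url.append(";collation=TERRITORY_BASED");
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // "sampleDB" is a made-up database name.
        System.out.println(createUrl("sampleDB", "es_MX", true));
        // With the Derby embedded driver on the classpath, the string would be
        // passed to java.sql.DriverManager.getConnection(url).
    }
}
```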

Do not get confused by what other databases may have to include in such
a change.  Derby has always used Java String/char support, which is
Unicode based, so nothing different is needed to operate on non-ASCII
character data.  How Derby chooses to read/write those characters to
disk is even less important for user interface documentation and could
be changed in the future.  We happen to currently use a modified UTF-8
scheme (modified to support very long strings), but that is never
exposed to a user.
> 
> The Derby documentation mentions code sets, but only with relationship
> to import/export topics or ij sessions...
Right.  The 10.3 functionality does not change any of this; it only
affects the ordering within the server.  Different operating systems
and environments may operate on different codesets outside of Derby, but
once the data has gotten in (through an import, ij, JDBC) it is treated
the same on all systems.  On exit (export, ij, JDBC) the data may then
get transformed to a native codeset.  None of this is affected by the
10.3 collation changes.
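
A small plain-Java illustration of that point, with no Derby code
involved: the same text can arrive in different external codesets, yet
once decoded it is the same String in memory.  The class name and the
sample word are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; shows codeset conversion at the boundary only.
class CodesetRoundTripSketch {

    /** Decodes external bytes into an in-memory (Unicode) String. */
    public static String decode(byte[] external, Charset codeset) {
        return new String(external, codeset);
    }

    public static void main(String[] args) {
        String inMemory = "ma\u00F1ana"; // contains one non-ASCII character

        // Two different external representations of the same text.
        byte[] latin1 = inMemory.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = inMemory.getBytes(StandardCharsets.UTF_8);
        System.out.println(latin1.length + " bytes vs " + utf8.length + " bytes");

        // After decoding, both are equal to the original String.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).equals(inMemory));
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(inMemory));
    }
}
```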
> 
> Any insight that you can provide on this would be appreciated.
> 

