db-derby-dev mailing list archives

From "Mamta Satoor" <msat...@gmail.com>
Subject Re: Collation feature discussion
Date Sun, 18 Mar 2007 04:04:30 GMT
Bryan, I have been out sick for the last couple of days, so I haven't been
able to follow the entire recent thread about collation support, but let me
briefly describe what I have gathered so far.

Let me talk in terms of the SQL datatype CHAR to keep this simple and
contained. Similar rules will apply for the VARCHAR, LONG VARCHAR and CLOB
datatypes.

The main issue is that, as far as the user is concerned, there is one CHAR
datatype when s/he defines their tables. But if they have asked for
territory based collation, then we want these CHAR datatypes to collate
differently than the CHAR datatypes that exist today in 10.2. And if the user
hasn't requested territory based collation, then we want these CHAR
datatypes to collate the way they do in 10.2 today. So, in short, the CHAR
datatype in 10.3 will have different collation behavior depending on what
the user has requested. But as far as the user is concerned, they are just
SQL CHAR datatypes and not new SQL datatypes.
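
To make the difference concrete, here is a minimal Java sketch of the two
orderings. It is only an illustration: it assumes that UCS_BASIC ordering can
be approximated by String.compareTo (which compares UTF-16 code units) and
that territory based ordering behaves like a java.text.Collator for the
chosen locale. The class and method names are hypothetical and are not Derby
code.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationSketch {
    // Approximation of UCS_BASIC: order strings by their character values.
    public static int compareUcsBasic(String a, String b) {
        return a.compareTo(b);
    }

    // Approximation of territory based collation: use the locale's Collator.
    public static int compareTerritory(String a, String b, Locale territory) {
        Collator collator = Collator.getInstance(territory);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // 'a' is U+0061 and 'B' is U+0042, so by character value "a" sorts after "B".
        System.out.println(compareUcsBasic("a", "B") > 0);
        // Under English collation rules, "a" sorts before "B".
        System.out.println(compareTerritory("a", "B", Locale.US) < 0);
    }
}
```

The same pair of CHAR values therefore orders differently depending on which
collation is in effect, which is why the choice has to be known wherever a
comparison happens.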

In my original proposal, I had proposed to introduce a new internal CHAR
datatype which extended the current CHAR datatype in Derby. I was proposing
to implement it by having a new format id associated with the new internal
CHAR datatype. But with my proposal, there is overhead associated with
implementing new getter methods in DataValueFactory for this new internal
datatype, the type compiler associated with the new internal datatype,
etc. The other issue with my proposal was that there are many places in the
code today where we get character datatypes, and all of those cases would
have to be individually investigated to see which CHAR datatype
implementation they should use. So, if the character datatype is getting
instantiated for CHAR columns in system tables, then we should use the
existing CHAR datatype implementation. But if it is getting instantiated for
a user table, then the new internal CHAR datatype should be instantiated.
And there will be places where we can't determine which of the two CHAR
implementations we should use, e.g. a string value in a query such as 'abc'.

What Dan is proposing is that we should keep in mind that CHAR with
territory based collation differs from CHAR with default collation in
only one aspect, i.e. how they are collated. Everything else is the same.
So, as long as we know at collation time which kind of collation we are
dealing with, we should be fine, and hence there is no need to generate new
internal CHAR datatypes. Dan is proposing that at compile time, when we
associate a DataTypeDescriptor (DTD) with a char column, we record what kind
of collation should be associated with that DTD. The collation associated
can be UCS_BASIC/territory based/unknown. Char columns in SYS schemas will
always have UCS_BASIC in the DTD associated with them. Char columns from
user schemas will have UCS_BASIC/territory based depending on what the user
has requested through the COLLATION attribute in the jdbc url at database
create time. Char columns that are not associated with a specific schema
will have their DTD marked with collation as unknown, and later on, at the
actual collation time (e.g. in the like method or the compare methods),
their collation will be determined by the other operand's collation. If the
collation of the other operand is also unknown, then the collation of such
a Char will default to whatever COLLATION attribute the user has requested
at database create time. So, as you can see, collation information will be
saved at the column level in the language layer. Store will follow the same
granularity and it will write the collation type for each and every column
in its metadata (i.e. for char datatypes as well as non-char datatypes).
This collation type will make sense only for char datatypes. For the other
datatypes, the collation type will be ignored.
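
The rules in the paragraph above can be sketched roughly as follows. This is
only my reading of the proposal written out as code, not actual Derby code:
the enum names, the SchemaKind classification and both method names are
hypothetical, and the sketch does not say anything about what should happen
when both operands have known but different collations, since that case is
not covered above.

```java
public class CollationResolutionSketch {
    // The three collation markers a DTD could carry under the proposal.
    public enum CollationType { UCS_BASIC, TERRITORY_BASED, UNKNOWN }

    // Hypothetical classification of where a char value comes from.
    public enum SchemaKind { SYSTEM, USER, NONE }

    // Compile time: decide which collation to record in the DTD.
    public static CollationType collationAtCompileTime(
            SchemaKind schema, CollationType requestedAtCreate) {
        switch (schema) {
            case SYSTEM:
                return CollationType.UCS_BASIC;   // SYS schemas always use UCS_BASIC
            case USER:
                return requestedAtCreate;         // whatever the user asked for at create time
            default:
                return CollationType.UNKNOWN;     // no schema, e.g. a literal like 'abc'
        }
    }

    // Collation time: resolve an operand whose own collation may be unknown.
    public static CollationType effectiveCollation(
            CollationType own, CollationType other, CollationType requestedAtCreate) {
        if (own != CollationType.UNKNOWN) {
            return own;                           // already known from the DTD
        }
        if (other != CollationType.UNKNOWN) {
            return other;                         // borrow the other operand's collation
        }
        return requestedAtCreate;                 // both unknown: use the database setting
    }
}
```

For example, a literal compared against a SYS column would resolve to
UCS_BASIC, while two literals compared against each other would resolve to
whatever was requested at database create time.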

Some of the complexity comes from the fact that a single database can
have 2 different collations associated with its columns, i.e. the SYS schema
will always use UCS_BASIC for its collation, but all the user schemas will
use either UCS_BASIC or territory based collation. If the collation were of
only one type for the entire database, the design/implementation would have
been far easier and we could keep collation information at the database
level rather than the column level.

Bryan, I might have given more information/depth than what you were looking
for, but this is also helping me make sure that I am on the same page as the
rest of the people involved in/following the collation discussion thread.

Dan, Mike, Rick, and others following the threads on collation, please feel
free to pitch in and add more to what I have covered, or correct me if I
have incorrect information.


On 3/16/07, Bryan Pendleton <bpendleton@amberpoint.com> wrote:
> I'm afraid I struggled a bit to follow all the threads
> about the new collation support. Would it be possible for
> someone to briefly summarize:
> - what the major issues are, and
> - what the primary proposals are
> thanks,
> bryan
