db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel John Debrunner <...@apache.org>
Subject Re: Collation implementation WAS Re: Should COLLATION attribute related code go in BasicDatabase?
Date Fri, 16 Mar 2007 00:34:51 GMT
Mike Matrigali wrote:
> 
> 
> Daniel John Debrunner wrote:
>> Mike Matrigali wrote:
>>
>>>
>>>
>>> Rick Hillegas wrote:
>>>
>>>>>
>>>>> Thanks, Mike. This overhead seems pretty small to me. It's hard for 
>>>>> me to predict whether this is useful generality or over-design.
>>>>>
>>>>> In the SQL standard, collations can be declared per column. That 
>>>>> affects index descriptors. In addition, via CASTs, collations can 
>>>>> be declared per sortable expression in an ORDER BY clause. That 
>>>>> affects the sorter. I'm not the person scratching this initial 
>>>>> itch. I just want to register my instinct to design-in the 
>>>>> generality up front. I think this has two advantages:
>>>>>
>>>>> 1) It will remove an upgrade issue later on when someone wants to 
>>>>> implement more of the SQL collation support.
>>>>>
>>>>> 2) It generally lowers the barrier to implementing more of the 
>>>>> standard.
>>>>>
>>>>> Regards,
>>>>> -Rick
>>>>
>>>>
>>> I am just not sure how comfortable I feel forcing an upgrade issue on a
>>> developer for a particular feature that is not their itch.   Mamta is 
>>> trying to solve single collation database problem, not full SQL 
>>> collation support.
>>
>>
>> There's a number of factors that come in, one is the long term 
>> maintainability of the code. I think that trumps any single 
>> developer's itch. The developer can work with the community in coming 
>> up with a solution that keeps a good balance between what the 
>> community see as maintainability and scratching their itch.
>>
>> I'm actually trying to save the contributor (Mamta) work here, I think 
>> changing all the locations that generate characters to have the 
>> correct "new-character-type" is a huge amount of work and subject to 
>> errors (just from the amount of changes and interesting situations). 
>> E.g. in some situations a literal will be a CHAR (sorting by 
>> ucs_basic) and others a CHAR (sorting by locale). That decision may 
>> not be able to be made until very late in the bind time, and may not 
>> possibly even matter even thought code would have to pick one. Only 
>> caring about this when collation is involved may make it easier.
> 
> I obviously don't know "all the places", so it is not clear to me why 
> some of the places don't have to change.  It is not clear to me why one
> does not in the new proposal have to change all the locations that 
> generate characters to have the correct "new-collation-type".  I think
> this is because I dont understand the runtime usages.  Am I at least
> right about the following locations where we persist the columns.  If
> we get the right info into them when we persist them, then we can get
> the right info into them when we read them back.

Let me see if I can explain the general compile time situations I'm 
thinking about.

Assume a SQL expression where all the types are CHAR.

    f('fred', col1, col2) = col3

Now currently the bind code is going to resolve types in this order:

B1) 'fred' - CHAR
B2) col1 - CHAR
B3) col2 - CHAR
B4) f('fred', col1, col2) - CHAR
B5) col3 - CHAR
B6) result - BOOLEAN

once pass, got the right result! :-)

So with the proposed dual type system we have the two internal character 
types for CHAR:

CHAR(locale) - CHAR with collation for user columns
CHAR(ucs_basic) - CHAR collation for UCS_BASIC for system columns

Now in the proposed dual type system, the same bind ordering will occur.

Let's assume all col1 & col2 are user columns.

So the bind will result in

B1) 'fred' - unknown
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - unknown
B5) col3 - CHAR(locale)
B6) result - unknown - don't know if types can be compared.

So come the end of bind time we haven't resolved the types,
so a second bind phase would be needed, which doesn't exist at the moment.

So what would that second phase do?

Bii1) 'fred' still unknown, no good reason to pick either type. Could 
base it on the other arguments but what if no other character arguments 
or other character arguments are a mix of CHAR(locale) and CHAR(ucs_basic)?
Bii2) f('fred', col1, col2) - unknown, don't know how to look up the 
function without a type
Bii3) result - unknown  - don't know if types can be compared.

Whoops, no progress, fail query.

Now one could introduce a partial type, or a third CHAR type - 
CHAR(unknown), which might solve the problem. First phase would be:

B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) col2 - CHAR(locale)
B4) f('fred', col1, col2) - CHAR(unknown)
B5) col3 - CHAR(locale)
B6) result - BOOLEAN

Hmmmm, so I've got the right result but I've been left with a series of 
CHAR(unknown) in the tree, four options:
   1) add extra bind phases to resolve them, but I think this will fail 
for the same reasons where we didn't have CHAR(unknown).
   2) make CHAR(unknown) a first class internal type, fully supported at 
runtime and compile time.
   3) resolve them in a second bind phase to CHAR(ucs_basic), doesn't 
work because can't compare across collations
   4) resolve them in a smarter second bind phase where the unknown 
types are converted to the matching type in a collation operator and 
CHAR(ucs_basic) elsewhere.

So a possible solution, but note a third character type has been added 
for CHAR and either needs to fully implement an internal type to make 
sure it works in the compile and execution system or I need a second 
bind phase.

To me this seems like it's heading off in the direction of a hack, a 
third character internal type for CHAR? Hacks lead to bugs, bugs lead to 
the dark side :-)

-------------------------------------

So what about having collation as an attribute of a character type, then 
we have:

B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) col2 - CHAR collation=locale
B4) f('fred', col1, col2) - CHAR collation=unknown
B5) col3 - CHAR collation=locale
B6) result - CHAR collation=locale

First pass got the right result, just like the three type CHAR(unknown) 
case. :-) What about those unknown collations, no problem, I know that 
if a collation was unknown then that information is not needed, and 
since they are just the standard CHAR type I already know that works at 
compile and execute time.

[in both the compare unknown collation to known collation the compiler 
would generate code to execute the comparison using the known collation. 
This may happen automatically using the precedence system or compiler 
inserts some promote code, which possibly goes back to the some of the 
methods I proposed earlier. ]

-----------------------------------------------
Ok, double check, let's make the expression

    f('fred', col1, syscol2) = syscol3

Of course the current code is going to bind everything to CHAR.

let's try the multi-type system with CHAR(unknown)

B1) 'fred' - CHAR(unknown)
B2) col1 - CHAR(locale)
B3) syscol2- CHAR(ucs_basic)
B4) f('fred', col1, syscol2) - CHAR(unknown)
B5) syscol3 - CHAR(ucs_basic)
B6) result - BOOLEAN

and similar for the attribute case

B1) 'fred' - CHAR collation=unknown
B2) col1 - CHAR collation=locale
B3) syscol2- CHAR collation=ucs_basic
B4) f('fred', col1, syscol2) - CHAR collation=unknown
B5) syscol3 - CHAR collation=ucs_basic
B6) result - BOOLEAN

---------------------------------------------------
and in case you are wondering why a literal such as 'fred' or a function 
return doesn't just resolve to CHAR(locale), consider these examples:

   'fred' = syscol1 - would fail, can't compare across collations, but 
database meta data queries depend on this behaviour

   'fred' = col1

---------------------------------------------------

and just for kicks

'fred' = 'barney'

Both are unknown collation types, in this case I think the result of 
collation would the default user type, collation=locale.

Sorry for the long e-mail.

Hope this is clear.
Dan.






Mime
View raw message