db-derby-dev mailing list archives

From "Rick Hillegas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DERBY-6607) Derby is using territory/collation for equality, not just ordering (incorrectly?)
Date Wed, 11 Jun 2014 12:27:01 GMT

    [ https://issues.apache.org/jira/browse/DERBY-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027692#comment-14027692 ]

Rick Hillegas commented on DERBY-6607:
--------------------------------------

Hi Brett,

I agree with Knut that you will need a custom collator for this problem. It sounds like your
out-of-the-box collator isn't subtle enough for Japanese. A collator is supposed to impose
a linear order, and you are up against the anti-symmetric law of linear orders (http://en.wikipedia.org/wiki/Total_order):
if "word1 <= word2" and "word2 <= word1" both hold, then word1 and word2 are equal. Your collator
is asserting both, so Derby treats the two words as equal. One of those assertions has to be
false in order for you to get the behavior you want.
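
To make the anti-symmetry point concrete, here is a minimal sketch using java.text.Collator. The class and method names (AntisymmetryCheck, lessOrEqual, collatesEqual) are made up for illustration and are not part of Derby. The Japanese result is printed rather than stated, because it depends on the collation rules shipped with the JDK in use.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class for illustration only; not part of Derby.
class AntisymmetryCheck {

    /** True when the collator asserts a <= b. */
    public static boolean lessOrEqual(Collator collator, String a, String b) {
        return collator.compare(a, b) <= 0;
    }

    /**
     * True when both "a <= b" and "b <= a" hold. By anti-symmetry this means
     * the collator treats a and b as the same value.
     */
    public static boolean collatesEqual(Collator collator, String a, String b) {
        return lessOrEqual(collator, a, b) && lessOrEqual(collator, b, a);
    }

    public static void main(String[] args) {
        // Collator.getInstance returns a fresh instance each time, so the
        // strength set on one collator does not leak into the other.
        Collator primary = Collator.getInstance(Locale.US);
        primary.setStrength(Collator.PRIMARY);
        Collator tertiary = Collator.getInstance(Locale.US);
        tertiary.setStrength(Collator.TERTIARY);

        // Case differences are ignored at PRIMARY strength but not at TERTIARY.
        System.out.println("cat/CAT equal at PRIMARY:  " + collatesEqual(primary, "cat", "CAT"));
        System.out.println("cat/CAT equal at TERTIARY: " + collatesEqual(tertiary, "cat", "CAT"));

        // The two spellings of "cake" from the issue, written as Unicode escapes.
        String cakeKatakana = "\u30B1\u30FC\u30AD"; // "ke-ki" in katakana
        String cakeHiragana = "\u3051\u30FC\u304D"; // "ke-ki" in hiragana
        Collator japanese = Collator.getInstance(Locale.JAPAN);
        japanese.setStrength(Collator.PRIMARY);
        System.out.println("cake pair equal at PRIMARY (ja_JP): "
                + collatesEqual(japanese, cakeKatakana, cakeHiragana));
    }
}
```

If the last line prints true, the collator is asserting both inequalities for the two spellings, which is the situation described above.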

Hope this helps,
-Rick

> Derby is using territory/collation for equality, not just ordering (incorrectly?)
> ---------------------------------------------------------------------------------
>
>                 Key: DERBY-6607
>                 URL: https://issues.apache.org/jira/browse/DERBY-6607
>             Project: Derby
>          Issue Type: Bug
>          Components: Localization
>    Affects Versions: 10.10.2.0
>            Reporter: Brett Wooldridge
>
> We have a database where we want case-insensitivity, and therefore it was created with
collation=TERRITORY_BASED:PRIMARY.  We have customers in both the United States (en_US) and
in Japan (ja_JP).
> We have an issue in Japan.  Japanese has three character sets: hiragana, katakana, and
kanji.  Hiragana is a phonetic alphabet with 46 letters.  Katakana is an identical phonetic
alphabet with 46 letters, written using different character forms, and used for foreign words
(words adopted from other languages into Japanese).
> Here is the word 'cake' written in katakana: ケーキ (ke- ki)
> Here is the word 'cake' written in hiragana: けーき  (ke- ki)
> In terms of collation (ordering), Japanese consider these to be equal.  So, in the following
Java code, the call to 'compare()' would return 0:
> {code:java}
> Collator collator = Collator.getInstance(Locale.JAPAN);
> collator.setStrength(Collator.PRIMARY);
> return collator.compare("ケーキ", "けーき");
> {code}
> And therein lies the issue.  With respect to _ordering_ they are indeed equivalent; however,
Japanese speakers would consider them distinct (non-equivalent) values.
> When a table is declared with a UNIQUE constraint on a column, or a PRIMARY KEY column,
if 'ケーキ' exists in the table, Derby will throw a unique constraint violation upon an
attempt to insert 'けーき'.
> We need collation=TERRITORY_BASED:PRIMARY or TERRITORY_BASED:SECONDARY for case-insensitivity
_and_ at the same time need these values to be treated as unique.  It is as if {{String.equals()}}
should be used if the _lvalue_ or _rvalue_ of an = operator is Japanese, but should use {{Collator.equals()}}
if both the _lvalue_ and _rvalue_ are "ascii-betical".  The same for constraint checking.
> Is it "correct" for Derby to use the collation to determine value equality, and not just
ordering?
> At the same time, I understand that this is tricky.  Japanese has no "upper-case" and
"lower-case" for hiragana, katakana, or kanji; however, it does use "romaji" (roman characters),
which are essentially ASCII and therefore have case.  Collation is merely used for ordering.
So when TERRITORY_BASED:PRIMARY/SECONDARY is used for Japanese, 'cat' and 'CAT' should be
equivalent but 'ケーキ' and 'けーき' _should not be_.  Unfortunately, there is only one
Collator and it will identify _both_ pairs as equivalent.
> Taking the example further, imagine a database with collation=TERRITORY_BASED:SECONDARY
and a _tags_ table without a unique constraint, containing the following values:
> {code}
> Tag
> -----------------------
> Cat
> cat
> ケーキ
> けーき
> {code}
> The following SQL should delete both cats:
> {code:sql}
> DELETE FROM tags WHERE tag='cAT'
> {code}
> But from the Japanese perspective, the following code would _erroneously_ delete both
cakes:
> {code:sql}
> DELETE FROM tags WHERE tag='ケーキ'
> {code}
> They consider the two expressions of the word cake distinct, but consider the two cats
as equivalent.  The Collator considers them all equivalent.  It is as if {{String.equals()}}
should be used if the _lvalue_ _or_ _rvalue_ of an = operator is Japanese, and use {{Collator.equals()}}
if the _lvalue_ _and_ _rvalue_ are "ascii-betical".
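
The two notions of equality in the DELETE example can be compared outside the database. The following is only a sketch under stated assumptions: it models "WHERE tag = ?" as a filter over an in-memory list, and the class and method names (TagDeleteSketch, matchByCollator, matchByStringEquals) are hypothetical. It does not call Derby and does not claim to reproduce Derby's internal comparison code. Results for the Japanese values are printed rather than stated, since they depend on the JDK's collation rules.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory model of "WHERE tag = ?"; not Derby code.
class TagDeleteSketch {

    /** Rows that would match if '=' is decided by the collator. */
    public static List<String> matchByCollator(List<String> tags, String probe, Collator collator) {
        List<String> matched = new ArrayList<>();
        for (String tag : tags) {
            if (collator.equals(tag, probe)) { // same as collator.compare(tag, probe) == 0
                matched.add(tag);
            }
        }
        return matched;
    }

    /** Rows that would match if '=' were plain String.equals(). */
    public static List<String> matchByStringEquals(List<String> tags, String probe) {
        List<String> matched = new ArrayList<>();
        for (String tag : tags) {
            if (tag.equals(probe)) {
                matched.add(tag);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String cakeKatakana = "\u30B1\u30FC\u30AD"; // "ke-ki" in katakana
        String cakeHiragana = "\u3051\u30FC\u304D"; // "ke-ki" in hiragana
        List<String> tags = Arrays.asList("Cat", "cat", cakeKatakana, cakeHiragana);

        // Roughly corresponds to collation=TERRITORY_BASED:SECONDARY with territory ja_JP.
        Collator secondary = Collator.getInstance(Locale.JAPAN);
        secondary.setStrength(Collator.SECONDARY);

        System.out.println("collator, probe cAT:    " + matchByCollator(tags, "cAT", secondary));
        System.out.println("collator, probe cake:   " + matchByCollator(tags, cakeKatakana, secondary));
        System.out.println("String.equals, cAT:     " + matchByStringEquals(tags, "cAT"));
        System.out.println("String.equals, cake:    " + matchByStringEquals(tags, cakeKatakana));
    }
}
```

Neither filter alone gives the behavior requested in the issue: the collator-based match is case-insensitive for the cats but may also merge the two cakes, while String.equals() keeps the cakes apart but also makes the cats case-sensitive.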


