db-derby-dev mailing list archives

From "Bernt M. Johnsen" <Bernt.John...@Sun.COM>
Subject Re: [jira] Commented: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED
Date Mon, 08 Oct 2007 21:52:57 GMT
>>>>>>>>>>>> Mamta A. Satoor (JIRA) wrote (2007-10-04 13:25:50):
> So, the question is, in say Norwegian, what do we call "AA"? Is it a
> character or something else? The Unicode specification has a concept of
> text elements and characters
> (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf Unicode
> chapter 2 Section 2.1 subtopic "Text Elements, Characters, and Text
> Processes"). Text elements are units in a text and there are several
> kinds of text elements, some of which are grapheme
> clusters ("user-perceived characters"), words, sentences
> etc. Characters are used to represent each of these different types
> of text elements. Grapheme clusters are what the user perceives as a
> single character but they may or may not be single characters
> underneath. For example, "ch" in Slovakian is perceived by the user as a
> single character (i.e. a grapheme cluster) but it is composed of 2
> characters, "c" and "h". Another example would be "AA" in
> Norwegian. Unicode treats "AA" as a grapheme cluster which is
> composed of 2 characters "A" and "A". (Unicode chapter 2 Figure 2.1
> and http://unicode.org/reports/tr29/ Section 1).

The way I understand the Unicode standard, graphemes and grapheme
clusters are solely there for rendering, while characters and combining
characters are there for text processing. Thus, we should not consider
graphemes when we are discussing SQL.

In Norwegian, there is no combining character that makes up "aa", and
thus "aa" is TWO characters. However, for sorting purposes, "aa" is
one text element. For all other purposes it is two text elements.

My conclusion here is that "aa" = "å" is false and "aa" LIKE "å" is
false too, and that CHARACTER_LENGTH("aa") always gives 2.
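
To make the distinction concrete, here is a minimal Java sketch. It is
only an illustration, not Derby code: the class and method names are
made up for this example, and it assumes that CHARACTER_LENGTH counts
Unicode code points. It contrasts "a" followed by COMBINING RING ABOVE
(U+030A), which Unicode does define as canonically equivalent to "å",
with "aa", which has no such equivalence.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Derby.
public class AaVersusARing {

    // Count Unicode code points (assumed here to be what CHARACTER_LENGTH reports).
    static int characterLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Two strings are canonically equivalent if they are equal after NFC normalization.
    static boolean canonicallyEqual(String x, String y) {
        String nx = Normalizer.normalize(x, Normalizer.Form.NFC);
        String ny = Normalizer.normalize(y, Normalizer.Form.NFC);
        return nx.equals(ny);
    }

    public static void main(String[] args) {
        String aa = "aa";              // two separate letters
        String aRing = "\u00E5";       // precomposed "å"
        String aPlusRing = "a\u030A";  // "a" + COMBINING RING ABOVE

        System.out.println(characterLength(aa));                 // two code points
        System.out.println(characterLength(aRing));              // one code point
        System.out.println(canonicallyEqual(aPlusRing, aRing));  // combining sequence composes to "å"
        System.out.println(canonicallyEqual(aa, aRing));         // "aa" never composes to "å"
    }
}
```

Whether a Norwegian collator sorts "aa" together with "å" is a separate
question of collation rules and is deliberately not covered by this
sketch.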

(REMARK: A person with the name "Håkon" may not write his name
"Haakon" and vice versa. The strings are not equal, and it is not the
same name. They are, however, of the same origin (old Norse "Hákonn" I
think), pronounced the same way and they are sorted together).

(REMARK 2 (and not relevant for this discussion): "AA" is not used in
modern Norwegian language. You will only find it in names of persons,
companies and organizations).
