db-derby-dev mailing list archives

From Knut Anders Hatlen <Knut.Hat...@Sun.COM>
Subject Re: Another collation question - Derby-1478 and Derby-2377
Date Fri, 18 May 2007 23:37:56 GMT
Mike Matrigali <mikem_app@sbcglobal.net> writes:

> Thanks, I have not written the like tests yet, and am looking for
> examples like the following where the result under the default
> system is different under collation vs default that can be added
> to the junit tests, but have to admit I don't know much about
> languages other than english.

I know very little about how collation is defined in the standards, but
I would guess the trickiest part is the character sequences that map
into a single collation element, like ch in Spanish or aa in the
Scandinavian languages. Since I happen to know Norwegian fairly well,
I'll try to present what I would expect, and then perhaps someone else
could chime in and explain how/if those expectations map into the
standards (Unicode, SQL, etc.). Hopefully, this could also give you some
ideas on how to write some meaningful tests.

In Norwegian, the character sequence "aa" is to be treated as the single
letter "å" if it is pronounced identically to "å". Since "a" is the
first letter of the alphabet and "å" the last letter of the alphabet,
this has consequences for how words are ordered alphabetically. However,
not all occurrences of "aa" are pronounced as "å". In fact, today "aa" is
used for "å" more or less exclusively in family names. You won't find
any words in a dictionary where a double a is to be pronounced as "å",
only in lists of names.

So if you have a word like "ekstraarbeid" (an actual word found in the
dictionary), it should be listed before "ekstrabetaling" (another actual
word), even though aa = å > b, because the double a is pronounced as two
separate a's.

Similarly, in the phone book, you will find "Haase" before "Hatlen" (aa
in Haase is a long a, hence counted as two letters), but you'll find
"Wanvik" before "Waagan" (aa in Waagan is pronounced and alphabetized as
å). One funny consequence is that the very first name in the phone book
for Trondheim, Norway is "Aalaei", whereas the very last name in it is
"Aavitsland".

So, my expectation is that there is some way to have a list of words
sorted like this:

Aalaei
ekstraarbeid
ekstrabetaling
Haase
Hatlen
Wanvik
Waagan
Aavitsland

The way these words are currently sorted with territory-based collation
and the Norwegian territory is:

ekstrabetaling      
ekstraarbeid        
Hatlen              
Haase               
Wanvik              
Waagan              
Aalaei              
Aavitsland          
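
For anyone who wants to reproduce this outside Derby, here is a minimal
sketch. It assumes that territory-based collation delegates to
java.text.Collator for the territory's locale (an assumption here, not
something stated above). The class name TerritoryCollationSketch and the
helper sortWith are hypothetical. The resulting order depends on the
collation rules shipped with the JDK in use, so no expected output is
shown.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class TerritoryCollationSketch {

    // Returns a sorted copy of the words, ordered by the JDK collator for the locale.
    static List<String> sortWith(Locale locale, List<String> words) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Aalaei", "ekstraarbeid", "ekstrabetaling", "Haase",
                "Hatlen", "Wanvik", "Waagan", "Aavitsland");
        // "no-NO" is the assumed locale for the Norwegian territory.
        for (String w : sortWith(Locale.forLanguageTag("no-NO"), words)) {
            System.out.println(w);
        }
    }
}
```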

I skimmed through the Unicode Collation Algorithm at
http://unicode.org/reports/tr10/ to find out how this is supposed to be
handled. A paragraph under 3.1.1 Multiple Mappings said:

  Any character (such as soft hyphen) that is not completely ignorable
  between two characters of a contraction will cause them to sort as
  separate characters. Thus a soft hyphen can be used to separate and
  cause distinct weighting of sequences such as Slovak ch or Danish aa
  that would normally weight as units.

This sounds like what I need, and placing a soft hyphen between the a's
that I wanted to be interpreted as two single letters did indeed give
me the sorting order I wanted.
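
The effect of the soft hyphen can be illustrated with a toy model. This
is a sketch under loud assumptions: it is not the Unicode Collation
Algorithm and not what Derby or java.text.Collator do internally. It
simply treats every "aa" as "å" unless a soft hyphen (U+00AD) separates
the two a's, then compares letter by letter in Norwegian alphabet order.
The class AaContractionSketch and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Toy model only: a hand-rolled stand-in for a collator with an "aa" contraction.
class AaContractionSketch {
    static final String SHY = "\u00AD"; // soft hyphen
    // Norwegian alphabet: a..z followed by ae-ligature, o-with-stroke, a-with-ring.
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00E6\u00F8\u00E5";

    // Contract "aa" to a-with-ring first; "a<SHY>a" is untouched because the
    // soft hyphen sits between the a's. Then drop the soft hyphens themselves.
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT)
                   .replace("aa", "\u00E5")
                   .replace(SHY, "");
    }

    static final Comparator<String> TOY_NORWEGIAN = (x, y) -> {
        String a = normalize(x);
        String b = normalize(y);
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = ALPHABET.indexOf(a.charAt(i)) - ALPHABET.indexOf(b.charAt(i));
            if (d != 0) return d;
        }
        return a.length() - b.length(); // a prefix sorts before the longer word
    };

    static List<String> sorted(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(TOY_NORWEGIAN);
        return copy;
    }

    public static void main(String[] args) {
        // Soft hyphens mark the double a's that are two separate letters.
        List<String> marked = Arrays.asList(
                "ekstrabetaling", "ekstra" + SHY + "arbeid", "Hatlen", "Ha" + SHY + "ase",
                "Wanvik", "Waagan", "A" + SHY + "alaei", "Aavitsland");
        for (String w : sorted(marked)) {
            System.out.println(w.replace(SHY, ""));
        }
        // prints: Aalaei, ekstraarbeid, ekstrabetaling, Haase, Hatlen,
        //         Wanvik, Waagan, Aavitsland (one per line)
    }
}
```

Without the soft hyphens, the same toy comparator contracts every "aa"
and produces the order currently observed with the Norwegian territory.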

However, even though the sorting seems to ignore the soft hyphens
(actually, it seems to ignore all kinds of punctuation characters),
string matching does not ignore them, so 'H_ase' does not match
'Ha<soft-hyphen>ase' with the LIKE predicate. Is this supposed to be
possible, that is, to let LIKE regard 'aa' (or 'a<some-special-char>a')
as two separate yet consecutive letters?
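
The mismatch can be shown with a plain character-by-character model of
LIKE. This is a sketch, assuming that '_' matches exactly one character
and '%' any run of characters, with no ESCAPE support and no collation
awareness; it is not Derby's LIKE implementation. The class
LikeSoftHyphenSketch and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

// Toy model of LIKE, for illustration only.
class LikeSoftHyphenSketch {
    static final String SHY = "\u00AD"; // soft hyphen

    // Translates a LIKE pattern to a regex: '_' is one character, '%' is any run.
    static Pattern likeToRegex(String like) {
        StringBuilder sb = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '_') {
                sb.append('.');
            } else if (c == '%') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    static boolean like(String value, String pattern) {
        return likeToRegex(pattern).matcher(value).matches();
    }

    static String stripSoftHyphens(String s) {
        return s.replace(SHY, "");
    }

    public static void main(String[] args) {
        String stored = "Ha" + SHY + "ase";
        System.out.println(like(stored, "H_ase"));                   // false: the soft hyphen is a sixth character
        System.out.println(like(stored, "H__ase"));                  // true: one '_' is consumed by the soft hyphen
        System.out.println(like(stripSoftHyphens(stored), "H_ase")); // true once the soft hyphen is removed
    }
}
```

Stripping the soft hyphen before matching gives the wanted result in
this model, but it also throws away the marker that made the sort order
come out right, which is the tension the question above is about.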

-- 
Knut Anders
