db-derby-dev mailing list archives

From "Mamta Satoor" <msat...@gmail.com>
Subject Re: Another collation question - Derby-1478 and Derby-2377
Date Tue, 22 May 2007 07:14:28 GMT
Knut, I haven't been able to spend much time on your email response, but I
just wanted to share that Derby is not doing anything special for any
character (meaning soft hyphens or other punctuation characters). We just
rely on the RuleBasedCollator provided by the JVM for a given locale and let
that RuleBasedCollator do all the comparisons (the code can be found in
org.apache.derby.iapi.types.WorkHorseForCollatorDatatypes, in the following
methods: the two like methods and the stringCompare method).
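
To make the delegation described above concrete, here is a minimal sketch of sorting purely through a JVM-provided Collator. It is not Derby's code: the class name CollatorSortSketch and the method sortWith are made up for illustration, and the "no-NO" language tag is only an assumption about which locale corresponds to the Norwegian territory. The word list is the one from Knut's message quoted below. The printed order depends on the collation data shipped with the JVM, so no expected output is given.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: this class is hypothetical and not part of Derby.
class CollatorSortSketch {

    // Returns a sorted copy of words; every comparison is delegated to
    // the given collator, and the input list is left untouched.
    static List<String> sortWith(Collator collator, List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        // Assumption: "no-NO" stands in for the Norwegian territory.
        Collator norwegian = Collator.getInstance(Locale.forLanguageTag("no-NO"));

        List<String> words = Arrays.asList(
                "Aalaei", "ekstraarbeid", "ekstrabetaling", "Haase",
                "Hatlen", "Wanvik", "Waagan", "Aavitsland");

        // The resulting order is whatever the JVM's collation rules say.
        for (String word : sortWith(norwegian, words)) {
            System.out.println(word);
        }
    }
}
```

Running this against different JVMs is a cheap way to check whether an unexpected ordering comes from the collator itself rather than from anything Derby does on top of it.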

Mamta

On 5/18/07, Knut Anders Hatlen <Knut.Hatlen@sun.com> wrote:
>
> Mike Matrigali <mikem_app@sbcglobal.net> writes:
>
> > Thanks, I have not written the like tests yet, and am looking for
> > examples like the following where the result under the default
> > system is different under collation vs default that can be added
> > to the junit tests, but have to admit I don't know much about
> > languages other than english.
>
> I know very little about how collation is defined in the standards, but
> I would guess the trickiest part is the character sequences that map
> into a single collation element, like ch in Spanish or aa in the
> Scandinavian languages. Since I happen to know Norwegian fairly well,
> I'll try to present what I would expect, and then perhaps someone else
> could chime in and explain how/if those expectations map into the
> standards (Unicode, SQL, +++). Hopefully, this could also give you some
> ideas on how to write some meaningful tests.
>
> In Norwegian, the character sequence "aa" is to be treated as the single
> letter "å" if it is pronounced identically to "å". Since "a" is the
> first letter of the alphabet and "å" the last letter of the alphabet,
> this has consequences for how words are ordered alphabetically. However,
> not all occurrences of "aa" are pronounced as "å". In fact, today it is
> used this way more or less exclusively in family names. You won't find
> any words in a dictionary where a double a is to be pronounced as "å",
> only in lists of names.
>
> So if you have a word like "ekstraarbeid" (an actual word found in the
> dictionary), it should be listed before "ekstrabetaling" (another actual
> word), even though aa = å > b, because the double a is pronounced as two
> separate a's.
>
> Similarly, in the phone book, you will find "Haase" before "Hatlen" (aa
> in Haase is a long a, hence counted as two letters), but you'll find
> "Wanvik" before "Waagan" (aa in Waagan is pronounced and alphabetized as
> å). This has some funny consequences: the very first name in the phone
> book for Trondheim, Norway is "Aalaei", whereas the last name you find
> in it is "Aavitsland".
>
> So, my expectation is that there is some way to have a list of words
> sorted like this:
>
> Aalaei
> ekstraarbeid
> ekstrabetaling
> Haase
> Hatlen
> Wanvik
> Waagan
> Aavitsland
>
> The way these words are sorted currently with territory based collation
> and Norwegian territory is:
>
> ekstrabetaling
> ekstraarbeid
> Hatlen
> Haase
> Wanvik
> Waagan
> Aalaei
> Aavitsland
>
> I skimmed through the Unicode Collation Algorithm at
> http://unicode.org/reports/tr10/ to find out how this should be
> handled. A paragraph under 3.1.1 Multiple Mappings said:
>
> Any character (such as soft hyphen) that is not completely ignorable
> between two characters of a contraction will cause them to sort as
> separate characters. Thus a soft hyphen can be used to separate and
> cause distinct weighting of sequences such as Slovak ch or Danish aa
> that would normally weight as units.
>
> This sounds like what I need, and placing a soft hyphen between the a's
> that I wanted to be interpreted as two single letters, did indeed give
> me the sorting order I wanted.
>
> However, even though the sorting seems to ignore the soft hyphens
> (actually, it seems to ignore all kinds of punctuation characters),
> string matching does not ignore them, so 'H_ase' does not match
> 'Ha<soft-hyphen>ase' with the LIKE predicate. Is this supposed to be
> possible, that is, to let LIKE regard 'aa' (or 'a<some-special-char>a')
> as two separate yet consecutive letters?
>
> --
> Knut Anders
>
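
The soft hyphen experiment Knut describes can be tried outside Derby with a small sketch like the one below. Everything named here is hypothetical (the class SoftHyphenSketch and its helpers withSoftHyphen and relation), the "no-NO" language tag is an assumption, and the regular expression "H.ase" is only a stand-in for character-by-character matching of the pattern 'H_ase'; it is not how Derby implements LIKE. The collator results are printed rather than asserted because they depend on the JVM's collation rules.

```java
import java.text.Collator;
import java.util.Locale;

// Illustration only: this class is hypothetical and not part of Derby.
class SoftHyphenSketch {

    static final char SOFT_HYPHEN = '\u00AD';

    // Inserts a soft hyphen (U+00AD) before the character at index.
    static String withSoftHyphen(String s, int index) {
        return s.substring(0, index) + SOFT_HYPHEN + s.substring(index);
    }

    // Reports how the collator orders left relative to right.
    static String relation(Collator collator, String left, String right) {
        int c = collator.compare(left, right);
        return c < 0 ? "<" : (c > 0 ? ">" : "=");
    }

    public static void main(String[] args) {
        // Assumption: "no-NO" stands in for the Norwegian territory.
        Collator norwegian = Collator.getInstance(Locale.forLanguageTag("no-NO"));

        String plain = "Haase";
        String split = withSoftHyphen(plain, 2); // "Ha" + U+00AD + "ase"

        // Ordering: does the soft hyphen change where the name sorts?
        System.out.println("Haase        " + relation(norwegian, plain, "Hatlen") + " Hatlen");
        System.out.println("Ha(U+00AD)ase " + relation(norwegian, split, "Hatlen") + " Hatlen");

        // Matching: a one-character wildcard cannot absorb the extra
        // soft hyphen, so the second string has one character too many.
        System.out.println(plain.matches("H.ase")); // true
        System.out.println(split.matches("H.ase")); // false
    }
}
```

The last two lines show why a pattern that consumes exactly one character per wildcard stops matching once an extra code point is inserted, regardless of how the collator weights that code point when sorting.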