db-derby-dev mailing list archives

From "Bernt M. Johnsen" <Bernt.John...@Sun.COM>
Subject Re: Language based matching
Date Tue, 11 Jul 2006 09:24:34 GMT
"aa" as one letter was removed from the Norwegian language in 1938 ("å"
had been optional since 1917). It is only used in names today and it is
true what Anders says about the phonebook (also about the foreign names
where "aa" is treated like two letters). I think it would be unwise to
prevent "a.*" from matching "Aasen" (which in modern writing would be Åsen).
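
To illustrate what "a.*" matching "Aasen" could look like when "aa" is kept as two characters, here is a minimal sketch. The class and method names are hypothetical and not part of Derby; it assumes a java.text.Collator set to PRIMARY strength so that case differences are ignored, and it compares only the leading characters of the value against the prefix.

```java
import java.text.Collator;
import java.util.Locale;

class PrefixMatch {
    // Hypothetical helper, not a Derby API: true if 'value' begins with
    // 'prefix' according to the given collator. The comparison is done on
    // the first prefix.length() characters, so "aa" stays two characters.
    static boolean startsWith(Collator collator, String value, String prefix) {
        if (value.length() < prefix.length()) {
            return false;
        }
        String head = value.substring(0, prefix.length());
        return collator.compare(head, prefix) == 0;
    }

    public static void main(String[] args) {
        // Assumption: an English collator; any locale could be passed in.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength treats "a" and "A" as equal.
        collator.setStrength(Collator.PRIMARY);
        String[] values = {"acorn", "aacorn", "aass", "Aasen", "Hatlen"};
        for (String v : values) {
            System.out.println(v + " -> " + startsWith(collator, v, "a"));
        }
    }
}
```

This only covers a fixed prefix. A full LIKE replacement would also need wildcard handling, which the sketch leaves out.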

Knut Anders Hatlen wrote:
> Kathey Marsden <kmarsdenderby@sbcglobal.net> writes:
>>Does anyone know of an easy built-in Java mechanism for locale-sensitive
>>matching?
>>I continue to work with a user trying to develop a strategy for
>>language-based string type handling in Derby 10.1.
>>The ordering seems doable with the approach in
>>For <, =, > comparisons I was able to implement a LOCALE_COMPARE
>>function pretty easily using Collators as well,
>>but matching (LIKE replacement) seems harder. For example, in
>>Norwegian we need to have "aa" be treated as one character and in the
>>US have it treated as two. So given the values acorn, aacorn, and
>>aass (a Norwegian brewery), and matching "a.*", we should see three
>>rows in English and just one in Norwegian.
> [snip]
> Hi Kathey,
> It is true that in a Norwegian phone book, Wanvik is listed before
> Waagan. However, Haas (which is not a Norwegian name) would be listed
> before Hatlen. Likewise, geographical names from other countries could
> have "aa" which should be treated as two characters in Norwegian
> (Saarland, Saarbrücken, Haag). Also, you could have composite words
> like "pizzaauksjon" (pizza auction - whatever that is) which would be
> listed before "pizzabakar" (pizza baker) in a dictionary. You could
> also have words where the stem ends with an a and the ending starts
> with an a, like "dataa" which consists of "data" (same word as in
> English) and "a" (definite article, plural, neuter).
> It is not possible to decide how "aa" should be treated without
> knowing the context, so in general I think it is best if Derby just
> treats "aa" as two characters and lets the application do the magic if
> magic is required.
> But, as many others have said, IANAL... (I Am Not A Linguist)
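
For reference, a Collator-based comparison of the kind Kathey describes might look roughly like the sketch below. The thread does not show her actual LOCALE_COMPARE function, so the class name, method name, and signature here are assumptions for illustration only. How a given locale's collation rules treat "aa" depends on the Java runtime, so the sketch makes no claim about the Norwegian result.

```java
import java.text.Collator;
import java.util.Locale;

class LocaleCompare {
    // Hypothetical stand-in for a LOCALE_COMPARE function (signature assumed).
    // Returns -1, 0, or 1 depending on how 'a' sorts relative to 'b'
    // under the collation rules of the locale named by 'languageTag'.
    static int localeCompare(String a, String b, String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        Collator collator = Collator.getInstance(locale);
        return Integer.signum(collator.compare(a, b));
    }

    public static void main(String[] args) {
        // Compare the same pairs under English and Norwegian rules.
        String[][] pairs = {{"acorn", "aacorn"}, {"Haas", "Hatlen"}};
        for (String tag : new String[] {"en", "no"}) {
            for (String[] p : pairs) {
                System.out.println(tag + ": " + p[0] + " vs " + p[1]
                        + " -> " + localeCompare(p[0], p[1], tag));
            }
        }
    }
}
```

Creating a Collator on every call keeps the sketch short. Code that compares many values would more likely create one Collator per locale and reuse it.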

Bernt Marius Johnsen, Database Technology Group,
Staff Engineer, Technical Lead Derby/Java DB
Sun Microsystems, Trondheim, Norway
