db-derby-dev mailing list archives

From Knut Anders Hatlen <Knut.Hat...@Sun.COM>
Subject Re: Language based matching
Date Mon, 10 Jul 2006 14:08:31 GMT
Kathey Marsden <kmarsdenderby@sbcglobal.net> writes:

> Does anyone know of an easy built-in Java mechanism for
> Locale-sensitive matching?
>
> I continue to work with a user trying to develop a strategy for
> language-based string type handling in Derby 10.1.
> The ordering seems doable with the approach in
> http://wiki.apache.org/db-derby/LanguageBasedOrdering
> For <, =, > comparisons I was able to implement a LOCALE_COMPARE
> function pretty easily using Collators as well,
> but matching (LIKE replacement) seems harder. For example, in
> Norwegian we need to have "aa" be treated as one character and in the
> US have it treated as two. So given the values acorn, aacorn, and
> aass (a Norwegian brewery), and matching "a.*", we should see three
> rows in English and just one in Norwegian.
[snip]

Hi Kathey,

It is true that in a Norwegian phone book, Wanvik is listed before
Waagan. However, Haas (which is not a Norwegian name) would be listed
before Hatlen. Likewise, geographical names from other countries could
have "aa" which should be treated as two characters in Norwegian
(Saarland, Saarbrücken, Haag). Also, you could have composite words
like "pizzaauksjon" (pizza auction - whatever that is) which would be
listed before "pizzabakar" (pizza baker) in a dictionary. You could
also have words where the stem ends with an a and the ending starts
with an a, like "dataa" which consists of "data" (same word as in
English) and "a" (definite article, plural, neuter).

It is not possible to decide how "aa" should be treated without
knowing the context, so in general I think it is best if Derby just
treats "aa" as two characters and lets the application do the magic if
magic is required.
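
For anyone who wants to try the examples above, here is a minimal sketch
using java.text.Collator. The class name LocaleCompareSketch and the helper
localeCompare are made up for illustration and are not Derby APIs or
Kathey's LOCALE_COMPARE function. How the "no-NO" collator orders words
containing "aa" depends on the collation rules shipped with the JDK in use,
so no output is claimed for that locale; the point is only that a collator
applies one rule to every word and cannot know that Haas or pizzaauksjon
need different treatment from Waagan.

```java
import java.text.Collator;
import java.util.Locale;

public class LocaleCompareSketch {

    // Hypothetical helper: returns -1, 0 or 1 depending on how the
    // collator for the given locale orders left relative to right.
    public static int localeCompare(Locale locale, String left, String right) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.TERTIARY);
        return Integer.signum(collator.compare(left, right));
    }

    public static void main(String[] args) {
        Locale english = Locale.ENGLISH;
        // Assumption: "no-NO" is the tag to use for Norwegian here.
        Locale norwegian = Locale.forLanguageTag("no-NO");

        // Word pairs taken from the discussion above.
        String[][] pairs = {
            {"Wanvik", "Waagan"},
            {"Haas", "Hatlen"},
            {"pizzaauksjon", "pizzabakar"},
            {"aacorn", "acorn"},
        };

        for (String[] pair : pairs) {
            System.out.println(pair[0] + " vs " + pair[1]
                + ": en=" + localeCompare(english, pair[0], pair[1])
                + " no=" + localeCompare(norwegian, pair[0], pair[1]));
        }
    }
}
```

If the application knows which values are Norwegian names and which are
foreign or composite words, it can pick the locale (or a plain
two-character comparison) per value, which is the kind of "magic" that is
hard to do inside the database.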

But, as many others have said, IANAL... (I Am Not A Linguist)

-- 
Knut Anders
