lucene-java-user mailing list archives

From "Mike Klaas" <mike.kl...@gmail.com>
Subject Re: Unicode Normalization
Date Thu, 12 Apr 2007 02:24:32 GMT
On 4/11/07, Chris Hostetter <hossman_lucene@fucit.org> wrote:
>
> : I have encountered a problem searching in my application because of
> : inconsistent Unicode normalization forms in the corpus (and the
> : queries). I would like to normalize to form NFKD in an analyzer (I
> : think). I was thinking about creating a filter similar to the
>
> i'm very naive to the multitudes of issues with charsets and
> character encodings, but isn't this a problem best solved when
> first constructing the java String or Reader object -- either from a file
> on disk or from a network socket of some kind?
>
> or am i misunderstanding your meaning of the word Normalization?  at
> first i thought you might be talking about something like the
> ISOLatin1AccentFilter but then i looked at the ICU url you mentioned and
> it seems to be all about byte=>character issues ... that doesn't sound
> like something you would really want to be doing in an Analyzer.

Unfortunately, there is a whole level of Unicode "encoding" issues
above the level of byte encoding.  Unicode characters do not map
one-to-one to code points:  a single character can often be represented
either by one precomposed code point or by a base code point followed by
a combining mark (e.g. "é" as U+00E9, or as "e" U+0065 plus U+0301).
Strings that look identical can therefore compare as unequal.  I have no
idea how java's String class handles this, but I doubt it does any
intelligent normalization.
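
A minimal sketch of the mismatch, assuming Java 6 or later, where
java.text.Normalizer is available.  The class name NormalizationDemo and
the helper equalsNormalized are made up for illustration; this is not a
Lucene filter, only a demonstration that both forms need to be normalized
before they compare as equal:

```java
import java.text.Normalizer;

public class NormalizationDemo {

    // Hypothetical helper: normalize both strings to the same form, then compare.
    public static boolean equalsNormalized(String a, String b, Normalizer.Form form) {
        return Normalizer.normalize(a, form).equals(Normalizer.normalize(b, form));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";   // "é" as a single code point
        String decomposed = "e\u0301";   // "e" followed by a combining acute accent

        // A plain String comparison looks at the raw code units only.
        System.out.println(precomposed.equals(decomposed));   // false

        // After normalizing both sides to NFKD the two spellings match.
        System.out.println(
            equalsNormalized(precomposed, decomposed, Normalizer.Form.NFKD));   // true
    }
}
```

In a search setting the same normalization would presumably have to be
applied both at index time and at query time, which is why doing it in
an analyzer is attractive.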

-Mike
