lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: Unicode Normalization
Date Wed, 11 Apr 2007 22:48:03 GMT

: I have encountered a problem searching in my application because of
: inconsistent unicode normalization forms in the corpus (and the
: queries). I would like to normalize to form NFKD in an analyzer (I
: think). I was thinking about creating a filter similar to the

i'm very naive about the multitude of issues with charsets and
character encodings, but isn't that a problem best solved when
first constructing the java String or Reader object -- either from a file
on disk or from a network socket of some kind?
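
A minimal sketch of that idea, assuming the input is UTF-8 and that Java 6's java.text.Normalizer is available: the bytes are decoded with an explicit charset, and the resulting text is normalized to NFKD before it reaches any Analyzer. The class name NormalizeOnRead and the helper readNormalized are hypothetical, made up for illustration; they are not part of Lucene or ICU.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NormalizeOnRead {

    // Hypothetical helper: decode the stream as UTF-8 (an assumption about
    // the input), then normalize the decoded text to NFKD.
    static String readNormalized(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        // Normalize the whole text at once so that a combining sequence
        // split across two buffer reads is still handled correctly.
        return Normalizer.normalize(sb, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) throws IOException {
        String composed = "caf\u00e9";     // precomposed e-acute
        String decomposed = "cafe\u0301";  // 'e' followed by combining acute

        String a = readNormalized(
            new ByteArrayInputStream(composed.getBytes(StandardCharsets.UTF_8)));
        String b = readNormalized(
            new ByteArrayInputStream(decomposed.getBytes(StandardCharsets.UTF_8)));

        // Both inputs should end up as the same char sequence after NFKD.
        System.out.println(a.equals(b));
    }
}
```

The same call would have to be applied to query strings before parsing, otherwise the index and the queries could still disagree on form.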

or am i misunderstanding your meaning of the word Normalization?  at
first i thought you might be talking about something like the
ISOLatin1AccentFilter but then i looked at the ICU url you mentioned and
it seems to be all about byte=>character issues ... that doesn't sound
like something you would really want to be doing in an Analyzer.

