Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (herse.apache.org: local policy)
Date: Wed, 11 Apr 2007 15:48:03 -0700 (PDT)
From: Chris Hostetter <hossman_lucene@fucit.org>
To: java-user@lucene.apache.org
Subject: Re: Unicode Normalization
In-Reply-To: <461D0628020000480001068B@ntgwgate.loc.gov>
Message-ID: <Pine.LNX.4.58.0704111540180.5697@hal.rescomp.berkeley.edu>
References: <98b0a44c0704111223r171498c7g92b91a1ea6f894c1@mail.gmail.com>
 <461D0628020000480001068B@ntgwgate.loc.gov>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


: I have encountered a problem searching in my application because of
: inconsistent Unicode normalization forms in the corpus (and the
: queries). I would like to normalize to form NFKD in an analyzer (I
: think). I was thinking about creating a filter similar to the

i'm very naive about the multitude of issues with charsets and
character encodings, but isn't this a problem best solved when
first constructing the Java String or Reader object -- either from a file
on disk or from a network socket of some kind?
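
to make the "when first constructing the String or Reader" idea concrete,
here is a minimal sketch that decodes the same bytes with two different
charsets. the class name CharsetDecoding, the helper decode() and the
sample bytes are made up for illustration; they are not from the original
poster's application.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows that the charset chosen when the Reader is
// built decides which characters the rest of the pipeline sees.
class CharsetDecoding {

    // Decode raw bytes through a Reader using an explicitly named charset.
    static String decode(byte[] raw, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader =
                 new InputStreamReader(new ByteArrayInputStream(raw), charset)) {
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, encoded as UTF-8 (5 bytes).
        byte[] raw = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};

        // Correct charset: 4 characters.
        System.out.println(decode(raw, StandardCharsets.UTF_8).length());
        // Wrong charset: the 2-byte sequence turns into 2 separate characters.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```

in a real application the ByteArrayInputStream would be a FileInputStream
or a socket stream; the point is only that the charset is named explicitly
instead of relying on the platform default.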

or am i misunderstanding your meaning of the word Normalization?  at
first i thought you might be talking about something like the
ISOLatin1AccentFilter but then i looked at the ICU URL you mentioned and
it seems to be all about byte=>character issues ... that doesn't sound
like something you would really want to be doing in an Analyzer.
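
for comparison, if "Normalization" is meant in the sense of the Unicode
normalization forms (NFKD etc.), it works on characters that have already
been decoded, not on bytes. the sketch below assumes java.text.Normalizer
from the JDK (Java 6 and later) rather than the ICU library mentioned in
the original question; the class name NfkdSketch and the helper toNfkd()
are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper: normalizes already-decoded text to form NFKD.
class NfkdSketch {

    static String toNfkd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // Same visible word, two different code point sequences.
        String precomposed = "caf\u00e9";   // e-acute as one code point
        String decomposed  = "cafe\u0301";  // e followed by combining acute

        // Raw comparison fails: prints false.
        System.out.println(precomposed.equals(decomposed));
        // After NFKD both sides are the same sequence: prints true.
        System.out.println(toNfkd(precomposed).equals(toNfkd(decomposed)));
    }
}
```

whether this call belongs in a TokenFilter or earlier, where the String is
first built, is the open question here; the sketch only shows what the
operation does to a String.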


-Hoss

