Date: Wed, 11 Apr 2007 15:48:03 -0700 (PDT)
From: Chris Hostetter
To: java-user@lucene.apache.org
Subject: Re: Unicode Normalization

: I have encountered a problem searching in my application because of
: inconsistent unicode normalization forms in the corpus (and the
: queries). I would like to normalize to form NFKD in an analyzer (I
: think). I was thinking about creating a filter similar to the

i'm very naive about the multitude of issues with charsets and character encodings, but isn't this a problem best solved when first constructing the java String or Reader object, either from a file on disk or from a network socket of some kind? or am i misunderstanding your meaning of the word Normalization?

at first i thought you might be talking about something like the ISOLatin1AccentFilter, but then i looked at the ICU url you mentioned and it seems to be all about byte=>character issues ... that doesn't sound like something you would really want to be doing in an Analyzer.

-Hoss
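
A minimal sketch of the two steps discussed in this message, kept separate so the difference is visible: decoding bytes into characters when the Reader is constructed, and applying a Unicode normalization form (NFKD here) to text that is already decoded. It assumes Java 8 or later and uses the JDK's java.text.Normalizer rather than ICU; the class name DecodeThenNormalize and its methods are hypothetical and are not part of Lucene or of anything proposed in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical illustration class; not a Lucene API.
public class DecodeThenNormalize {

    // Step 1: bytes -> characters. The charset is named explicitly when the
    // Reader is built, instead of relying on the platform default.
    // UTF-8 is an assumption made for this example.
    public static String readAll(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 2: characters -> characters. Normalization works on text that has
    // already been decoded; it does not look at bytes at all.
    public static String toNfkd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String composed = "caf\u00e9";     // precomposed e-acute
        String decomposed = "cafe\u0301";  // 'e' followed by a combining acute accent

        byte[] bytes = composed.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(bytes));

        // Correctly decoded, yet still a different character sequence.
        System.out.println(decoded.equals(decomposed));
        // After normalizing both sides to NFKD the sequences match.
        System.out.println(toNfkd(decoded).equals(toNfkd(decomposed)));
    }
}
```

In this sketch the decoded string and the decomposed string compare unequal even though both were decoded correctly, and compare equal once both pass through toNfkd. Whether that second step belongs in an Analyzer or earlier in the pipeline is the open question in the thread, and the sketch does not settle it.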