Date: Tue, 11 Mar 2008 19:19:38 -0700 (PDT)
From: Chris Hostetter
To: solr-user@lucene.apache.org
Subject: Re: Accented search

: It looks like a very promising approach for us. I'm going to implement
: a custom Tokeniser based on your suggestions and see how it goes. Thank
: you all for your comments!

you don't really need a custom tokenizer -- just a buffered TokenFilter
that clones the original token if it contains accent chars, mutates the
clone, and then emits it next with a positionIncrement of 0.

i'm kind of surprised ISOLatin1AccentFilter doesn't have an option to do
this already -- it would certainly be a worthy patch to commit if someone
wants to submit it back to lucene-java.

: > don't match the accents exactly they won't get any hits: e.g. if a word
: > contains two accented characters and the user only accents one of them in
: > their query, they won't match the accented or the unaccented version.

this could be accounted for by generating all of the permutations of
unaccented characters when indexing -- it wouldn't solve the problem of a
source term containing only one accent and the user querying with only one
accent but on a different character ... you could work around this by
putting all of the permutations in at index time, but querying on the
exact term and the no-accent term at query time.

-Hoss
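
A minimal sketch of the two ideas in this message, written in plain Python rather than against the Lucene API. Everything named here (`Token`, `strip_accent`, `accent_variants`, `expand_accents`, the `all_permutations` flag) is hypothetical and exists only for illustration. Accent stripping uses Unicode decomposition as a stand-in; it is not the character mapping ISOLatin1AccentFilter applies, and characters that do not decompose are left alone.

```python
import itertools
import unicodedata
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for an analysis token; not the Lucene class."""
    term: str
    position_increment: int


def strip_accent(ch: str) -> str:
    """Drop combining marks from one character, e.g. e-acute becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch


def accent_variants(term: str) -> List[str]:
    """Every term obtained by unaccenting any subset of accented characters.

    The (NFC-normalized) original comes first and the fully unaccented
    form comes last. A term with k accented characters yields 2**k variants.
    """
    term = unicodedata.normalize("NFC", term)
    choices = []
    for ch in term:
        plain = strip_accent(ch)
        choices.append((ch,) if plain == ch else (ch, plain))
    return ["".join(combo) for combo in itertools.product(*choices)]


def expand_accents(tokens: Iterable[Token],
                   all_permutations: bool = False) -> Iterator[Token]:
    """Emit each token, then its variants stacked at the same position.

    Variants get a position increment of 0 so they occupy the same
    position as the original token. With all_permutations=False only the
    fully unaccented form is added (the query-time case); with True every
    partial form is added as well (the index-time case).
    """
    for token in tokens:
        yield token
        variants = accent_variants(token.term)
        if len(variants) == 1:
            continue  # no accented characters, nothing to add
        extra = variants[1:] if all_permutations else variants[-1:]
        for variant in extra:
            yield Token(variant, 0)
```

Under these assumptions, index-time analysis would call `expand_accents(tokens, all_permutations=True)` and query-time analysis would use the default, so a query is matched on its exact term and its no-accent form only. The number of index-time variants doubles with each accented character, which may matter for heavily accented terms.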