Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Date: Tue, 11 Mar 2008 19:19:38 -0700 (PDT)
From: Chris Hostetter <hossman_lucene@fucit.org>
To: solr-user@lucene.apache.org
Subject: Re: Accented search
In-Reply-To: <3e7716cd0803111901j65a323cbta259b3b234594737@mail.gmail.com>
Message-ID: <Pine.LNX.4.62.0803111911570.25332@radix.cryptio.net>
References: <3e7716cd0803102100y22f50455s70fa3255d720d297@mail.gmail.com>
 <908893006339C0409519E4065DF3B249031884E1@mailserver.ualibrary.ualberta.ca>
 <3e7716cd0803111901j65a323cbta259b3b234594737@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Content-ID: <Pine.LNX.4.62.0803111918031.25332@radix.cryptio.net>

: It looks like a very promising approach for us. I'm going to implement 
: an custom Tokeniser based on your suggestions and see how it goes. Thank 
: you all for your comments!

you don't really need a custom tokenizer -- just a buffered TokenFilter 
that clones the original token if it contains accent chars, mutates the 
clone, and then emits it next with a positionIncrement of 0.
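
a rough sketch of that idea, in plain java rather than against the real
lucene TokenFilter API: Tok, AccentFoldingSketch, stripAccents and expand
are all made-up names for illustration, and the folding uses
java.text.Normalizer as a stand-in (an assumption on my part -- it only
strips combining marks, so it won't expand things like ligatures the way
a hand-built mapping table could).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene token: term text plus position increment.
final class Tok {
    final String term;
    final int positionIncrement;

    Tok(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }
}

final class AccentFoldingSketch {
    // Decompose to NFD, then drop the combining marks (the accents).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Emit every original term with increment 1; if the term has accents,
    // also emit the folded copy with increment 0 so that both forms occupy
    // the same position in the stream.
    static List<Tok> expand(List<String> terms) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Tok(term, 1));
            String folded = stripAccents(term);
            if (!folded.equals(term)) {
                out.add(new Tok(folded, 0));
            }
        }
        return out;
    }
}
```

in a real filter the "buffering" is just holding on to the folded copy
until the next call asks for a token; the list here plays that role.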

i'm kind of surprised ISOLatin1AccentFilter doesn't have an option to do 
this already -- it would certainly be a worthy patch to commit if someone 
wants to submit it back to lucene-java.

: > don't match the accents exactly they won't get any hits: e.g. if a word
: > contains two accented characters and the user only accents one of them in
: > their query, they won't match the accented or the unaccented version.

this could be accounted for by generating all of the permutations of 
unaccented characters when indexing -- it wouldn't solve the problem of a 
source term containing only one accent and the user querying with only one 
accent but on a different character ... you could work around this by 
putting all of the permutations in at index time, but querying on the exact 
term and the no-accent term at query time.
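
to make that concrete, here is a small self-contained sketch (plain java,
not lucene code; AccentVariantsSketch, indexVariants and queryTerms are
hypothetical names). it assumes terms are in precomposed form and folds
one char at a time with java.text.Normalizer, which is my assumption and
not what ISOLatin1AccentFilter does internally. note that a term with k
accented characters produces 2^k variants, so this can bloat the index.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

final class AccentVariantsSketch {
    // Decompose to NFD, then drop the combining marks (the accents).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Index time: every variant obtained by unaccenting any subset of the
    // accented characters (includes the original and the fully folded form).
    static Set<String> indexVariants(String term) {
        Set<String> variants = new LinkedHashSet<>();
        variants.add("");
        for (int i = 0; i < term.length(); i++) {
            String ch = term.substring(i, i + 1);
            String folded = stripAccents(ch);
            Set<String> next = new LinkedHashSet<>();
            for (String prefix : variants) {
                next.add(prefix + ch);              // keep the char as typed
                if (!folded.equals(ch)) {
                    next.add(prefix + folded);      // or drop its accent
                }
            }
            variants = next;
        }
        return variants;
    }

    // Query time: only the exact term and the fully unaccented term.
    static Set<String> queryTerms(String userTerm) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(userTerm);
        terms.add(stripAccents(userTerm));
        return terms;
    }
}
```

a query that accents only some of the source's accented characters hits
one of the indexed variants exactly; a query that accents a character the
source doesn't still matches through its no-accent form.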


-Hoss