lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rémy Amouroux <remy.amour...@kelkoo.com>
Subject Re: How to handle umlauts ?
Date Thu, 19 Sep 2002 07:37:13 GMT
Le mer 18/09/2002 à 22:55, Ian Parkin a écrit : 
> Hello all,
> 
> I suspect my answer will involve unicode, but I'd like to make sure that I 
> am going down the right path here.
> 
> I have 100,000+ small HTML files that are mainly in the english language. I 
> just noticed that we have some user names with umlauts. These are seemingly 
> stored and searchable as the '?' character.
> 
> My code is based on the demo code that is provided with Lucene, under the 
> 'demo' directory.
> 
> I am wondering what changes I will need to make to handle such characters as 
> umlauts within english text ?

Try to use an analyzer like StandardAnalyzer in both direction: while
indexing and while searching.
I don't remember if one of the filter used by StandardAnalyzer modifies
the accented letters, if not, you will have to create one.

The idea is to transform every word to a normalized form (removing
common words, removing accents [ü => u], making the word lowercase)
before indexing the word and before searching the word. That way,
someone looking for ümlaut will have the same results than someone
looking for umlaut. (and knowing the lazyness of most of common users,
they will thank oyu to make that posisble :-)

It's quite easy to implement a subclass of
org.apache.lucene.analysis.TokenFilter that will answer to your needs
and to use it in a subclass of org.apache.lucene.analysis.Analyzer
with all the supplementary filters you need to add.

Remy

-- 
E-mail : Remy.Amouroux@kelkoo.com
Kelkoo R&D Director (http://www.kelkoo.com/)


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message