lucene-java-user mailing list archives

From Andrew Green <>
Subject Re: Snowball and accents filter...?
Date Sat, 28 Apr 2007 12:18:58 GMT
On Fri, 27-04-2007 at 16:59 -0700, Chris Hostetter wrote:
> : In order to do this, we tried subclassing the SnowballAnalyzer... it
> : doesn't work yet, though. Here is the code of our custom class:
> At first glance, what you've got seems fine. Can you elaborate on what you
> mean by "it doesn't work"?
> Perhaps the issue is that the SnowballStemmer can't handle the accented
> characters, and you should strip them first, then stem?
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new StandardTokenizer(reader);
>     result = new StandardFilter(result);
>     result = new LowerCaseFilter(result);
>     if (stopSet != null)
>       result = new StopFilter(result, stopSet);
>     result = new ISOLatin1AccentFilter(result);
>     result = new SnowballFilter(result, name);
>     return result;
>   }
Thanks for your answer, Chris.

It doesn't work for the opposite reason: it requires words to be spelled
correctly, including accents, in order to stem them. So, for example,
"civilización" and its plural, "civilizaciones" are stemmed correctly,
but the accentless version, "civilizacion", doesn't get stemmed at all.
So if someone misspells the word, omitting the accent, in the search
query--a likely scenario--the only hits they get are identical
misspellings in the documents, if such things exist. But we need
stemming of both accented and unaccented versions of the word. Stemming
misspellings may sound inherently evil, I suppose, but it seems to be
our best bet.

We're currently trying to modify the SpanishStemmer to do this, but
haven't quite gotten it working yet.

Another option that I'm imagining might work, though less well, would be
to simultaneously maintain two indexes, one of correctly stemmed words
generated without the accents filter, and another of unstemmed words
with the accents stripped, and query both indexes when searching.
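
To make the two-index idea concrete, here is a minimal sketch in plain
Java. It deliberately does not use the Lucene API: the class DualIndex,
its methods, and the toyStem function are all hypothetical names made up
for illustration, and toyStem is only a toy stand-in for the Spanish
Snowball stemmer that knows two accented suffixes. Accent stripping uses
java.text.Normalizer, which (like an accents filter) also folds "ñ" to
"n".

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

public class DualIndex {
    // Index 1: stemmed terms, built from tokens with their accents intact.
    private final Map<String, Set<Integer>> stemmed = new HashMap<>();
    // Index 2: unstemmed terms with the accents stripped.
    private final Map<String, Set<Integer>> unaccented = new HashMap<>();
    private final Function<String, String> stemmer;

    public DualIndex(Function<String, String> stemmer) {
        this.stemmer = stemmer;
    }

    // Toy stand-in for the real stemmer (assumption, illustration only):
    // it removes just two suffixes and, like the behaviour described in
    // this message, only recognises the correctly accented singular.
    static String toyStem(String token) {
        return token.replaceAll("(aciones|aci\u00f3n)$", "");
    }

    // Decompose accented letters and drop the combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Add every whitespace-separated token of a document to both indexes.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            stemmed.computeIfAbsent(stemmer.apply(token), k -> new TreeSet<>()).add(docId);
            unaccented.computeIfAbsent(stripAccents(token), k -> new TreeSet<>()).add(docId);
        }
    }

    // Look the term up in both indexes and return the union of the hits.
    public Set<Integer> search(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(stemmed.getOrDefault(stemmer.apply(lower), Collections.emptySet()));
        hits.addAll(unaccented.getOrDefault(stripAccents(lower), Collections.emptySet()));
        return hits;
    }

    public static void main(String[] args) {
        DualIndex index = new DualIndex(DualIndex::toyStem);
        index.add(1, "civilizaci\u00f3n");
        index.add(2, "civilizaciones");
        System.out.println(index.search("civilizaci\u00f3n")); // [1, 2]
        System.out.println(index.search("civilizacion"));      // [1]
    }
}
```

The second query shows why this would work less well: the accentless
"civilizacion" only reaches the document containing the singular, through
the unstemmed index, and still misses the plural.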

Yet another possibility would be, I think, to silently use a dictionary
to correct spellings in queries before searching.
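
As a rough sketch of that dictionary idea, again in plain Java and not
tied to Lucene: build a lookup from the accent-stripped form of each
dictionary word to its correct spelling, and rewrite query terms through
it before they reach the analyzer. The class AccentRestorer and its
methods are hypothetical names, and the word list is assumed to come from
some Spanish dictionary we would have to supply.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class AccentRestorer {
    // Correctly spelled words, lowercased.
    private final Set<String> words = new HashSet<>();
    // Accent-stripped form -> correctly spelled word.
    private final Map<String, String> byStripped = new HashMap<>();

    public AccentRestorer(List<String> dictionary) {
        for (String word : dictionary) {
            String lower = word.toLowerCase(Locale.ROOT);
            words.add(lower);
            // If two words collide once stripped, the first one wins.
            byStripped.putIfAbsent(stripAccents(lower), lower);
        }
    }

    // Decompose accented letters and drop the combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Return the dictionary spelling for a query term, or the lowercased
    // term unchanged when it is already a dictionary word or is unknown.
    public String correct(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        if (words.contains(lower)) {
            return lower;
        }
        return byStripped.getOrDefault(stripAccents(lower), lower);
    }

    public static void main(String[] args) {
        AccentRestorer restorer = new AccentRestorer(
                Arrays.asList("civilizaci\u00f3n", "civilizaciones"));
        // The accentless query term is rewritten before stemming.
        System.out.println(restorer.correct("civilizacion"));
    }
}
```

The corrected term would then go through the normal stemming analyzer.
Pairs that differ only by an accent but are both valid words (such as
"esta" and "está") stay ambiguous; the sketch simply leaves a term alone
when it is already in the dictionary.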

A few test queries on Google suggest that it handles this roughly the
way we're trying to, though perhaps not exactly...

Thanks again,
