lucene-general mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: special characters "ø" indexing/searching
Date Sat, 20 Nov 2010 13:13:39 GMT
On Fri, Nov 19, 2010 at 6:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> The only proviso is that stemming and word segmentation might break if you
> change characters before stemming.  I don't think that would happen in
> English, French, Spanish, German, the Slavic languages that use Latin
> characters or the Scandinavian languages.  I am not entirely sure about
> Finnish and Hungarian.

Removing accents before stemming *will* break stemmers in basically
all of these languages, depending on the stemmer.
For the Snowball stemmers especially, the rules/affix lists are
sensitive to diacritics. You can see this in the description of the
rules here (French, for example):
http://snowball.tartarus.org/algorithms/french/stemmer.html
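
To make the point concrete, here is a minimal sketch in Python. It is
not the Snowball algorithm: toy_stem and TOY_SUFFIXES are hypothetical
names, and the single rule is a stand-in for an accent-bearing entry in
a real suffix list. It only shows that a rule keyed on an accented
suffix stops matching once the accents are folded away first.

```python
import unicodedata

def fold_accents(text):
    # Decompose to NFD, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical toy rule list: suffixes spelled with "e acute" (U+00E9).
# Longer suffix first so the plural form is tried before the singular.
TOY_SUFFIXES = ("it\u00e9s", "it\u00e9")

def toy_stem(word):
    # Compare in NFC so precomposed and decomposed input behave the same.
    word = unicodedata.normalize("NFC", word)
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

word = "nationalit\u00e9"

# Stem first: the accented suffix rule fires.
stem_then_fold = fold_accents(toy_stem(word))

# Fold first: the word now ends in plain "ite", so the rule never fires.
fold_then_stem = toy_stem(fold_accents(word))
```

With this toy rule, stemming before folding strips the suffix, while
folding before stemming leaves the word unstemmed, so the two orders
index different terms.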

I disagree with Hoss on this issue: removing diacritics in a filter is
not going to "mess up highlighting". The offsets are set by the
tokenizer, so it's no different from stemming or any other process.
The *only* situation where you should use a CharFilter is when you
must change this stuff before the tokenizer.
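
A rough sketch of why the offsets survive, written in Python rather
than against the Lucene API. The tokenize, fold_filter and highlight
functions are hypothetical stand-ins for a tokenizer, a token filter
and a highlighter; the assumption is simply that the filter rewrites
the term text and passes the start/end offsets through untouched.

```python
import re
import unicodedata

def tokenize(text):
    # The "tokenizer" records each term with its offsets in the original text.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def fold_filter(tokens):
    # The "filter" changes only the term; offsets are passed through as-is.
    folded_tokens = []
    for term, start, end in tokens:
        decomposed = unicodedata.normalize("NFD", term)
        folded = "".join(c for c in decomposed if not unicodedata.combining(c))
        folded_tokens.append((folded, start, end))
    return folded_tokens

def highlight(text, tokens, query_term):
    # Match on the indexed (folded) term, but slice the ORIGINAL text
    # using the offsets the tokenizer recorded.
    parts = []
    pos = 0
    for term, start, end in tokens:
        if term == query_term:
            parts.append(text[pos:start])
            parts.append("<b>" + text[start:end] + "</b>")
            pos = end
    parts.append(text[pos:])
    return "".join(parts)

text = "le caf\u00e9 est ouvert"
tokens = fold_filter(tokenize(text))
highlighted = highlight(text, tokens, "cafe")
```

Searching for the folded term "cafe" still wraps the accented word in
the original text, because the highlighter only ever consults the
offsets produced before the filter ran.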
