lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: special characters "ø" indexing/searching
Date Sat, 20 Nov 2010 21:51:10 GMT
Ahh... you are right about French, and Spanish should work minimally well.
 German should be fine.

On Sat, Nov 20, 2010 at 5:13 AM, Robert Muir <rcmuir@gmail.com> wrote:

> On Fri, Nov 19, 2010 at 6:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > The only proviso is that stemming and word segmentation might break if you
> > change characters before stemming.  I don't think that would happen in
> > English, French, Spanish, German, the Slavic languages that use Latin
> > characters or the Scandinavian languages.  I am not entirely sure about
> > Finnish and Hungarian.
>
> Removing accents before stemming *will* break stemmers in basically
> all of these languages, depending upon the stemmer.
> For the Snowball stemmers especially, the rules and affix lists are
> sensitive to diacritics. You can see this in the description of the
> rules here (French, for example):
> http://snowball.tartarus.org/algorithms/french/stemmer.html
>
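
To make the ordering problem concrete, here is a minimal Python sketch. It is an illustration only: `toy_stem` and its suffix list are hypothetical and are NOT the Snowball French rules, they only mimic the fact that real suffix rules contain accented characters. `strip_accents` is likewise a stand-in for whatever folding filter is in use.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical, heavily simplified suffix list, longest first.
# Assumption for illustration only; the real Snowball rules are far richer.
TOY_SUFFIXES = ("ités", "ité", "ées", "ée", "és", "é")


def toy_stem(word: str) -> str:
    """Strip the first matching accented suffix, keeping at least 3 letters."""
    word = unicodedata.normalize("NFC", word)
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Stem first, then fold: the accented suffix rule still fires.
stem_then_fold = strip_accents(toy_stem("qualité"))

# Fold first, then stem: "qualite" no longer ends in any listed suffix.
fold_then_stem = toy_stem(strip_accents("qualité"))

print(stem_then_fold)  # qual
print(fold_then_stem)  # qualite
```

The same word ends up as two different index terms depending on the order of the two steps. As a side note on the subject line of this thread, NFD decomposition alone leaves "ø" unchanged because it has no canonical decomposition, so a folding step needs an explicit mapping for it.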
> I disagree with Hoss on this issue: removing diacritics in a filter is
> not going to "mess up highlighting". The offsets are set by the
> tokenizer, so it's no different from stemming or any other process.
> The *only* situation where you should use a CharFilter is when you
> must change this stuff before the tokenizer.
>
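
The offsets argument can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Lucene code: `Token`, `tokenize`, `fold`, and `highlight` are hypothetical names chosen to mirror the tokenizer, token filter, and highlighter roles.

```python
import re
import unicodedata
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str   # text that goes into the index
    start: int  # offset into the ORIGINAL text
    end: int


def tokenize(text: str) -> List[Token]:
    """Toy tokenizer: the offsets are fixed here, once."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def fold(token: Token) -> Token:
    """Toy token filter: lowercases and strips accents, offsets untouched."""
    decomposed = unicodedata.normalize("NFD", token.term.lower())
    term = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return token._replace(term=term)


def highlight(text: str, tokens: List[Token], query_term: str) -> str:
    """Wrap matching tokens in <b> tags using the stored offsets."""
    out, pos = [], 0
    for tok in tokens:
        if tok.term == query_term:
            out.append(text[pos:tok.start])
            out.append("<b>" + text[tok.start:tok.end] + "</b>")
            pos = tok.end
    out.append(text[pos:])
    return "".join(out)


text = "Une qualité rare"
tokens = [fold(t) for t in tokenize(text)]
result = highlight(text, tokens, "qualite")
print(result)  # Une <b>qualité</b> rare
```

The filter rewrites only the term, so the highlighter still slices the original, accented text at the right positions. Changing the text before tokenization is the case where offsets would need correcting, which is the CharFilter situation described above.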
