lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: special characters "ø" indexing/searching
Date Fri, 19 Nov 2010 23:45:08 GMT
On Fri, Nov 19, 2010 at 3:39 PM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:

>
> : Shouldn't all ISO Latin accented characters translate one to one with
> : unaccented characters?
>
> a) i have no idea
>
> b) the OP never actually said that all their characters were in the
> ISOLatin1 range, just that they had tried using
> ISOLatin1AccentFilterFactory, which brings up the excellent point that if
> they have other "special" characters outside of ISOLatin1 that's all the
> more reason why they might want to consider using MappingCharFilterFactory
>
>
Not to mention that character filtering is liable to be more efficient than
munging the tokens.

The only proviso is that stemming and word segmentation might break if you
change characters before stemming.  I don't think that would happen in
English, French, Spanish, German, the Slavic languages that use Latin
characters or the Scandinavian languages.  I am not entirely sure about
Finnish and Hungarian.
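
To make the idea concrete, here is a rough sketch of folding characters before tokenization. It is plain Python using only the standard library, not Lucene or Solr code, and it is not how MappingCharFilterFactory is implemented. The names EXTRA_MAP, fold_chars and tokenize are made up for illustration, and the mapping table is an assumed sample, not a complete list. Note that "ø" has no Unicode decomposition, so stripping combining marks alone would leave it untouched; it needs an explicit mapping entry.

```python
import unicodedata

# Assumed sample mappings for characters that have no Unicode
# decomposition (illustrative only, not a complete table).
EXTRA_MAP = {"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "ß": "ss"}


def fold_chars(text):
    """Fold accented characters to plain ASCII-like equivalents."""
    out = []
    for ch in text:
        if ch in EXTRA_MAP:
            # Explicit mapping for characters such as "ø".
            out.append(EXTRA_MAP[ch])
            continue
        # Decompose (e.g. "é" becomes "e" + combining acute accent),
        # then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)


def tokenize(text):
    """Fold characters first, then lowercase and split on whitespace."""
    return fold_chars(text).lower().split()
```

Because the folding runs on the raw text, any stemmer placed after tokenize would see "sok" rather than "søk", which is exactly the case the proviso above is about.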
