lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Cyrillic problem
Date Mon, 01 Mar 2010 23:11:04 GMT
Hmmm, I'm nowhere near an expert on how the analyzers actually work, so I
have to punt a bit here. And certainly take the advice of any of "the
regulars" if they give it <G>...

But outside of stemming, Lucene/Solr really doesn't understand the concept
of "language". And even that isn't Lucene proper, it's the stemmer code. The
Analyzers are just concerned with producing tokens.

There are some special cases where, say, accents are folded. Various
European languages have acute-accented, grave-accented and unaccented forms
of the same letter (é, è, e), for instance, which should all be treated as
one character for a good search experience. See ISOLatin1AccentFilter.
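
To make "folding" concrete, here is a rough sketch of the idea in plain
Python. It is NOT how the Lucene filter is implemented (as far as I know
that one uses a fixed character mapping); fold_accents is just a name I made
up, and it uses Unicode decomposition from the standard library instead.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Illustrative accent folding (hypothetical helper, not Lucene code)."""
    # NFD splits a letter such as "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(fold_accents("é è e"))
```

One caution if you try this on Cyrillic text: a blanket strip like this also
changes letters such as "й" (it decomposes to "и" plus a breve), so it is
probably not something to apply blindly there.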

But as I remember (OK, it's 35 years ago that I had 2 years of Russian in
college, OK?) the Cyrillic alphabet doesn't suffer from that kind of
problem, so it's probably worth giving it a try. At the very worst, you
could pre-process your indexed text and query text to smooth out any
anomalies. If you want to dig further, you could make your own
analyzer.....
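
By "pre-process" I mean something along these lines, run over both the text
you index and the query text before either reaches Solr. This is only a
sketch: the function name and the particular clean-ups (Unicode
normalization, unifying apostrophe variants, lowercasing) are my guesses at
what "anomalies" might show up, not something I've verified against
Ukrainian data.

```python
import unicodedata

# Assumed anomaly: several apostrophe-like characters used interchangeably.
_APOSTROPHES = str.maketrans({"\u2019": "'", "\u02bc": "'", "\u0060": "'"})


def preprocess(text: str) -> str:
    """Hypothetical clean-up applied identically to indexed and query text."""
    # NFC makes composed and decomposed spellings of a letter compare equal.
    text = unicodedata.normalize("NFC", text)
    # Map the apostrophe variants onto a plain ASCII apostrophe.
    text = text.translate(_APOSTROPHES)
    # Lowercase so matching does not depend on case.
    return text.lower()


if __name__ == "__main__":
    print(preprocess("П\u2019ять"))
```

The important part is that the same function runs on both sides; if only
the indexed text (or only the query) is cleaned up, terms stop matching.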

HTH
Erick

On Mon, Mar 1, 2010 at 4:31 PM, michaelnazaruk <michaelnazaruk@gmail.com> wrote:

>
> Thank you! And one little question:
> Can I use RussianAnalyzer for Ukrainian characters?
> --
> View this message in context:
> http://old.nabble.com/Cyrillic-problem-tp27744106p27749323.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
