lucene-java-user mailing list archives

From Trejkaz <>
Subject Re: international stop set?
Date Sat, 27 Oct 2012 03:34:56 GMT
On Sat, Oct 27, 2012 at 1:53 PM, Tom <> wrote:
> Hello,
> using Lucene 4.0.0b, I am trying to get a superset of all stop words (for
> an international app).
> I have looked around, and not found anything specific. Is this the way to go?
> CharArraySet internationalSet = new CharArraySet(Version.LUCENE_40, 10000, false);
> internationalSet.addAll(ArabicAnalyzer.getDefaultStopSet());
> internationalSet.addAll(BulgarianAnalyzer.getDefaultStopSet());

This seems like a bad idea, because you will eventually hit a word
that is a stop word in one language but important to someone
searching in another. Even working solely in English, it didn't take
us long to find a stop word that one user actually wanted to search
for.
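
To illustrate the collision, here is a minimal plain-Java sketch. It
does not use Lucene at all: the two word lists are tiny hand-picked
samples (an assumption for illustration, not the contents of any
analyzer's getDefaultStopSet()), and the class and method names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopSetCollision {
    // Hand-picked samples for illustration only; NOT Lucene's default stop sets.
    static final Set<String> ENGLISH_SAMPLE =
            new HashSet<>(Arrays.asList("the", "a", "of", "to"));
    static final Set<String> GERMAN_SAMPLE =
            new HashSet<>(Arrays.asList("der", "die", "das", "war", "also"));

    // Builds the "international" superset by merging two per-language sets.
    static Set<String> union(Set<String> first, Set<String> second) {
        Set<String> merged = new HashSet<>(first);
        merged.addAll(second);
        return merged;
    }

    // Lowercases, splits on whitespace, and drops any token found in stopWords.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> international = union(ENGLISH_SAMPLE, GERMAN_SAMPLE);

        // With English-only stop words, the English query survives intact.
        System.out.println(removeStopWords("die hard", ENGLISH_SAMPLE));

        // With the merged set, the German article "die" removes an English verb.
        System.out.println(removeStopWords("die hard", international));
    }
}
```

The more languages are merged into the set, the more likely it is that
some ordinary word in one of them gets dropped this way.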

For international purposes, I would just avoid using stop words.
You're going to have more than enough pain just coming up with a
sensible analysis path (advance warning: in any given language people
will complain about some feature in StandardAnalyzer.)

I assume people still recommend one field per language with a
different analyser on each, which pushes the problem to query
generation time (how the user specifies which language they're
searching in.)
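
As a rough sketch of what "pushing the problem to query generation
time" could look like, the plain-Java example below picks a field name
from a language code and builds a field:term query string. The field
names ("body", "body_ar", "body_bg"), the class, and its methods are
all hypothetical; it does no escaping or analysis and is not part of
Lucene.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LanguageFieldRouter {
    // Maps a lowercase language code to the (hypothetical) field indexed for it.
    private final Map<String, String> fieldByLanguage = new HashMap<>();
    private final String fallbackField;

    LanguageFieldRouter(String fallbackField) {
        this.fallbackField = fallbackField;
    }

    void register(String languageCode, String fieldName) {
        fieldByLanguage.put(languageCode.toLowerCase(Locale.ROOT), fieldName);
    }

    // Returns the field for the language, or the fallback if it is unknown.
    String fieldFor(String languageCode) {
        if (languageCode == null) {
            return fallbackField;
        }
        String field = fieldByLanguage.get(languageCode.toLowerCase(Locale.ROOT));
        return field != null ? field : fallbackField;
    }

    // Builds a simple "field:term" query string; no escaping is attempted.
    String buildQuery(String languageCode, String term) {
        return fieldFor(languageCode) + ":" + term;
    }

    public static void main(String[] args) {
        LanguageFieldRouter router = new LanguageFieldRouter("body");
        router.register("ar", "body_ar");
        router.register("bg", "body_bg");

        System.out.println(router.buildQuery("bg", "example"));
        System.out.println(router.buildQuery("xx", "example"));
    }
}
```

How the language code reaches this point (a UI selector, a query
prefix, language detection) is exactly the open question raised above.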

