lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: international stop set?
Date Sat, 27 Oct 2012 03:34:56 GMT
On Sat, Oct 27, 2012 at 1:53 PM, Tom <fivemiletom@gmail.com> wrote:
> Hello,
>
> using Lucene 4.0.0b, I am trying to get a superset of all stop words (for
> an international app).
> I have looked around, and not found anything specific. Is this the way to go?
>
> CharArraySet internationalSet = new CharArraySet(Version.LUCENE_40, 10000, false);
> internationalSet.addAll(ArabicAnalyzer.getDefaultStopSet());
> internationalSet.addAll(BulgarianAnalyzer.getDefaultStopSet());

This seems like a bad idea, because sooner or later you will hit a
word that is a stop word in one language but an important search term
for someone working in another. Even working solely in English, it
didn't take us long to find a stop word that a user actually wanted
to search for...
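
To make the collision concrete, here is a minimal sketch in plain Java
with no Lucene dependency. The two stop lists are tiny hand-picked
excerpts chosen for illustration (they are not the analyzers' full
default sets), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopSetCollision {
    // Illustrative excerpts only; assumed to resemble typical English and German stop lists.
    static final Set<String> ENGLISH_STOPS = new HashSet<>(Arrays.asList("the", "and", "of", "will"));
    static final Set<String> GERMAN_STOPS = new HashSet<>(Arrays.asList("der", "die", "das", "war"));

    // Builds the "international" superset by merging two stop lists.
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> merged = new HashSet<>(a);
        merged.addAll(b);
        return merged;
    }

    // Lower-cases the text, splits on whitespace, and drops every token found in the stop set.
    static List<String> filter(String text, Set<String> stops) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> merged = union(ENGLISH_STOPS, GERMAN_STOPS);
        // English-only stop list keeps the meaningful English word "war".
        System.out.println(filter("war and peace", ENGLISH_STOPS)); // [war, peace]
        // The merged list drops it, because "war" is a common German verb form.
        System.out.println(filter("war and peace", merged)); // [peace]
    }
}
```

With the merged set, an English-speaking user searching for "war" would
match nothing, even though the word is perfectly meaningful to them.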

For international purposes, I would just avoid using stop words
altogether. You're going to have more than enough pain just coming up
with a sensible analysis path. (Advance warning: in any given
language, people will complain about some feature of StandardAnalyzer.)

I assume people still recommend one field per language, with a
different analyser on each. That pushes the problem to query
generation time: the user has to specify, somehow, which language
they're searching in.
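
As a rough sketch of what the query-generation side might look like,
the plain-Java helper below maps a language code to a per-language
field name and builds a "field:term" clause in the classic query
syntax. The "body_en" naming scheme and the class and method names are
assumptions made for illustration, and the term is not escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LanguageFields {
    // Hypothetical naming scheme: one index field per language, e.g. "body_en".
    static String fieldFor(String baseField, String languageCode) {
        return baseField + "_" + languageCode.toLowerCase(Locale.ROOT);
    }

    // Builds one "field:term" clause per requested language and joins them with OR.
    // The term is used as-is; real code would need to escape query syntax characters.
    static String query(String baseField, List<String> languageCodes, String term) {
        List<String> clauses = new ArrayList<>();
        for (String code : languageCodes) {
            clauses.add(fieldFor(baseField, code) + ":" + term);
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        List<String> languages = new ArrayList<>();
        languages.add("en");
        languages.add("DE");
        System.out.println(query("body", languages, "war")); // body_en:war OR body_de:war
    }
}
```

Each per-language field would then be analysed with its own analyser at
both index and query time, so "war" survives in the English field even
if the German field's analyser drops it.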

TX
