lucene-java-user mailing list archives

From Tom <>
Subject Re: international stop set?
Date Sat, 27 Oct 2012 08:43:28 GMT
On Fri, Oct 26, 2012 at 8:34 PM, Trejkaz <> wrote:

> On Sat, Oct 27, 2012 at 1:53 PM, Tom <> wrote:
> > Hello,
> >
> > using Lucene 4.0.0b, I am trying to get a superset of all stop words (for
> > an international app).
> > I have looked around, and not found anything specific. Is this the way
> > to go?
> >
> > CharArraySet internationalSet = new CharArraySet(Version.LUCENE_40,
> >     10000, false);
> > internationalSet.addAll(ArabicAnalyzer.getDefaultStopSet());
> > internationalSet.addAll(BulgarianAnalyzer.getDefaultStopSet());
>
> This seems like a bad idea because you're going to eventually hit a
> word which is a stop word in one language which is important for
> someone in another. Even working solely in English, it didn't take us
> long to find a stop word which one user actually wanted to search
> for...
>
> For international purposes, I would just avoid using stop words.
> You're going to have more than enough pain just coming up with a
> sensible analysis path (advance warning: in any given language people
> will complain about some feature in StandardAnalyzer.)
>
> I assume people still recommend one field per language with a
> different analyser on each, which pushes the problem to query
> generation time (how the user specifies which language they're
> searching for.)
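
To make the collision described above concrete, here is a minimal sketch in plain Java, with no Lucene dependency. The two word lists are tiny hand-picked samples for illustration only, not the contents of any analyzer's default stop set, and the class and method names are hypothetical. It shows how a merged stop set can silently drop a term that matters in another language: "die" is a German article but a meaningful English word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopSetCollision {
    // Illustrative samples only; NOT the real default stop sets of any analyzer.
    static final Set<String> ENGLISH_STOPS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));
    static final Set<String> GERMAN_STOPS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und"));

    // Builds the "international" superset by merging two per-language sets.
    static Set<String> union(Set<String> first, Set<String> second) {
        Set<String> merged = new HashSet<>(first);
        merged.addAll(second);
        return merged;
    }

    // Lowercases, splits on whitespace, and drops every token found in stops.
    static List<String> removeStops(String text, Set<String> stops) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> merged = union(ENGLISH_STOPS, GERMAN_STOPS);
        // English-only filtering keeps both words of the query.
        System.out.println(removeStops("die hard", ENGLISH_STOPS));
        // The merged set treats "die" as a stop word and loses it.
        System.out.println(removeStops("die hard", merged));
    }
}
```

With per-language sets the English query keeps both of its terms; with the merged set only "hard" survives. The more languages are merged, the more such collisions there will be.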

Thanks TX.
Aha! Exactly the problem! And just because the user-agent is set to one
language doesn't mean all the search terms will be in that language!
For example, someone might type in the name of an English event (such as
Halloween) first and the name of their home town second, to see whether
there are any matches describing how the event is celebrated there. The
home town will very likely be in the native language, even if the
user-agent or the first search term isn't.
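
One way to cope with mixed-language queries under the one-field-per-language layout is to try every term against every language field, rather than guessing a single language from the user-agent. The sketch below is plain Java that only builds a query string in Lucene's classic query syntax (field:term, AND, OR, parentheses). The field names content_en and content_de and the class itself are hypothetical, and terms are assumed to contain no query-syntax characters that would need escaping. In a real setup the string would presumably go to a query parser backed by a per-field analyzer (Lucene ships PerFieldAnalyzerWrapper for that purpose).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CrossLanguageQuery {
    // Hypothetical field names: one indexed field per language.
    static final List<String> LANGUAGE_FIELDS = Arrays.asList("content_en", "content_de");

    // One term may match in any language field: (f1:term OR f2:term ...).
    static String anyField(String term, List<String> fields) {
        List<String> clauses = new ArrayList<>();
        for (String field : fields) {
            clauses.add(field + ":" + term);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }

    // Every term is required, but each may match in a different language field.
    static String allTerms(List<String> terms, List<String> fields) {
        List<String> groups = new ArrayList<>();
        for (String term : terms) {
            groups.add(anyField(term, fields));
        }
        return String.join(" AND ", groups);
    }

    public static void main(String[] args) {
        // An English event name plus a home town, as in the example above.
        System.out.println(allTerms(Arrays.asList("halloween", "hamburg"), LANGUAGE_FIELDS));
    }
}
```

The trade-off is that the query grows with the number of languages, and a term can still be removed as a stop word in one field while matching in another.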

> TX