lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: searching with stemming
Date Mon, 09 Jun 2014 10:39:46 GMT
On Mon, Jun 9, 2014 at 7:57 PM, Jamie <jamie@mailarchiva.com> wrote:
> Greetings
>
> Our app currently uses language specific analysers (e.g. EnglishAnalyzer,
> GermanAnalyzer, etc.). We need an option to disable stemming. What's the
> recommended way to do this? These analyzers do not include an option to
> disable stemming, only a parameter to specify a list of words for which
> stemming should not apply.
> Furthermore, my understanding is that the StandardAnalyzer is tied to
> English specifically.

I would say that StandardAnalyzer is actually a weird mix. UAX#29
(which is what StandardTokenizer implements) has rules which are not
convenient for analysing English (e.g. it breaks on neither colons nor
underscores). Ultimately, if you want English-friendly tokenisation,
you should be using additional filters or customising the analyser
itself to work around these shortcomings.

Presumably EnglishAnalyzer is already working around these (or, if it
isn't, it should be). I don't know, because we don't use it.

> I am trying to avoid having to override each of these analyzers with an option
> to disable stemming. Is there a better alternative?

Rather than using the Analyzer classes, we use TokenizerFactory and
TokenFilterFactory (actually our own alternatives with the same
names - we're still on an older version of Lucene) and a single
Analyzer class which is configured by passing in the appropriate
factories.

Then there is a separate abstraction of analysis language, which takes
the stemming setting and whatever other settings you might have, and
creates the appropriate list of factories.
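
To make that concrete, here is a rough sketch of the shape in plain
Java. Everything in it is a stand-in invented for illustration
(ChainSketch, Filter, Language, the "toyStem" suffix stripper); none
of it is Lucene API, and a real version would hand tokenizer and
filter factories to an actual Analyzer rather than mapping strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Stand-in types for illustration only; these are NOT Lucene classes.
final class ChainSketch {

    /** A named token transformation, playing the role of a token filter factory. */
    static final class Filter {
        final String name;
        final UnaryOperator<String> op;

        Filter(String name, UnaryOperator<String> op) {
            this.name = name;
            this.op = op;
        }
    }

    /** Hypothetical "analysis language": turns settings into an ordered filter list. */
    enum Language {
        ENGLISH;

        List<Filter> filters(boolean stemming) {
            List<Filter> chain = new ArrayList<>();
            chain.add(new Filter("lowercase", t -> t.toLowerCase(Locale.ROOT)));
            if (stemming) {
                // Toy suffix stripper standing in for a real stemmer.
                chain.add(new Filter("toyStem",
                        t -> t.length() > 3 && t.endsWith("s")
                                ? t.substring(0, t.length() - 1)
                                : t));
            }
            return chain;
        }
    }

    /** The single "analyzer": whitespace tokenisation, then each filter in order. */
    static List<String> analyze(String text, List<Filter> chain) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields no tokens
            }
            for (Filter f : chain) {
                token = f.op.apply(token);
            }
            out.add(token);
        }
        return out;
    }
}
```

Disabling stemming is then just a matter of asking the language for
its chain with the flag off, e.g.
ChainSketch.analyze(text, ChainSketch.Language.ENGLISH.filters(false)),
with no per-language Analyzer subclass involved.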

This way, you still get the reuse, but also gain an additional form of
backwards compatibility: even if you change which filters are used
from version to version, you can record exactly which filters were
used to create each index.
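
As a minimal sketch of that last point, assuming each component in the
chain has a stable name: the ChainRecord helper below is hypothetical
(not part of Lucene) and the component names are only examples. It
turns the ordered list of names into a single string that can be kept
wherever the application stores its index metadata, and reads it back
when the index is reopened.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene: records which chain built an index.
final class ChainRecord {

    /** Joins component names in order, e.g. "standard,lowercase,stop". */
    static String encode(List<String> names) {
        return String.join(",", names);
    }

    /** Recovers the ordered component names from a stored record. */
    static List<String> decode(String record) {
        if (record.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(record.split(","));
    }
}
```

On reopening an index, the decoded names can be mapped back to the
same factories, so the query-time chain matches the one used at index
time even if the defaults have changed since. This assumes component
names never contain a comma.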

TX
