lucene-dev mailing list archives

From Lukáš Vlček <>
Subject Re: KStem custom lexicons configuration possible?
Date Mon, 20 Jun 2011 12:42:49 GMT
Hi Robert,

I think the difference between KStem and other stemmers (at least those that
I am aware of, like snowball or porter) is that KStem is expected to produce
real, valid words, so other filtering (for example, synonym expansion) can be
applied more easily to the tokens after stemming. I am not sure whether this
is the case with the other stemmers available in Lucene.

Also, my impression from reading the original paper by Robert Krovetz was
that being able to fine-tune the lexicons is practical. That is why I was
expecting the KStem API to support this as well.

Well, maybe a combination of KStem with the override filter (but applied AFTER
stemming) would work too in this case :-)
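
To make the "override AFTER stemming" idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API at all: `toyStem` is a hypothetical stand-in for a real stemmer such as KStem, and `stemThenOverride` is an invented helper name. The only point is that the map is keyed on the stemmer's output rather than on the original term.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class PostStemOverrideSketch {

    // Hypothetical toy stemmer (NOT KStem): strips one trailing "s" from longer terms.
    static String toyStem(String term) {
        return (term.length() > 3 && term.endsWith("s"))
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // Stem first, then look the stemmed form up in a correction map.
    // The map keys are stemmer OUTPUTS, so they depend on the stemmer in use.
    static String stemThenOverride(String term,
                                   UnaryOperator<String> stemmer,
                                   Map<String, String> postStemOverrides) {
        String stemmed = stemmer.apply(term);
        return postStemOverrides.getOrDefault(stemmed, stemmed);
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new HashMap<>();
        // The toy stemmer turns "analysis" into the non-word "analysi"; repair it.
        overrides.put("analysi", "analysis");

        System.out.println(stemThenOverride("analysis", PostStemOverrideSketch::toyStem, overrides));
        System.out.println(stemThenOverride("cats", PostStemOverrideSketch::toyStem, overrides));
    }
}
```

Because the keys are stemmed forms, such a map has to be revisited whenever the stemmer or its lexicon changes, which is exactly the coupling discussed in the quoted message below.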


On Mon, Jun 20, 2011 at 2:32 PM, Robert Muir <> wrote:

> On Mon, Jun 20, 2011 at 8:23 AM, Lukáš Vlček <> wrote:
> > Hi Robert,
> > this sounds interesting; I will look at it in more detail.
> > However, I do not think this is really a general solution. If I
> > understand StemmerOverrideFilter correctly (from a quick glance), it
> > relies on the fact that you *know* the exact term (the key in the map)
> > in advance. In other words, if I wanted to "fix" some term produced by
> > the KStem filter, I would have to know the product of the stemming in
> > advance. This means that if I switch to snowball or porter or another
> > stemmer instead of KStem, or simply update something else in the
> > filtering chain, then I am in trouble. Also, if I understand correctly,
> > the original KStem implementation can still get updates to its lexicons,
> > which means that once these updates are ported to the Java
> > implementation, they can again cause problems with an existing override
> > filter setup.
> > More generally, is there any reason why lexicons are not configurable in
> Because we have StemmerOverrideFilter and KeywordMarkerFilter.
> look at the source code to Kstem: it uses maps and sets of exceptions,
> this is what these filters provide in a general way
> (StemmerOverrideFilter being the map, and KeywordMarkerFilter being
> the set).
> we added these to work across the board with all lucene stemmers for
> this reason.
> I don't understand your concerns at all to be honest, they make no
> sense to me. If we "updated" kstem or any other algorithm: it would
> break whatever you are doing either way. A hashmap is a hashmap.
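
For illustration, the map-plus-set idea described in the quoted reply can be sketched in plain Java. This is not the Lucene API: `analyze`, `stemA`, and `stemB` are hypothetical names, and the two toy stemmers merely stand in for interchangeable real stemmers. The sketch assumes both lookups happen before stemming and are keyed on the original term.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class PreStemExceptionsSketch {

    // Hypothetical toy stemmer A: strips one trailing "s" from longer terms.
    static String stemA(String term) {
        return (term.length() > 3 && term.endsWith("s"))
                ? term.substring(0, term.length() - 1)
                : term;
    }

    // Hypothetical toy stemmer B: strips a trailing "es" first, else behaves like A.
    static String stemB(String term) {
        return (term.length() > 4 && term.endsWith("es"))
                ? term.substring(0, term.length() - 2)
                : stemA(term);
    }

    // Set of protected terms and map of overrides, both consulted BEFORE stemming.
    static String analyze(String term,
                          Set<String> protectedTerms,
                          Map<String, String> overrides,
                          UnaryOperator<String> stemmer) {
        if (protectedTerms.contains(term)) {
            return term;                 // the "set": leave the term untouched
        }
        String replacement = overrides.get(term);
        if (replacement != null) {
            return replacement;          // the "map": use the supplied stem
        }
        return stemmer.apply(term);      // everything else goes to the stemmer
    }

    public static void main(String[] args) {
        Set<String> protectedTerms = Set.of("news");
        Map<String, String> overrides = Map.of("mice", "mouse");

        for (String term : new String[] {"news", "mice", "boxes"}) {
            String a = analyze(term, protectedTerms, overrides, PreStemExceptionsSketch::stemA);
            String b = analyze(term, protectedTerms, overrides, PreStemExceptionsSketch::stemB);
            System.out.println(term + " -> A: " + a + ", B: " + b);
        }
    }
}
```

In this sketch, listed terms come out the same under either stemmer, while unlisted terms such as "boxes" still depend on which stemmer is plugged in.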
