Hi Robert,

I think the difference between KStem and other stemmers (at least those that I am aware of, like snowball or porter) is that KStem is expected to produce real, valid words, so other filtering (for example synonym expansion) can be applied to the tokens after stemming more easily. I am not sure whether this is the case with the other stemmers available in Lucene.

Also, my impression from reading the original paper by Robert Krovetz was that the ability to fine-tune the lexicons is a practical feature. That is why I was expecting the KStem API to support this as well.

Well, maybe a combination of KStem with the override filter (but applied AFTER stemming) would work too in this case :-)
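
To make the "override AFTER stemming" idea concrete, here is a rough sketch in plain Java. It deliberately uses no Lucene classes: toyStem is a hypothetical stand-in for KStem (it does not reproduce KStem's real behavior), and the override entry is made up for illustration. The point is only that the map is keyed by the stemmer's output, not by the original term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PostStemOverride {

    // Hypothetical toy stemmer, NOT KStem: "ies" -> "y", otherwise drop a trailing "s".
    static String toyStem(String term) {
        if (term.endsWith("ies") && term.length() > 4) {
            return term.substring(0, term.length() - 3) + "y";
        }
        if (term.endsWith("s") && !term.endsWith("ss") && term.length() > 3) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Stem every token first, then look the resulting stem up in the override map.
    static List<String> stemThenOverride(List<String> tokens,
                                         Function<String, String> stemmer,
                                         Map<String, String> overrides) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String stem = stemmer.apply(token);
            out.add(overrides.getOrDefault(stem, stem));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("new", "news"); // made-up fix: the toy stemmer turns "news" into "new"

        List<String> result = stemThenOverride(
                Arrays.asList("news", "ponies", "glass"),
                PostStemOverride::toyStem,
                overrides);
        System.out.println(result); // prints [news, pony, glass]
    }
}
```

One caveat the sketch makes visible: because the key is the stemmer's output, a genuine input token "new" would also be rewritten to "news", and the keys have to be revisited whenever the stemmer in front of the map changes.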


On Mon, Jun 20, 2011 at 2:32 PM, Robert Muir <rcmuir@gmail.com> wrote:
On Mon, Jun 20, 2011 at 8:23 AM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
> Hi Robert,
> this sounds interesting, I will look at it in more detail.
> However, I do not think this is really a general solution. If I understand
> StemmerOverrideFilter correctly (from a quick glance) it relies on the fact
> that you *know* the exact term (the key in the map) in advance. In other words
> if I wanted to "fix" some term produced by the KStem filter I would have to
> know what the product of the stemming is in advance. Now, this means that if I
> switch to snowball or porter or another stemmer instead of KStem, or simply
> update something else in the filtering chain, then I am in trouble. Also, if I
> understand the original KStem implementation correctly, it can still get
> updates to its lexicons, which means that once these updates are ported to the
> Java implementation they can again cause problems with an existing override
> filter setup.
> More generally, is there any reason why lexicons are not configurable in

Because we have StemmerOverrideFilter and KeywordMarkerFilter.

look at the source code of KStem: it uses maps and sets of exceptions,
this is what these filters provide in a general way
(StemmerOverrideFilter being the map, and KeywordMarkerFilter being
the set).

we added these to work across the board with all lucene stemmers for
this reason.
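
As a rough illustration of the map/set split described above, here is a sketch in plain Java. It is not the Lucene API: the class, the method names and toyStem are hypothetical, and the sample entries are made up. It only mirrors the idea that a set protects terms from stemming, a map forces a stem for a known input term, and everything else goes to whatever stemmer is plugged in.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class PreStemExceptions {

    // Hypothetical toy stemmer (not any Lucene stemmer): drop a trailing "s".
    static String toyStem(String term) {
        if (term.endsWith("s") && !term.endsWith("ss") && term.length() > 3) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Exceptions are checked on the INPUT term, before the stemmer runs.
    static String analyze(String term,
                          Set<String> keywords,
                          Map<String, String> overrides,
                          Function<String, String> stemmer) {
        if (keywords.contains(term)) {
            return term;                 // the "set" role: leave the term untouched
        }
        String forced = overrides.get(term);
        if (forced != null) {
            return forced;               // the "map" role: use the dictionary stem
        }
        return stemmer.apply(term);      // no exception: fall through to the stemmer
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Collections.singleton("news"));
        Map<String, String> overrides = new HashMap<>();
        overrides.put("mice", "mouse");

        Function<String, String> stemmer = PreStemExceptions::toyStem;
        System.out.println(analyze("news", keywords, overrides, stemmer)); // prints news
        System.out.println(analyze("mice", keywords, overrides, stemmer)); // prints mouse
        System.out.println(analyze("cats", keywords, overrides, stemmer)); // prints cat
    }
}
```

Because the lookups use the input term, the same set and map keep working if toyStem is swapped for a different stemmer function.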

I don't understand your concerns at all to be honest, they make no
sense to me. If we "updated" kstem or any other algorithm: it would
break whatever you are doing either way. A hashmap is a hashmap.
