lucene-dev mailing list archives

From "Jamie Johnson (Commented) (JIRA)" <>
Subject [jira] [Commented] (SOLR-3231) Add the ability to KStemmer to preserve the original token when stemming
Date Mon, 12 Mar 2012 13:02:45 GMT


Jamie Johnson commented on SOLR-3231:

This should (unless I messed it up, which is possible) also produce a token for the original
term. For instance, if the term is "bricks", it should produce tokens for both "bricks" and
"brick". If that's not the case, please let me know.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.solr.analysis;

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Simple tests to ensure the kstem filter factory is working.
 */
public class TestKStemFilterFactory extends BaseTokenTestCase {
  public void testStemming() throws Exception {
    Reader reader = new StringReader("bricks");
    KStemFilterFactory factory = new KStemFilterFactory();
    // The final lowerCase argument (false) is assumed; this line is cut off in the archived message.
    TokenStream stream = factory.create(new MockTokenizer(reader, MockTokenizer.WHITESPACE, false));
    // Expect the original term first, then its stem at the same position (position increment 0).
    assertTokenStreamContents(stream, new String[] { "bricks", "brick" }, new int[]{1, 0});
  }
}


That is what this tests, right?
> Add the ability to KStemmer to preserve the original token when stemming
> ------------------------------------------------------------------------
>                 Key: SOLR-3231
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Jamie Johnson
>         Attachments: KStemFilter.patch
> While using the PorterStemmer, I found that it was often far too aggressive in its
> stemming. In my particular case it is unrealistic to provide a protected word list that
> captures all possible words which should not be stemmed. To avoid this, I proposed a
> solution whereby we store the original token as well as the stemmed token, so exact
> searches would always work. Based on discussions on the mailing list with Ahmet Arslan,
> I believe the attached patch to KStemmer provides the desired capabilities through a
> configuration parameter. This is largely a copy of the
> org.apache.lucene.wordnet.SynonymTokenFilter.
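
To make the intended token layout concrete, here is a minimal plain-Java sketch of the idea described above. It is not the attached patch and does not use the Lucene API: the class PreserveOriginalSketch, its Token type, and the toyStem function are all hypothetical names made up for illustration, and toyStem only strips a trailing "s" as a stand-in for a real stemmer. The sketch emits each original term with a position increment of 1 and, when the stem differs, stacks the stem on the same position with an increment of 0, which is the layout the test above asserts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Plain-Java illustration only; none of these names come from the patch or from Lucene. */
final class PreserveOriginalSketch {

  /** A term plus its position increment (0 means "same position as the previous token"). */
  static final class Token {
    final String term;
    final int positionIncrement;

    Token(String term, int positionIncrement) {
      this.term = term;
      this.positionIncrement = positionIncrement;
    }

    @Override
    public String toString() {
      return term + "(+" + positionIncrement + ")";
    }
  }

  /** Toy stand-in for a stemmer: strips one trailing "s". This is NOT the KStem algorithm. */
  static String toyStem(String term) {
    return term.length() > 1 && term.endsWith("s")
        ? term.substring(0, term.length() - 1)
        : term;
  }

  /** Emits every original term, followed by its stem at the same position when the two differ. */
  static List<Token> analyze(List<String> terms, Function<String, String> stemmer) {
    List<Token> out = new ArrayList<>();
    for (String term : terms) {
      out.add(new Token(term, 1)); // the original token advances the position
      String stem = stemmer.apply(term);
      if (!stem.equals(term)) {
        out.add(new Token(stem, 0)); // the stem is stacked on the original's position
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [red(+1), bricks(+1), brick(+0)]
    System.out.println(analyze(Arrays.asList("red", "bricks"), PreserveOriginalSketch::toyStem));
  }
}
```

Because the stem shares a position with the original, an exact query for "bricks" and a stemmed query for "brick" would both match the same position. Terms the stemmer leaves unchanged produce a single token.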
