lucene-general mailing list archives

From sarfaraz masood <sarfarazmasood2...@yahoo.com>
Subject how to apply stemming to the index ?
Date Fri, 02 Jul 2010 09:08:44 GMT

I want to stem the terms in my index, but I am currently using StandardAnalyzer, which does
not perform any kind of stemming.

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);


After some searching I found code for a PorterStemAnalyzer, but it has some problems:



import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;

import java.io.Reader;
import java.util.Hashtable;


 // PorterStemAnalyzer processes input
 // text by stemming English words to their roots.
 // This Analyzer also converts the input to lower case
 // and removes stop words.  A small set of default stop
 // words is defined in the STOP_WORDS
 // array, but a caller can specify an alternative set
 // of stop words by calling non-default constructor.


public class PorterStemAnalyzer extends Analyzer
{
    private static Hashtable _stopTable;

   
     // An array containing some common English words
     // that are usually not useful for searching.
    
    public static final String[] STOP_WORDS =
    {
        "0", "1", "2", "3", "4", "5", "6", "7", "8",
        "9", "000", "$",
        "about", "after", "all", "also", "an", "and",
        "another", "any", "are", "as", "at", "be",
        "because", "been", "before", "being", "between",
        "both", "but", "by", "came", "can", "come",
        "could", "did", "do", "does", "each", "else",
        "for", "from", "get", "got", "has", "had",
        "he", "have", "her", "here", "him", "himself",
        "his", "how","if", "in", "into", "is", "it",
        "its", "just", "like", "make", "many", "me",
        "might", "more", "most", "much", "must", "my",
        "never", "now", "of", "on", "only", "or",
        "other", "our", "out", "over", "re", "said",
        "same", "see", "should", "since", "so", "some",
        "still", "such", "take", "than", "that", "the",
        "their", "them", "then", "there", "these",
        "they", "this", "those", "through", "to", "too",
        "under", "up", "use", "very", "want", "was",
        "way", "we", "well", "were", "what", "when",
        "where", "which", "while", "who", "will",
        "with", "would", "you", "your",
        "a", "b", "c", "d", "e", "f", "g", "h", "i",
        "j", "k", "l", "m", "n", "o", "p", "q", "r",
        "s", "t", "u", "v", "w", "x", "y", "z"
    };


     // Builds an analyzer.
   
    public PorterStemAnalyzer()
    {
        this(STOP_WORDS);
    }

      //Builds an analyzer with the given stop words.
     
     //@param stopWords a String array of stop words
     
    public PorterStemAnalyzer(String[] stopWords)
    {
        _stopTable = StopFilter.makeStopTable(stopWords);
    }

  
     // Processes the input by first converting it to
     // lower case, then by eliminating stop words, and
     // finally by performing Porter stemming on it.
     //
     // @param reader the Reader that
     //               provides access to the input text
     // @return an instance of TokenStream
     
    public final TokenStream tokenStream(Reader reader)
    {
        return new PorterStemFilter(
            new StopFilter(new LowerCaseTokenizer(reader),
                _stopTable));
    }
}

*Errors marked in bold.
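
To make clear what I expect the analyzer to do, here is a rough plain-Java sketch of the three stages described in the comments above: lower-casing, stop-word removal, and stemming. It does not use Lucene at all. The class name PipelineSketch, the tiny stop list, and the crude suffix stripper are made up for illustration only; in particular, the suffix stripper is a stand-in and is not the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineSketch {
    // Hypothetical, deliberately tiny stop list (not the STOP_WORDS array above).
    private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "are", "is", "of", "the", "to"));

    // Crude suffix stripper used as a stand-in; this is NOT the Porter algorithm.
    // It removes the first matching suffix if at least three letters remain.
    static String crudeStem(String term) {
        String[] suffixes = {"ing", "ed", "s"};
        for (String suffix : suffixes) {
            if (term.endsWith(suffix) && term.length() - suffix.length() >= 3) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    // Splits on non-letters, lower-cases, drops stop words, then stems.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        // A trailing space is appended so the last token is flushed too.
        for (char c : (text + " ").toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                String token = current.toString();
                current.setLength(0);
                if (!STOP_WORDS.contains(token)) {
                    terms.add(crudeStem(token));
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [cat, jump, walk]
        System.out.println(analyze("The cats are jumping and walked"));
    }
}
```

The order matters here: stop words are checked before stemming, so the stop list has to contain unstemmed forms, which is also how the filter chain in PorterStemAnalyzer is arranged.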


Please let me know if there is an alternative way to apply stemming to the index if this is 


-Sarfaraz



