lucene-java-user mailing list archives

From "Vladimir Yuryev" <vyur...@rambler.ru>
Subject Patches for RussianAnalyzer
Date Wed, 17 Mar 2004 15:02:12 GMT
Dear developers,

I am a Lucene user who works with RussianAnalyzer. There is one
problem specific to this analyzer: it takes the Russian encoding as a
parameter (as you know, having several code tables for a single
language never fails to delight). Users in Eastern Europe, and anyone
running applications in Russian, mostly use windows-1251 as their
basic encoding, because MS Windows is the most widespread client
platform. I therefore propose changing the no-argument constructor so
that it defaults to "Cp1251".
This change will remove confusion (for newcomers to Lucene or to
Russian) and will make it easier to switch analyzers in multilingual
search.
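
To illustrate why the charset parameter matters, here is a minimal sketch (not part of the patch) of how letter-index stop words are turned into strings through a charset table. The class name CharsetTableSketch and the unicodeTable() helper are made up for this example, and the table is only an assumed stand-in for the tables in RussianCharsets. Only the lookup loop mirrors makeStopWords in the patch below.

```java
public class CharsetTableSketch {

    // Hypothetical stand-in for a charset table: the 32 lowercase
    // Cyrillic letters U+0430..U+044F in alphabet order. The real
    // tables in RussianCharsets may be laid out differently.
    static char[] unicodeTable() {
        char[] table = new char[32];
        for (int i = 0; i < table.length; i++) {
            table[i] = (char) ('\u0430' + i);
        }
        return table;
    }

    // Same idea as makeStopWords in the patch: every element of word
    // is a letter index (0 = A, 1 = B, ...) looked up in the table.
    static String translate(char[] word, char[] charset) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < word.length; j++) {
            sb.append(charset[word[j]]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] bez = {1, 5, 7}; // B, E, Z
        String word = translate(bez, unicodeTable());
        // Compare against the expected Cyrillic word, written with escapes.
        System.out.println(word.equals("\u0431\u0435\u0437"));
    }
}
```

With a different table the same indices produce different characters, so an analyzer built with one table will not match text tokenized with another. That is the confusion a single default is meant to avoid.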
#########################################################
Index: RussianAnalyzer.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/ru/RussianAnalyzer.java,v
retrieving revision 1.6
diff -u -r1.6 RussianAnalyzer.java
--- RussianAnalyzer.java        12 Mar 2004 09:43:48 -0000      1.6
+++ RussianAnalyzer.java        17 Mar 2004 11:45:28 -0000
@@ -1,297 +1,318 @@
-package org.apache.lucene.analysis.ru;
-
-/* ====================================================================
- * The Apache Software License, Version 1.1
- *
- * Copyright (c) 2001 The Apache Software Foundation.  All rights
- * reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *
- * 3. The end-user documentation included with the redistribution,
- *    if any, must include the following acknowledgment:
- *       "This product includes software developed by the
- *        Apache Software Foundation (http://www.apache.org/)."
- *    Alternately, this acknowledgment may appear in the software itself,
- *    if and wherever such third-party acknowledgments normally appear.
- *
- * 4. The names "Apache" and "Apache Software Foundation" and
- *    "Apache Lucene" must not be used to endorse or promote products
- *    derived from this software without prior written permission. For
- *    written permission, please contact apache@apache.org.
- *
- * 5. Products derived from this software may not be called "Apache",
- *    "Apache Lucene", nor may "Apache" appear in their name, without
- *    prior written permission of the Apache Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
- * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- * ====================================================================
- *
- * This software consists of voluntary contributions made by many
- * individuals on behalf of the Apache Software Foundation.  For more
- * information on the Apache Software Foundation, please see
- * <http://www.apache.org/>.
- */
-
-import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.StopFilter;
-import org.apache.lucene.analysis.TokenStream;
-
-import java.io.Reader;
-import java.util.Hashtable;
-import java.util.Set;
-import java.util.HashSet;
-
-/**
- * Analyzer for Russian language. Supports an external list of stopwords (words that
- * will not be indexed at all).
- * A default set of stopwords is used unless an alternative list is specified.
- *
- * @author  Boris Okner, b.okner@rogers.com
- * @version $Id: RussianAnalyzer.java,v 1.6 2004/03/12 09:43:48 ehatcher Exp $
- */
-public final class RussianAnalyzer extends Analyzer
-{
-    // letters
-    private static char A = 0;
-    private static char B = 1;
-    private static char V = 2;
-    private static char G = 3;
-    private static char D = 4;
-    private static char E = 5;
-    private static char ZH = 6;
-    private static char Z = 7;
-    private static char I = 8;
-    private static char I_ = 9;
-    private static char K = 10;
-    private static char L = 11;
-    private static char M = 12;
-    private static char N = 13;
-    private static char O = 14;
-    private static char P = 15;
-    private static char R = 16;
-    private static char S = 17;
-    private static char T = 18;
-    private static char U = 19;
-    private static char F = 20;
-    private static char X = 21;
-    private static char TS = 22;
-    private static char CH = 23;
-    private static char SH = 24;
-    private static char SHCH = 25;
-    private static char HARD = 26;
-    private static char Y = 27;
-    private static char SOFT = 28;
-    private static char AE = 29;
-    private static char IU = 30;
-    private static char IA = 31;
-
-    /**
-     * List of typical Russian stopwords.
-     */
-    private static char[][] RUSSIAN_STOP_WORDS = {
-        {A},
-        {B, E, Z},
-        {B, O, L, E, E},
-        {B, Y},
-        {B, Y, L},
-        {B, Y, L, A},
-        {B, Y, L, I},
-        {B, Y, L, O},
-        {B, Y, T, SOFT},
-        {V},
-        {V, A, M},
-        {V, A, S},
-        {V, E, S, SOFT},
-        {V, O},
-        {V, O, T},
-        {V, S, E},
-        {V, S, E, G, O},
-        {V, S, E, X},
-        {V, Y},
-        {G, D, E},
-        {D, A},
-        {D, A, ZH, E},
-        {D, L, IA},
-        {D, O},
-        {E, G, O},
-        {E, E},
-        {E, I_,},
-        {E, IU},
-        {E, S, L, I},
-        {E, S, T, SOFT},
-        {E, SHCH, E},
-        {ZH, E},
-        {Z, A},
-        {Z, D, E, S, SOFT},
-        {I},
-        {I, Z},
-        {I, L, I},
-        {I, M},
-        {I, X},
-        {K},
-        {K, A, K},
-        {K, O},
-        {K, O, G, D, A},
-        {K, T, O},
-        {L, I},
-        {L, I, B, O},
-        {M, N, E},
-        {M, O, ZH, E, T},
-        {M, Y},
-        {N, A},
-        {N, A, D, O},
-        {N, A, SH},
-        {N, E},
-        {N, E, G, O},
-        {N, E, E},
-        {N, E, T},
-        {N, I},
-        {N, I, X},
-        {N, O},
-        {N, U},
-        {O},
-        {O, B},
-        {O, D, N, A, K, O},
-        {O, N},
-        {O, N, A},
-        {O, N, I},
-        {O, N, O},
-        {O, T},
-        {O, CH, E, N, SOFT},
-        {P, O},
-        {P, O, D},
-        {P, R, I},
-        {S},
-        {S, O},
-        {T, A, K},
-        {T, A, K, ZH, E},
-        {T, A, K, O, I_},
-        {T, A, M},
-        {T, E},
-        {T, E, M},
-        {T, O},
-        {T, O, G, O},
-        {T, O, ZH, E},
-        {T, O, I_},
-        {T, O, L, SOFT, K, O},
-        {T, O, M},
-        {T, Y},
-        {U},
-        {U, ZH, E},
-        {X, O, T, IA},
-        {CH, E, G, O},
-        {CH, E, I_},
-        {CH, E, M},
-        {CH, T, O},
-        {CH, T, O, B, Y},
-        {CH, SOFT, E},
-        {CH, SOFT, IA},
-        {AE, T, A},
-        {AE, T, I},
-        {AE, T, O},
-        {IA}
-    };
-
-    /**
-     * Contains the stopwords used with the StopFilter.
-     */
-    private Set stopSet = new HashSet();
-
-    /**
-     * Charset for Russian letters.
-     * Represents encoding for 32 lowercase Russian letters.
-     * Predefined charsets can be taken from RussianCharSets class
-     */
-    private char[] charset;
-
-
-    public RussianAnalyzer() {
-        charset = RussianCharsets.UnicodeRussian;
-        stopSet = StopFilter.makeStopSet(
-                    makeStopWords(RussianCharsets.UnicodeRussian));
-    }
-
-    /**
-     * Builds an analyzer.
-     */
-    public RussianAnalyzer(char[] charset)
-    {
-        this.charset = charset;
-        stopSet = StopFilter.makeStopSet(makeStopWords(charset));
-    }
-
-    /**
-     * Builds an analyzer with the given stop words.
-     */
-    public RussianAnalyzer(char[] charset, String[] stopwords)
-    {
-        this.charset = charset;
-        stopSet = StopFilter.makeStopSet(stopwords);
-    }
-
-    // Takes russian stop words and translates them to a String array, using
-    // the given charset
-    private static String[] makeStopWords(char[] charset)
-    {
-        String[] res = new String[RUSSIAN_STOP_WORDS.length];
-        for (int i = 0; i < res.length; i++)
-        {
-            char[] theStopWord = RUSSIAN_STOP_WORDS[i];
-            // translate the word,using the charset
-            StringBuffer theWord = new StringBuffer();
-            for (int j = 0; j < theStopWord.length; j++)
-            {
-                theWord.append(charset[theStopWord[j]]);
-            }
-            res[i] = theWord.toString();
-        }
-        return res;
-    }
-
-    /**
-     * Builds an analyzer with the given stop words.
-     * @todo create a Set version of this ctor
-     */
-    public RussianAnalyzer(char[] charset, Hashtable stopwords)
-    {
-        this.charset = charset;
-        stopSet = new HashSet(stopwords.keySet());
-    }
-
-    /**
-     * Creates a TokenStream which tokenizes all the text in the provided Reader.
-     *
-     * @return  A TokenStream build from a RussianLetterTokenizer filtered with
-     *                  RussianLowerCaseFilter, StopFilter, and RussianStemFilter
-     */
-    public TokenStream tokenStream(String fieldName, Reader reader)
-    {
-        TokenStream result = new RussianLetterTokenizer(reader, charset);
-        result = new RussianLowerCaseFilter(result, charset);
-        result = new StopFilter(result, stopSet);
-        result = new RussianStemFilter(result, charset);
-        return result;
-    }
-}
+package org.apache.lucene.analysis.ru;
+
+/* ====================================================================
+ * The Apache Software License, Version 1.1
+ *
+ * Copyright (c) 2001 The Apache Software Foundation.  All rights
+ * reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *
+ * 3. The end-user documentation included with the redistribution,
+ *    if any, must include the following acknowledgment:
+ *       "This product includes software developed by the
+ *        Apache Software Foundation (http://www.apache.org/)."
+ *    Alternately, this acknowledgment may appear in the software itself,
+ *    if and wherever such third-party acknowledgments normally appear.
+ *
+ * 4. The names "Apache" and "Apache Software Foundation" and
+ *    "Apache Lucene" must not be used to endorse or promote products
+ *    derived from this software without prior written permission. For
+ *    written permission, please contact apache@apache.org.
+ *
+ * 5. Products derived from this software may not be called "Apache",
+ *    "Apache Lucene", nor may "Apache" appear in their name, without
+ *    prior written permission of the Apache Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
+ * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ * ====================================================================
+ *
+ * This software consists of voluntary contributions made by many
+ * individuals on behalf of the Apache Software Foundation.  For more
+ * information on the Apache Software Foundation, please see
+ * <http://www.apache.org/>.
+ */
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.StopFilter;
+import org.apache.lucene.analysis.TokenStream;
+
+import java.io.Reader;
+import java.util.Hashtable;
+import java.util.Set;
+import java.util.HashSet;
+
+/**
+ * Analyzer for Russian language. Supports an external list of stopwords (words that
+ * will not be indexed at all).
+ * A default set of stopwords is used unless an alternative list is specified.
+ *
+ * @author  Boris Okner, b.okner@rogers.com
+ * @version $Id: RussianAnalyzer.java,v 1.6 2004/03/12 09:43:48 ehatcher Exp $
+ */
+public final class RussianAnalyzer extends Analyzer
+{
+    // letters
+    private static char A = 0;
+    private static char B = 1;
+    private static char V = 2;
+    private static char G = 3;
+    private static char D = 4;
+    private static char E = 5;
+    private static char ZH = 6;
+    private static char Z = 7;
+    private static char I = 8;
+    private static char I_ = 9;
+    private static char K = 10;
+    private static char L = 11;
+    private static char M = 12;
+    private static char N = 13;
+    private static char O = 14;
+    private static char P = 15;
+    private static char R = 16;
+    private static char S = 17;
+    private static char T = 18;
+    private static char U = 19;
+    private static char F = 20;
+    private static char X = 21;
+    private static char TS = 22;
+    private static char CH = 23;
+    private static char SH = 24;
+    private static char SHCH = 25;
+    private static char HARD = 26;
+    private static char Y = 27;
+    private static char SOFT = 28;
+    private static char AE = 29;
+    private static char IU = 30;
+    private static char IA = 31;
+
+    /**
+     * List of typical Russian stopwords.
+     */
+    private static char[][] RUSSIAN_STOP_WORDS = {
+        {A},
+        {B, E, Z},
+        {B, O, L, E, E},
+        {B, Y},
+        {B, Y, L},
+        {B, Y, L, A},
+        {B, Y, L, I},
+        {B, Y, L, O},
+        {B, Y, T, SOFT},
+        {V},
+        {V, A, M},
+        {V, A, S},
+        {V, E, S, SOFT},
+        {V, O},
+        {V, O, T},
+        {V, S, E},
+        {V, S, E, G, O},
+        {V, S, E, X},
+        {V, Y},
+        {G, D, E},
+        {D, A},
+        {D, A, ZH, E},
+        {D, L, IA},
+        {D, O},
+        {E, G, O},
+        {E, E},
+        {E, I_,},
+        {E, IU},
+        {E, S, L, I},
+        {E, S, T, SOFT},
+        {E, SHCH, E},
+        {ZH, E},
+        {Z, A},
+        {Z, D, E, S, SOFT},
+        {I},
+        {I, Z},
+        {I, L, I},
+        {I, M},
+        {I, X},
+        {K},
+        {K, A, K},
+        {K, O},
+        {K, O, G, D, A},
+        {K, T, O},
+        {L, I},
+        {L, I, B, O},
+        {M, N, E},
+        {M, O, ZH, E, T},
+        {M, Y},
+        {N, A},
+        {N, A, D, O},
+        {N, A, SH},
+        {N, E},
+        {N, E, G, O},
+        {N, E, E},
+        {N, E, T},
+        {N, I},
+        {N, I, X},
+        {N, O},
+        {N, U},
+        {O},
+        {O, B},
+        {O, D, N, A, K, O},
+        {O, N},
+        {O, N, A},
+        {O, N, I},
+        {O, N, O},
+        {O, T},
+        {O, CH, E, N, SOFT},
+        {P, O},
+        {P, O, D},
+        {P, R, I},
+        {S},
+        {S, O},
+        {T, A, K},
+        {T, A, K, ZH, E},
+        {T, A, K, O, I_},
+        {T, A, M},
+        {T, E},
+        {T, E, M},
+        {T, O},
+        {T, O, G, O},
+        {T, O, ZH, E},
+        {T, O, I_},
+        {T, O, L, SOFT, K, O},
+        {T, O, M},
+        {T, Y},
+        {U},
+        {U, ZH, E},
+        {X, O, T, IA},
+        {CH, E, G, O},
+        {CH, E, I_},
+        {CH, E, M},
+        {CH, T, O},
+        {CH, T, O, B, Y},
+        {CH, SOFT, E},
+        {CH, SOFT, IA},
+        {AE, T, A},
+        {AE, T, I},
+        {AE, T, O},
+        {IA}
+    };
+
+    /**
+     * Contains the stopwords used with the StopFilter.
+     */
+    private Set stopSet = new HashSet();
+
+    /**
+     * Charset for Russian letters.
+     * Represents encoding for 32 lowercase Russian letters.
+     * Predefined charsets can be taken from RussianCharSets class
+     */
+    private char[] charset;
+
+       /**
+        * Builds an analyzer with the default charset (CP1251).
+        */
+       public RussianAnalyzer() {
+               charset = RussianCharsets.CP1251;
+               stopSet = StopFilter.makeStopSet(
+                                       makeStopWords(RussianCharsets.CP1251));
+       }
+
+       /**
+        * Builds an analyzer with the default charset (CP1251) and the given stop words.
+        */
+       public RussianAnalyzer(String[] stopwords)
+       {
+               charset = RussianCharsets.CP1251;
+               stopSet = StopFilter.makeStopSet(stopwords);
+       }
+
+       /**
+       * Builds an analyzer with the default charset (CP1251) and the given stop words.
+       * @todo create a Set version of this ctor
+       */
+   public RussianAnalyzer(Hashtable stopwords)
+   {
+          charset = RussianCharsets.CP1251;
+          stopSet = new HashSet(stopwords.keySet());
+   }
+
+    /**
+     * Builds an analyzer.
+     */
+    public RussianAnalyzer(char[] charset)
+    {
+        this.charset = charset;
+        stopSet = StopFilter.makeStopSet(makeStopWords(charset));
+    }
+
+    /**
+     * Builds an analyzer with the given stop words.
+     */
+    public RussianAnalyzer(char[] charset, String[] stopwords)
+    {
+        this.charset = charset;
+        stopSet = StopFilter.makeStopSet(stopwords);
+    }
+
+    // Takes russian stop words and translates them to a String array, using
+    // the given charset
+    private static String[] makeStopWords(char[] charset)
+    {
+        String[] res = new String[RUSSIAN_STOP_WORDS.length];
+        for (int i = 0; i < res.length; i++)
+        {
+            char[] theStopWord = RUSSIAN_STOP_WORDS[i];
+            // translate the word,using the charset
+            StringBuffer theWord = new StringBuffer();
+            for (int j = 0; j < theStopWord.length; j++)
+            {
+                theWord.append(charset[theStopWord[j]]);
+            }
+            res[i] = theWord.toString();
+        }
+        return res;
+    }
+
+    /**
+     * Builds an analyzer with the given stop words.
+     * @todo create a Set version of this ctor
+     */
+    public RussianAnalyzer(char[] charset, Hashtable stopwords)
+    {
+        this.charset = charset;
+        stopSet = new HashSet(stopwords.keySet());
+    }
+
+    /**
+     * Creates a TokenStream which tokenizes all the text in the provided Reader.
+     *
+     * @return  A TokenStream build from a RussianLetterTokenizer filtered with
+     *                  RussianLowerCaseFilter, StopFilter, and RussianStemFilter
+     */
+    public TokenStream tokenStream(String fieldName, Reader reader)
+    {
+        TokenStream result = new RussianLetterTokenizer(reader, charset);
+        result = new RussianLowerCaseFilter(result, charset);
+        result = new StopFilter(result, stopSet);
+        result = new RussianStemFilter(result, charset);
+        return result;
+    }
+}
##########################################
Index: RussianLetterTokenizer.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/ru/RussianLetterTokenizer.java,v
retrieving revision 1.2
diff -u -r1.2 RussianLetterTokenizer.java
--- RussianLetterTokenizer.java 12 Dec 2002 05:10:11 -0000      1.2
+++ RussianLetterTokenizer.java 17 Mar 2004 11:46:58 -0000
@@ -1,96 +1,102 @@
-package org.apache.lucene.analysis.ru;
-
-/* ====================================================================
- * The Apache Software License, Version 1.1
- *
- * Copyright (c) 2001 The Apache Software Foundation.  All rights
- * reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *
- * 3. The end-user documentation included with the redistribution,
- *    if any, must include the following acknowledgment:
- *       "This product includes software developed by the
- *        Apache Software Foundation (http://www.apache.org/)."
- *    Alternately, this acknowledgment may appear in the software itself,
- *    if and wherever such third-party acknowledgments normally appear.
- *
- * 4. The names "Apache" and "Apache Software Foundation" and
- *    "Apache Lucene" must not be used to endorse or promote products
- *    derived from this software without prior written permission. For
- *    written permission, please contact apache@apache.org.
- *
- * 5. Products derived from this software may not be called "Apache",
- *    "Apache Lucene", nor may "Apache" appear in their name, without
- *    prior written permission of the Apache Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
- * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- * ====================================================================
- *
- * This software consists of voluntary contributions made by many
- * individuals on behalf of the Apache Software Foundation.  For more
- * information on the Apache Software Foundation, please see
- * <http://www.apache.org/>.
- */
-
-import java.io.Reader;
-import org.apache.lucene.analysis.CharTokenizer;
-
-/**
- * A RussianLetterTokenizer is a tokenizer that extends LetterTokenizer by additionally looking up letters
- * in a given "russian charset". The problem with LeterTokenizer is that it uses Character.isLetter() method,
- * which doesn't know how to detect letters in encodings like CP1252 and KOI8
- * (well-known problems with 0xD7 and 0xF7 chars)
- *
- * @author  Boris Okner, b.okner@rogers.com
- * @version $Id: RussianLetterTokenizer.java,v 1.2 2002/12/12 05:10:11 otis Exp $
- */
-
-public class RussianLetterTokenizer extends CharTokenizer
-{
-    /** Construct a new LetterTokenizer. */
-    private char[] charset;
-
-    public RussianLetterTokenizer(Reader in, char[] charset)
-    {
-        super(in);
-        this.charset = charset;
-    }
-
-    /**
-     * Collects only characters which satisfy
-     * {@link Character#isLetter(char)}.
-     */
-    protected boolean isTokenChar(char c)
-    {
-        if (Character.isLetter(c))
-            return true;
-        for (int i = 0; i < charset.length; i++)
-        {
-            if (c == charset[i])
-                return true;
-        }
-        return false;
-    }
-}
+package org.apache.lucene.analysis.ru;
+
+/* ====================================================================
+ * The Apache Software License, Version 1.1
+ *
+ * Copyright (c) 2001 The Apache Software Foundation.  All rights
+ * reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *
+ * 3. The end-user documentation included with the redistribution,
+ *    if any, must include the following acknowledgment:
+ *       "This product includes software developed by the
+ *        Apache Software Foundation (http://www.apache.org/)."
+ *    Alternately, this acknowledgment may appear in the software itself,
+ *    if and wherever such third-party acknowledgments normally appear.
+ *
+ * 4. The names "Apache" and "Apache Software Foundation" and
+ *    "Apache Lucene" must not be used to endorse or promote products
+ *    derived from this software without prior written permission. For
+ *    written permission, please contact apache@apache.org.
+ *
+ * 5. Products derived from this software may not be called "Apache",
+ *    "Apache Lucene", nor may "Apache" appear in their name, without
+ *    prior written permission of the Apache Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
+ * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ * ====================================================================
+ *
+ * This software consists of voluntary contributions made by many
+ * individuals on behalf of the Apache Software Foundation.  For more
+ * information on the Apache Software Foundation, please see
+ * <http://www.apache.org/>.
+ */
+
+import java.io.Reader;
+import org.apache.lucene.analysis.CharTokenizer;
+
+/**
+ * A RussianLetterTokenizer is a tokenizer that extends LetterTokenizer by additionally looking up letters
+ * in a given "russian charset". The problem with LeterTokenizer is that it uses Character.isLetter() method,
+ * which doesn't know how to detect letters in encodings like CP1252 and KOI8
+ * (well-known problems with 0xD7 and 0xF7 chars)
+ *
+ * @author  Boris Okner, b.okner@rogers.com
+ * @version $Id: RussianLetterTokenizer.java,v 1.2 2002/12/12 05:10:11 otis Exp $
+ */
+
+public class RussianLetterTokenizer extends CharTokenizer
+{
+    /** Construct a new LetterTokenizer. */
+    private char[] charset;
+
+       public RussianLetterTokenizer(Reader in)
+       {
+               super(in);
+               charset = RussianCharsets.CP1251;
+       }
+
+       public RussianLetterTokenizer(Reader in, char[] charset)
+       {
+               super(in);
+               this.charset = charset;
+       }
+
+    /**
+     * Collects only characters which satisfy
+     * {@link Character#isLetter(char)}.
+     */
+    protected boolean isTokenChar(char c)
+    {
+        if (Character.isLetter(c))
+            return true;
+        for (int i = 0; i < charset.length; i++)
+        {
+            if (c == charset[i])
+                return true;
+        }
+        return false;
+    }
+}
############################################################
Index: RussianLowerCaseFilter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/ru/RussianLowerCaseFilter.java,v
retrieving revision 1.3
diff -u -r1.3 RussianLowerCaseFilter.java
--- RussianLowerCaseFilter.java 12 Dec 2002 05:10:11 -0000      1.3
+++ RussianLowerCaseFilter.java 17 Mar 2004 11:48:11 -0000
@@ -1,98 +1,104 @@
-package org.apache.lucene.analysis.ru;
-
-/* ====================================================================
- * The Apache Software License, Version 1.1
- *
- * Copyright (c) 2001 The Apache Software Foundation.  All rights
- * reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *
- * 3. The end-user documentation included with the redistribution,
- *    if any, must include the following acknowledgment:
- *       "This product includes software developed by the
- *        Apache Software Foundation (http://www.apache.org/)."
- *    Alternately, this acknowledgment may appear in the software itself,
- *    if and wherever such third-party acknowledgments normally appear.
- *
- * 4. The names "Apache" and "Apache Software Foundation" and
- *    "Apache Lucene" must not be used to endorse or promote products
- *    derived from this software without prior written permission. For
- *    written permission, please contact apache@apache.org.
- *
- * 5. Products derived from this software may not be called "Apache",
- *    "Apache Lucene", nor may "Apache" appear in their name, without
- *    prior written permission of the Apache Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
- * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- * ====================================================================
- *
- * This software consists of voluntary contributions made by many
- * individuals on behalf of the Apache Software Foundation.  For more
- * information on the Apache Software Foundation, please see
- * <http://www.apache.org/>.
- */
-
-import org.apache.lucene.analysis.TokenFilter;
-import org.apache.lucene.analysis.Token;
-import org.apache.lucene.analysis.TokenStream;
-
-/**
- * Normalizes token text to lower case, analyzing given ("russian") charset.
- *
- * @author  Boris Okner, b.okner@rogers.com
- * @version $Id: RussianLowerCaseFilter.java,v 1.3 2002/12/12 05:10:11 otis Exp $
- */
-public final class RussianLowerCaseFilter extends TokenFilter
-{
-    char[] charset;
-
-    public RussianLowerCaseFilter(TokenStream in, char[] charset)
-    {
-        super(in);
-        this.charset = charset;
-    }
-
-    public final Token next() throws java.io.IOException
-    {
-        Token t = input.next();
-
-        if (t == null)
-            return null;
-
-        String txt = t.termText();
-
-        char[] chArray = txt.toCharArray();
-        for (int i = 0; i < chArray.length; i++)
-        {
-            chArray[i] = RussianCharsets.toLowerCase(chArray[i], charset);
-        }
-
-        String newTxt = new String(chArray);
-        // create new token
-        Token newToken = new Token(newTxt, t.startOffset(), t.endOffset());
-
-        return newToken;
-    }
-}
+package org.apache.lucene.analysis.ru;
+
+/* ====================================================================
+ * The Apache Software License, Version 1.1
+ *
+ * Copyright (c) 2001 The Apache Software Foundation.  All rights
+ * reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *
+ * 3. The end-user documentation included with the redistribution,
+ *    if any, must include the following acknowledgment:
+ *       "This product includes software developed by the
+ *        Apache Software Foundation (http://www.apache.org/)."
+ *    Alternately, this acknowledgment may appear in the software itself,
+ *    if and wherever such third-party acknowledgments normally appear.
+ *
+ * 4. The names "Apache" and "Apache Software Foundation" and
+ *    "Apache Lucene" must not be used to endorse or promote products
+ *    derived from this software without prior written permission. For
+ *    written permission, please contact apache@apache.org.
+ *
+ * 5. Products derived from this software may not be called "Apache",
+ *    "Apache Lucene", nor may "Apache" appear in their name, without
+ *    prior written permission of the Apache Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
+ * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ * ====================================================================
+ *
+ * This software consists of voluntary contributions made by many
+ * individuals on behalf of the Apache Software Foundation.  For more
+ * information on the Apache Software Foundation, please see
+ * <http://www.apache.org/>.
+ */
+
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.Token;
+import org.apache.lucene.analysis.TokenStream;
+
+/**
+ * Normalizes token text to lower case, analyzing given ("russian") charset.
+ *
+ * @author  Boris Okner, b.okner@rogers.com
+ * @version $Id: RussianLowerCaseFilter.java,v 1.3 2002/12/12 05:10:11 otis Exp $
+ */
+public final class RussianLowerCaseFilter extends TokenFilter
+{
+    char[] charset;
+
+       public RussianLowerCaseFilter(TokenStream in)
+       {
+               super(in);
+               charset = RussianCharsets.CP1251;
+       }
+
+    public RussianLowerCaseFilter(TokenStream in, char[] charset)
+    {
+        super(in);
+        this.charset = charset;
+    }
+
+    public final Token next() throws java.io.IOException
+    {
+        Token t = input.next();
+
+        if (t == null)
+            return null;
+
+        String txt = t.termText();
+
+        char[] chArray = txt.toCharArray();
+        for (int i = 0; i < chArray.length; i++)
+        {
+            chArray[i] = RussianCharsets.toLowerCase(chArray[i], charset);
+        }
+
+        String newTxt = new String(chArray);
+        // create new token
+        Token newToken = new Token(newTxt, t.startOffset(), t.endOffset());
+
+        return newToken;
+    }
+}
##########################################################
Index: RussianStemFilter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/analysis/ru/RussianStemFilter.java,v
retrieving revision 1.4
diff -u -r1.4 RussianStemFilter.java
--- RussianStemFilter.java      29 Jan 2003 17:18:53 -0000      1.4
+++ RussianStemFilter.java      17 Mar 2004 11:49:37 -0000
@@ -1,115 +1,121 @@
-package org.apache.lucene.analysis.ru;
-
-/* ====================================================================
- * The Apache Software License, Version 1.1
- *
- * Copyright (c) 2001 The Apache Software Foundation.  All rights
- * reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *
- * 3. The end-user documentation included with the redistribution,
- *    if any, must include the following acknowledgment:
- *       "This product includes software developed by the
- *        Apache Software Foundation (http://www.apache.org/)."
- *    Alternately, this acknowledgment may appear in the software itself,
- *    if and wherever such third-party acknowledgments normally appear.
- *
- * 4. The names "Apache" and "Apache Software Foundation" and
- *    "Apache Lucene" must not be used to endorse or promote products
- *    derived from this software without prior written permission. For
- *    written permission, please contact apache@apache.org.
- *
- * 5. Products derived from this software may not be called "Apache",
- *    "Apache Lucene", nor may "Apache" appear in their name, without
- *    prior written permission of the Apache Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
- * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- * ====================================================================
- *
- * This software consists of voluntary contributions made by many
- * individuals on behalf of the Apache Software Foundation.  For more
- * information on the Apache Software Foundation, please see
- * <http://www.apache.org/>.
- */
-
-import org.apache.lucene.analysis.Token;
-import org.apache.lucene.analysis.TokenFilter;
-import org.apache.lucene.analysis.TokenStream;
-import java.io.IOException;
-
-/**
- * A filter that stems Russian words. The implementation was inspired by GermanStemFilter.
- * The input should be filtered by RussianLowerCaseFilter before passing it to RussianStemFilter ,
- * because RussianStemFilter only works  with lowercase part of any "russian" charset.
- *
- * @author    Boris Okner, b.okner@rogers.com
- * @version   $Id: RussianStemFilter.java,v 1.4 2003/01/29 17:18:53 otis Exp $
- */
-public final class RussianStemFilter extends TokenFilter
-{
-    /**
-     * The actual token in the input stream.
-     */
-    private Token token = null;
-    private RussianStemmer stemmer = null;
-
-    public RussianStemFilter(TokenStream in, char[] charset)
-    {
-        super(in);
-        stemmer = new RussianStemmer(charset);
-    }
-
-    /**
-     * @return  Returns the next token in the stream, or null at EOS
-     */
-    public final Token next() throws IOException
-    {
-        if ((token = input.next()) == null)
-        {
-            return null;
-        }
-        else
-        {
-            String s = stemmer.stem(token.termText());
-            if (!s.equals(token.termText()))
-            {
-                return new Token(s, token.startOffset(), token.endOffset(),
-                    token.type());
-            }
-            return token;
-        }
-    }
-
-    /**
-     * Set a alternative/custom RussianStemmer for this filter.
-     */
-    public void setStemmer(RussianStemmer stemmer)
-    {
-        if (stemmer != null)
-        {
-            this.stemmer = stemmer;
-        }
-    }
-}
+package org.apache.lucene.analysis.ru;
+
+/* ====================================================================
+ * The Apache Software License, Version 1.1
+ *
+ * Copyright (c) 2001 The Apache Software Foundation.  All rights
+ * reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *
+ * 3. The end-user documentation included with the redistribution,
+ *    if any, must include the following acknowledgment:
+ *       "This product includes software developed by the
+ *        Apache Software Foundation (http://www.apache.org/)."
+ *    Alternately, this acknowledgment may appear in the software itself,
+ *    if and wherever such third-party acknowledgments normally appear.
+ *
+ * 4. The names "Apache" and "Apache Software Foundation" and
+ *    "Apache Lucene" must not be used to endorse or promote products
+ *    derived from this software without prior written permission. For
+ *    written permission, please contact apache@apache.org.
+ *
+ * 5. Products derived from this software may not be called "Apache",
+ *    "Apache Lucene", nor may "Apache" appear in their name, without
+ *    prior written permission of the Apache Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
+ * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ * ====================================================================
+ *
+ * This software consists of voluntary contributions made by many
+ * individuals on behalf of the Apache Software Foundation.  For more
+ * information on the Apache Software Foundation, please see
+ * <http://www.apache.org/>.
+ */
+
+import org.apache.lucene.analysis.Token;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import java.io.IOException;
+
+/**
+ * A filter that stems Russian words. The implementation was inspired by GermanStemFilter.
+ * The input should be filtered by RussianLowerCaseFilter before passing it to RussianStemFilter ,
+ * because RussianStemFilter only works  with lowercase part of any "russian" charset.
+ *
+ * @author    Boris Okner, b.okner@rogers.com
+ * @version   $Id: RussianStemFilter.java,v 1.4 2003/01/29 17:18:53 otis Exp $
+ */
+public final class RussianStemFilter extends TokenFilter
+{
+    /**
+     * The actual token in the input stream.
+     */
+    private Token token = null;
+    private RussianStemmer stemmer = null;
+
+       public RussianStemFilter(TokenStream in)
+       {
+               super(in);
+               stemmer = new RussianStemmer(RussianCharsets.CP1251);
+       }
+
+       public RussianStemFilter(TokenStream in, char[] charset)
+       {
+               super(in);
+               stemmer = new RussianStemmer(charset);
+       }
+
+    /**
+     * @return  Returns the next token in the stream, or null at EOS
+     */
+    public final Token next() throws IOException
+    {
+        if ((token = input.next()) == null)
+        {
+            return null;
+        }
+        else
+        {
+            String s = stemmer.stem(token.termText());
+            if (!s.equals(token.termText()))
+            {
+                return new Token(s, token.startOffset(), token.endOffset(),
+                    token.type());
+            }
+            return token;
+        }
+    }
+
+    /**
+     * Set a alternative/custom RussianStemmer for this filter.
+     */
+    public void setStemmer(RussianStemmer stemmer)
+    {
+        if (stemmer != null)
+        {
+            this.stemmer = stemmer;
+        }
+    }
+}
##################################################
Index: TestRussianAnalyzer.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/analysis/ru/TestRussianAnalyzer.java,v
retrieving revision 1.5
diff -u -r1.5 TestRussianAnalyzer.java
--- TestRussianAnalyzer.java    20 Oct 2003 18:07:57 -0000      1.5
+++ TestRussianAnalyzer.java    17 Mar 2004 11:42:20 -0000
@@ -1,208 +1,243 @@
-package org.apache.lucene.analysis.ru;
-
-/* ====================================================================
- * The Apache Software License, Version 1.1
- *
- * Copyright (c) 2001 The Apache Software Foundation.  All rights
- * reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *
- * 3. The end-user documentation included with the redistribution,
- *    if any, must include the following acknowledgment:
- *       "This product includes software developed by the
- *        Apache Software Foundation (http://www.apache.org/)."
- *    Alternately, this acknowledgment may appear in the software itself,
- *    if and wherever such third-party acknowledgments normally appear.
- *
- * 4. The names "Apache" and "Apache Software Foundation" and
- *    "Apache Lucene" must not be used to endorse or promote products
- *    derived from this software without prior written permission. For
- *    written permission, please contact apache@apache.org.
- *
- * 5. Products derived from this software may not be called "Apache",
- *    "Apache Lucene", nor may "Apache" appear in their name, without
- *    prior written permission of the Apache Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
- * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- * ====================================================================
- *
- * This software consists of voluntary contributions made by many
- * individuals on behalf of the Apache Software Foundation.  For more
- * information on the Apache Software Foundation, please see
- * <http://www.apache.org/>.
- */
-
-import junit.framework.TestCase;
-
-import java.io.*;
-
-import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.Token;
-
-/**
- * Test case for RussianAnalyzer.
- *
- * @author    Boris Okner
- * @version   $Id: TestRussianAnalyzer.java,v 1.5 2003/10/20 18:07:57 ehatcher Exp $
- */
-
-public class TestRussianAnalyzer extends TestCase
-{
-    private InputStreamReader inWords;
-
-    private InputStreamReader sampleUnicode;
-
-    private Reader inWordsKOI8;
-
-    private Reader sampleKOI8;
-
-    private Reader inWords1251;
-
-    private Reader sample1251;
-
-    private File dataDir;
-
-    protected void setUp() throws Exception
-    {
-      dataDir = new File(System.getProperty("dataDir"));
-    }
-
-    public void testUnicode() throws IOException
-    {
-        RussianAnalyzer ra = new RussianAnalyzer(RussianCharsets.UnicodeRussian);
-        inWords =
-            new InputStreamReader(
-                new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/testUnicode.txt")),
-                "Unicode");
-
-        sampleUnicode =
-            new InputStreamReader(
-                new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/resUnicode.htm")),
-                "Unicode");
-
-        TokenStream in = ra.tokenStream("all", inWords);
-
-        RussianLetterTokenizer sample =
-            new RussianLetterTokenizer(
-                sampleUnicode,
-                RussianCharsets.UnicodeRussian);
-
-        for (;;)
-        {
-            Token token = in.next();
-
-            if (token == null)
-            {
-                break;
-            }
-
-            Token sampleToken = sample.next();
-            assertEquals(
-                "Unicode",
-                token.termText(),
-                sampleToken == null
-                ? null
-                : sampleToken.termText());
-        }
-
-        inWords.close();
-        sampleUnicode.close();
-    }
-
-    public void testKOI8() throws IOException
-    {
-        //System.out.println(new java.util.Date());
-        RussianAnalyzer ra = new RussianAnalyzer(RussianCharsets.KOI8);
-        // KOI8
-        inWordsKOI8 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/testKOI8.txt")), "iso-8859-1");
-
-        sampleKOI8 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/resKOI8.htm")), "iso-8859-1");
-
-        TokenStream in = ra.tokenStream("all", inWordsKOI8);
-        RussianLetterTokenizer sample =
-            new RussianLetterTokenizer(
-                sampleKOI8,
-                RussianCharsets.KOI8);
-
-        for (;;)
-        {
-            Token token = in.next();
-
-            if (token == null)
-            {
-                break;
-            }
-
-            Token sampleToken = sample.next();
-            assertEquals(
-                "KOI8",
-                token.termText(),
-                sampleToken == null
-                ? null
-                : sampleToken.termText());
-
-        }
-
-        inWordsKOI8.close();
-        sampleKOI8.close();
-    }
-
-    public void test1251() throws IOException
-    {
-        // 1251
-        inWords1251 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/test1251.txt")), "iso-8859-1");
-
-        sample1251 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/res1251.htm")), "iso-8859-1");
-
-        RussianAnalyzer ra = new RussianAnalyzer(RussianCharsets.CP1251);
-        TokenStream in = ra.tokenStream("", inWords1251);
-        RussianLetterTokenizer sample =
-            new RussianLetterTokenizer(
-                sample1251,
-                RussianCharsets.CP1251);
-
-        for (;;)
-        {
-            Token token = in.next();
-
-            if (token == null)
-            {
-                break;
-            }
-
-            Token sampleToken = sample.next();
-            assertEquals(
-                "1251",
-                token.termText(),
-                sampleToken == null
-                ? null
-                : sampleToken.termText());
-
-        }
-
-        inWords1251.close();
-        sample1251.close();
-    }
-}
+package org.apache.lucene.analysis.ru;
+
+/* ====================================================================
+ * The Apache Software License, Version 1.1
+ *
+ * Copyright (c) 2001 The Apache Software Foundation.  All rights
+ * reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *
+ * 3. The end-user documentation included with the redistribution,
+ *    if any, must include the following acknowledgment:
+ *       "This product includes software developed by the
+ *        Apache Software Foundation (http://www.apache.org/)."
+ *    Alternately, this acknowledgment may appear in the software itself,
+ *    if and wherever such third-party acknowledgments normally appear.
+ *
+ * 4. The names "Apache" and "Apache Software Foundation" and
+ *    "Apache Lucene" must not be used to endorse or promote products
+ *    derived from this software without prior written permission. For
+ *    written permission, please contact apache@apache.org.
+ *
+ * 5. Products derived from this software may not be called "Apache",
+ *    "Apache Lucene", nor may "Apache" appear in their name, without
+ *    prior written permission of the Apache Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
+ * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ * ====================================================================
+ *
+ * This software consists of voluntary contributions made by many
+ * individuals on behalf of the Apache Software Foundation.  For more
+ * information on the Apache Software Foundation, please see
+ * <http://www.apache.org/>.
+ */
+
+import junit.framework.TestCase;
+
+import java.io.*;
+
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.Token;
+
+/**
+ * Test case for RussianAnalyzer.
+ *
+ * @author    Boris Okner
+ * @version   $Id: TestRussianAnalyzer.java,v 1.5 2003/10/20 18:07:57 ehatcher Exp $
+ */
+
+public class TestRussianAnalyzer extends TestCase
+{
+    private InputStreamReader inWords;
+
+    private InputStreamReader sampleUnicode;
+
+    private Reader inWordsKOI8;
+
+    private Reader sampleKOI8;
+
+    private Reader inWords1251;
+
+    private Reader sample1251;
+
+    private File dataDir;
+
+    protected void setUp() throws Exception
+    {
+      dataDir = new File(System.getProperty("dataDir"));
+    }
+
+    public void testUnicode() throws IOException
+    {
+        RussianAnalyzer ra = new RussianAnalyzer(RussianCharsets.UnicodeRussian);
+        inWords =
+            new InputStreamReader(
+                new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/testUnicode.txt")),
+                "Unicode");
+
+        sampleUnicode =
+            new InputStreamReader(
+                new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/resUnicode.htm")),
+                "Unicode");
+
+        TokenStream in = ra.tokenStream("all", inWords);
+
+        RussianLetterTokenizer sample =
+            new RussianLetterTokenizer(
+                sampleUnicode,
+                RussianCharsets.UnicodeRussian);
+
+        for (;;)
+        {
+            Token token = in.next();
+
+            if (token == null)
+            {
+                break;
+            }
+
+            Token sampleToken = sample.next();
+            assertEquals(
+                "Unicode",
+                token.termText(),
+                sampleToken == null
+                ? null
+                : sampleToken.termText());
+        }
+
+        inWords.close();
+        sampleUnicode.close();
+    }
+
+    public void testKOI8() throws IOException
+    {
+        //System.out.println(new java.util.Date());
+        RussianAnalyzer ra = new RussianAnalyzer(RussianCharsets.KOI8);
+        // KOI8
+        inWordsKOI8 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/testKOI8.txt")), "iso-8859-1");
+
+        sampleKOI8 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/resKOI8.htm")), "iso-8859-1");
+
+        TokenStream in = ra.tokenStream("all", inWordsKOI8);
+        RussianLetterTokenizer sample =
+            new RussianLetterTokenizer(
+                sampleKOI8,
+                RussianCharsets.KOI8);
+
+        for (;;)
+        {
+            Token token = in.next();
+
+            if (token == null)
+            {
+                break;
+            }
+
+            Token sampleToken = sample.next();
+            assertEquals(
+                "KOI8",
+                token.termText(),
+                sampleToken == null
+                ? null
+                : sampleToken.termText());
+
+        }
+
+        inWordsKOI8.close();
+        sampleKOI8.close();
+    }
+
+       public void test1251() throws IOException
+       {
+               // 1251
+               inWords1251 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/test1251.txt")), "iso-8859-1");
+
+               sample1251 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/res1251.htm")), "iso-8859-1");
+
+               RussianAnalyzer ra = new RussianAnalyzer(RussianCharsets.CP1251);
+               TokenStream in = ra.tokenStream("", inWords1251);
+               RussianLetterTokenizer sample =
+                       new RussianLetterTokenizer(
+                               sample1251,
+                               RussianCharsets.CP1251);
+
+               for (;;)
+               {
+                       Token token = in.next();
+
+                       if (token == null)
+                       {
+                               break;
+                       }
+
+                       Token sampleToken = sample.next();
+                       assertEquals(
+                               "1251",
+                               token.termText(),
+                               sampleToken == null
+                               ? null
+                               : sampleToken.termText());
+
+               }
+
+               inWords1251.close();
+               sample1251.close();
+       }
+       public void test() throws IOException
+       {
+               // 1251
+               inWords1251 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/test1251.txt")), "iso-8859-1");
+
+               sample1251 = new InputStreamReader(new FileInputStream(new File(dataDir, "/org/apache/lucene/analysis/ru/res1251.htm")), "iso-8859-1");
+
+               RussianAnalyzer ra = new RussianAnalyzer();
+               TokenStream in = ra.tokenStream("", inWords1251);
+               RussianLetterTokenizer sample =
+                       new RussianLetterTokenizer(
+                               sample1251);
+
+               for (;;)
+               {
+                       Token token = in.next();
+
+                       if (token == null)
+                       {
+                               break;
+                       }
+
+                       Token sampleToken = sample.next();
+                       assertEquals(
+                               "1251",
+                               token.termText(),
+                               sampleToken == null
+                               ? null
+                               : sampleToken.termText());
+
+               }
+
+               inWords1251.close();
+               sample1251.close();
+       }
+}
#############################################
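
A possible simplification, offered only as a sketch: each new one-argument constructor in the patches above repeats the body of the two-argument one. It could instead delegate with this(...), so the CP1251 default is stated in one place. The classes below are hypothetical stand-ins and are not Lucene code: SketchFilter plays the role of a filter such as RussianLowerCaseFilter, the two char[] constants play the role of the RussianCharsets tables, and the TokenStream argument is left out to keep the example self-contained.

```java
public class DefaultCharsetSketch {
    // Hypothetical stand-ins for the RussianCharsets tables (assumption:
    // the real tables are char[] constants, as the patch uses them).
    public static final char[] CP1251 = "cp1251-table".toCharArray();
    public static final char[] KOI8 = "koi8-table".toCharArray();

    // Hypothetical stand-in for a filter such as RussianLowerCaseFilter.
    public static class SketchFilter {
        private final char[] charset;

        // Convenience constructor: delegates, so the default is named once.
        public SketchFilter() {
            this(CP1251);
        }

        // Full constructor: the only place where the field is assigned.
        public SketchFilter(char[] charset) {
            this.charset = charset;
        }

        public char[] getCharset() {
            return charset;
        }
    }

    public static void main(String[] args) {
        // Both comparisons are by reference and print "true".
        System.out.println(new SketchFilter().getCharset() == CP1251);
        System.out.println(new SketchFilter(KOI8).getCharset() == KOI8);
    }
}
```

With this shape, changing the default later would touch a single line per class.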

I apologize if my first message contained a virus. I work on Linux, 
but it is still possible...

Regards,
Vladimir Yuryev.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

