lucene-dev mailing list archives

From "Boris Okner" <b.ok...@rogers.com>
Subject RussianAnalyzer
Date Wed, 21 Aug 2002 11:56:03 GMT
This is my contribution to the Lucene project.

RussianAnalyzer v. 1.0 (attachment:russianLucene.zip)

RussianAnalyzer implements org.apache.lucene.analysis.Analyzer and is designed to support indexing
and search of Cyrillic text in Lucene. Currently, three encoding schemes can be used out of the
box: Unicode, KOI8 and CP1251. If you want to use another encoding scheme, even a custom
one, please look at the RussianCharsets class; it should be very straightforward to add any encoding.
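To illustrate why the encoding matters, here is a minimal sketch that is not taken from russianLucene.zip. It uses the JDK's own charset support instead of the RussianCharsets tables, and it assumes that "KOI8" means KOI8-R and that the JRE ships the KOI8-R and windows-1251 charsets. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class CyrillicDecodingSketch {

    /** Decodes raw bytes into a Unicode string using the named charset. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The Russian word "luna" (moon), written with Unicode escapes so the
        // source file itself does not depend on any particular encoding.
        String word = "\u043b\u0443\u043d\u0430";

        // The same word yields different byte sequences in each encoding.
        byte[] koi8 = word.getBytes(Charset.forName("KOI8-R"));
        byte[] cp1251 = word.getBytes(Charset.forName("windows-1251"));
        System.out.println("same bytes: " + Arrays.equals(koi8, cp1251));

        // Decoded with the matching charset, both give back the same string.
        System.out.println("same text:  "
                + decode(koi8, "KOI8-R").equals(decode(cp1251, "windows-1251")));
    }
}
```

The point is only that the analyzer has to know which encoding its input uses before it can recognize letters; how RussianCharsets represents each encoding internally may differ from this.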

RussianAnalyzer uses RussianStemFilter, which is based on the algorithm described on the Snowball site (http://snowball.sourceforge.net),
and a StopFilter with Russian stop words. I was never able to find a comprehensive list
of stop words, so please feel free to add any stop words you find missing.
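As a rough illustration of what stop-word filtering does, here is a minimal plain-Java sketch. It does not use the Lucene StopFilter API, and the four stop words in it are only a tiny sample chosen for the example, not the list shipped with RussianAnalyzer. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordSketch {

    // Illustrative sample only: the Russian words for "and", "in", "on", "not".
    static final Set<String> SAMPLE_STOP_WORDS = new HashSet<>(Arrays.asList(
            "\u0438", "\u0432", "\u043d\u0430", "\u043d\u0435"));

    /** Lower-cases the text, splits it on whitespace, and drops stop words. */
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "Kot i dom" ("cat and house"): the middle word is a stop word.
        String text = "\u041a\u043e\u0442 \u0438 \u0434\u043e\u043c";
        System.out.println(filter(text, SAMPLE_STOP_WORDS));
    }
}
```

Extending the stop-word list then amounts to adding entries to the set; in the analyzer itself, the surviving tokens would go on to the stemming step.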

There are two JUnit test cases:

1) RussianStemTest is designed to test stemming. It takes a sample Russian vocabulary (wordsUnicode),
produces a stem for each word, and compares it to the stem in the stemmed version of the vocabulary (stemsUnicode.txt).
The vocabulary and its stemmed version were taken from the Snowball site (they contain more than
49000 words and stems), so a passing test means that the implementation of the stemming algorithm is
consistent with Snowball's description.

2) RussianAnalyzerTest contains three tests that check RussianAnalyzer on Unicode, KOI8 and CP1251. For
each test it takes the appropriate input (testUnicode.txt, testKOI8.txt and test1251.txt) and
produces tokens, which are then verified one by one against the expected results (stored respectively
in resUnicode.htm, resKOI8.htm and res1251.htm).
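The comparison that both test cases rely on, where each produced value is checked against the expected value at the same position, can be sketched as follows. This is not the code from RussianStemTest: the class name, the method name and the toy stemmer are hypothetical stand-ins, and English words are used only to keep the example readable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class StemComparisonSketch {

    /** Returns one message per word whose computed stem differs from the expected stem. */
    static List<String> mismatches(List<String> words,
                                   List<String> expectedStems,
                                   UnaryOperator<String> stemmer) {
        if (words.size() != expectedStems.size()) {
            throw new IllegalArgumentException("word and stem lists differ in length");
        }
        List<String> failures = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String actual = stemmer.apply(words.get(i));
            if (!actual.equals(expectedStems.get(i))) {
                failures.add(words.get(i) + ": expected " + expectedStems.get(i)
                        + " but got " + actual);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stand-in stemmer for illustration only: strips a single trailing "s".
        UnaryOperator<String> toyStemmer =
                w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;

        List<String> words = Arrays.asList("cats", "dogs", "glass");
        List<String> stems = Arrays.asList("cat", "dog", "glass");

        // The toy stemmer over-strips "glass", so one mismatch is reported.
        System.out.println(mismatches(words, stems, toyStemmer));
    }
}
```

In a real run, the two lists would be read line by line from the vocabulary and stem files, and the test would pass only if the list of mismatches came back empty.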


To run the test cases:

1) Unzip russianLucene.zip to any directory.

2) From the command line, cd to the directory from step 1 and run the following (adjusting the paths to your junit.jar
and lucene.jar; the ';' classpath separator is for Windows, so use ':' on Unix):

 java -cp .;junit_37.jar;lucene-1.2.jar ca.oksphere.lucene.RussianAnalyzerTest

 java -cp .;junit_37.jar ca.oksphere.lucene.RussianStemTest



That's pretty much it. I hope you'll enjoy it. If you have any questions or comments, please
send me a message at:

b.okner@rogers.com

Boris Okner

_____________________________________________________________________________________________________

LEGAL STUFF:

*
* THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
* DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR ITS CONTRIBUTORS BE
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
* OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
* OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
* BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
* OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
* IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
* ====================================================================


