lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: RussianAnalyzer
Date Mon, 16 Sep 2002 02:54:00 GMT
Hello,

I put the code in the CVS.
I still have to put the unit tests and unit test data in.
Damn, I just realized that I didn't even run the tests before putting
code in CVS.  I hope it all works :)

Thank you,
Otis


--- Boris Okner <b.okner@rogers.com> wrote:
> This is my contribution to the Lucene project.
> 
> RussianAnalyzer v. 1.0 (attachment:russianLucene.zip)
> 
> RussianAnalyzer implements org.apache.lucene.analysis.Analyzer and is
> designed to support indexing/search capabilities for Cyrillic in
> Lucene. Currently, 3 encoding schemes can be used out of the box:
> Unicode, KOI8 and CP1251. For those who want to use other encoding
> schemes, even custom ones, please look at the RussianCharsets class - it
> should be very straightforward to add any encoding.
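
For readers on a current JDK, the same three encodings can also be handled with the standard java.nio.charset API instead of hand-written tables. The sketch below is not the code from the attachment and does not show how the charsets class above works internally. The class name CyrillicDecodeSketch and its decode method are hypothetical, and it assumes the JDK in use provides the KOI8-R and windows-1251 charsets. It only illustrates that one Cyrillic word has different bytes in each encoding but decodes to the same Unicode string.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical illustration; not part of russianLucene.zip.
final class CyrillicDecodeSketch {

    // Charset names as registered in the JDK; their availability is assumed.
    static final Charset KOI8 = Charset.forName("KOI8-R");
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Decode raw bytes in the given legacy encoding to a Unicode string.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // A three-letter Cyrillic word, written with escapes so the
        // encoding of this source file does not matter.
        String word = "\u043c\u0438\u0440";

        byte[] koi8Bytes = word.getBytes(KOI8);
        byte[] cp1251Bytes = word.getBytes(CP1251);

        // The byte sequences differ between the two encodings...
        System.out.println("same bytes: " + Arrays.equals(koi8Bytes, cp1251Bytes));
        // ...but each decodes back to the same Unicode text.
        System.out.println("KOI8 round trip ok: " + decode(koi8Bytes, KOI8).equals(word));
        System.out.println("CP1251 round trip ok: " + decode(cp1251Bytes, CP1251).equals(word));
    }
}
```

Once the bytes are decoded this way, the analyzer only ever has to deal with Unicode text.
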
> 
> RussianAnalyzer uses RussianStemFilter, based on the algorithm described
> at Snowball's site (http://snowball.sourceforge.net), and also
> StopFilter with Russian stop-words. I was never able to find a
> comprehensive list of stop-words, so please feel free to add whatever
> stop words you find missing.
> 
> There are 2 JUnit test cases:
> 
> 1) RussianStemTest, designed to test stemming. It takes a sample
> Russian vocabulary (wordsUnicode), produces a stem for each word, and
> then compares it to the stem from the stemmed version of the
> vocabulary (stemsUnicode.txt). The vocabulary and its stemmed version
> were taken from Snowball's site (they contain more than 49000 words
> and stems), so passing the test means that the implementation of the
> stemming algorithm is consistent with Snowball's description.
> 
> 2) RussianAnalyzerTest contains 3 tests to check RussianAnalyzer on
> Unicode, KOI8 and CP1251. For each test it takes the appropriate input
> (testUnicode.txt, testKOI8.txt and test1251.txt) and produces tokens
> that then get verified one by one against the expected results (placed
> respectively in resUnicode.htm, resKOI8.htm and res1251.htm).
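
Both test cases come down to the same check: run each input through the code under test and compare the result with the expected value at the same position. Here is a minimal sketch of that check, assuming the inputs and expected values have already been read into lists. It is not the test code from the attachment. The names ParallelListCheckSketch, mismatches and toyStem are hypothetical, and toyStem is a deliberately trivial placeholder, not the Snowball Russian algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration; not the test code from russianLucene.zip.
final class ParallelListCheckSketch {

    // Applies `transform` to each input and returns the indexes where the
    // result differs from the expected value at the same position.
    static List<Integer> mismatches(List<String> inputs,
                                    List<String> expected,
                                    UnaryOperator<String> transform) {
        if (inputs.size() != expected.size()) {
            throw new IllegalArgumentException("input and expected lists differ in length");
        }
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < inputs.size(); i++) {
            if (!transform.apply(inputs.get(i)).equals(expected.get(i))) {
                bad.add(i);
            }
        }
        return bad;
    }

    // Placeholder "stemmer": strips one trailing 's'.
    // This is NOT the Snowball Russian algorithm.
    static String toyStem(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cats", "dog", "buses");
        List<String> stems = Arrays.asList("cat", "dog", "bus");
        // Index 2 is reported: the toy stemmer yields "buse", not "bus".
        System.out.println(mismatches(words, stems, ParallelListCheckSketch::toyStem));
    }
}
```

An empty result means every produced value matched its expected counterpart, which is the passing condition described for both tests.
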
> 
> 
> To run the test cases:
> 
> 1) Unzip russianLucene.zip to any directory
> 
> 2) From the command line, cd to the directory from 1), and run (adjusting
> the path to your junit.jar and lucene.jar):
> 
>  java -cp .;junit_37.jar;lucene-1.2.jar ca.oksphere.lucene.RussianAnalyzerTest
> 
>  java -cp .;junit_37.jar ca.oksphere.lucene.RussianStemTest
> 
> 
> 
> That's pretty much it. I hope you'll enjoy it. If you have any
> questions/comments etc., please send me a message at:
> 
> b.okner@rogers.com
> 
> Boris Okner
> 
> ____________________________________________________________________
> 
> LEGAL STUFF:
> 
> *
> * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
> * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
> * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
> * DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR ITS CONTRIBUTORS BE
> * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
> * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
> * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
> * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
> * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
> * OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
> * IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> *
> ====================================================================
> 
> 
> 

> ATTACHMENT part 2 application/x-zip-compressed name=russianLucene.zip





