lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: Writing a stemmer
Date Sat, 05 Jun 2004 19:15:23 GMT
Vladimir Yuryev wrote:

> Hi, Andjej!
> 
> How you tested the Polish texts with what stemer?
> Thanks,
> Vladimir.
> 
>>
>> No reason to be too modest, Leo.. I tested your stemmer on English, 
>> Swedish and Polish texts (including F-measure vs. training set size 
>> plots), and it works exceptionally well indeed. Highly recommended!

Well, I have several corpora of Polish language, which together amount 
to roughly 90,000 words (nouns and verbs) having at least 4 inflected 
forms. This set is randomized (i.e. lines of words + forms are in random 
order). I've split this into two parts - one of a fixed size, as a test 
set, and one of variable size as a training set. Then I compile stemmer 
tables using variable number of training examples, and using differnt 
settings (trie, multi-trie, different optimizations, etc..). Then for 
each output table I test the precision/recall of correct base forms 
(lemmatization), and of ability to create unique stems (stemming). 
Finally, I select the "best" table, which gives reasonably good results 
vs. table size. To put it in plain terms, e.g. for tables roughly 300kB 
in size (created from training set of 3000 unique words + their forms) 
in best cases I get ~90% of correct stems, and ~70% of correct lemmas. 
Which is a _very_ good result!

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message