lucene-java-user mailing list archives

From "Vladimir Yuryev"
Subject Re: Writing a stemmer
Date Mon, 07 Jun 2004 04:24:21 GMT
On Sat, 05 Jun 2004 21:15:23 +0200
  Andrzej Bialecki wrote:
>Vladimir Yuryev wrote:
>> Hi, Andrzej!
>> How did you test the Polish texts, and with which stemmer?
>> Thanks,
>> Vladimir.
>>> No reason to be too modest, Leo.. I tested your stemmer on English, 
>>> Swedish and Polish texts (including F-measure vs. training set size 
>>> plots), and it works exceptionally well indeed. Highly recommended!
>Well, I have several Polish-language corpora, which together 
>amount to roughly 90,000 words (nouns and verbs), each having at 
>least 4 inflected forms. This set is randomized (i.e. lines of 
>words + forms are in random order). I've split this into two parts: 
>one of a fixed size, as a test set, and one of variable size, as a 
>training set. Then I compile stemmer tables using a variable number 
>of training examples and different settings (trie, multi-trie, 
>various optimizations, etc.). Then for each output table I test the 
>precision/recall of correct base forms (lemmatization), and of the 
>ability to create unique stems (stemming). Finally, I select the 
>"best" table, which gives reasonably good results vs. table size. To 
>put it in plain terms, e.g. for tables roughly 300kB in size (created 
>from a training set of 3000 unique words + their forms), in the best 
>cases I get ~90% correct stems and ~70% correct lemmas. Which is a 
>_very_ good result!
>Best regards,
>Andrzej Bialecki
Thanks for the detailed description of how you tested the Polish 
texts. It is very useful to me.
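
For anyone who wants to repeat this kind of test, here is a rough 
sketch of the procedure as I understand it from the description above: 
shuffle the corpus, hold out a fixed-size test set, train on 
progressively larger slices of the rest, and score each result. It is 
not the actual test harness. All function names are made up for 
illustration, train_toy is only a trivial suffix stripper standing in 
for the real table compiler, and counting a "correct stem" as "the form 
gets the same output as its base form" is my own reading of the stemming 
measure.

```python
import random


def split_corpus(entries, test_size, seed=0):
    """Shuffle (lemma, forms) entries and hold out a fixed-size test set."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:test_size], shuffled[test_size:]


def evaluate(stem, test_set):
    """Return (stem_score, lemma_score) as fractions of inflected forms.

    lemma_score: the output equals the dictionary base form.
    stem_score:  the output equals the output for the base form, i.e. the
                 form is conflated with its lemma (assumed interpretation).
    """
    total = stems_ok = lemmas_ok = 0
    for lemma, forms in test_set:
        target = stem(lemma)
        for form in forms:
            out = stem(form)
            total += 1
            stems_ok += out == target
            lemmas_ok += out == lemma
    if total == 0:
        return 0.0, 0.0
    return stems_ok / total, lemmas_ok / total


def train_toy(pairs):
    """Hypothetical stand-in trainer: collect the suffixes seen in training."""
    suffixes = set()
    for lemma, forms in pairs:
        for form in forms:
            k = 0
            while k < min(len(lemma), len(form)) and lemma[k] == form[k]:
                k += 1
            if form[k:]:
                suffixes.add(form[k:])
    ordered = sorted(suffixes, key=len, reverse=True)

    def stem(word):
        # Strip the longest known suffix, never the whole word.
        for suffix in ordered:
            if len(word) > len(suffix) and word.endswith(suffix):
                return word[: -len(suffix)]
        return word

    return stem


def learning_curve(train, entries, test_size, train_sizes, seed=0):
    """Score one trained stemmer per training-set size on the same test set."""
    test_set, pool = split_corpus(entries, test_size, seed)
    results = []
    for n in train_sizes:
        stem = train(pool[:n])
        results.append((n, *evaluate(stem, test_set)))
    return results
```

Calling learning_curve(train_toy, corpus, test_size, [100, 500, 3000]) 
on a list of (lemma, [forms]) pairs gives one (size, stem_score, 
lemma_score) row per training size, which is the data needed for a 
score vs. training set size plot. Picking the "best" table would also 
require recording the size of each compiled table, which this sketch 
does not model.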
