lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Bundling Snowball
Date Wed, 10 Nov 2010 19:19:14 GMT
On Wed, Nov 10, 2010 at 12:10:51PM -0500, Robert Muir wrote:
> One more note that I forgot to mention: in snowball's svn (but i think not
> in the libstemmer pkg) there is actually vocabulary test data: input files
> containing a sample vocabulary for each language, expected output, and
> combined files called 'diffs' that show what the stemmer changes.
> 
> these provide pretty good coverage for tests to ensure your
> integration is working... when they make a change to the algorithms
> these are updated too (though it seems not always in the same commit):
> 
> example: http://svn.tartarus.org/snowball/trunk/data/german/diffs.txt?r1=527&r2=526&pathrev=527

I used this sample data to prepare tests for the Lingua::Stem::Snowball CPAN
distribution.  Now that we are bundling the Snowball C libraries, we no longer
benefit by proxy from that test suite, so we should roll our own tests.

Yesterday, I adapted the update_snowstem.pl script in
<https://issues.apache.org/jira/browse/LUCY-125> to work from an svn checkout
of Snowball; I committed the patches and closed the issue this morning.

Next, I'll add test data generation to update_snowstem.pl and add new test
files for each language to validate that our stemmers work properly.
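
A rough sketch of what such a per-language check could look like, in Python
for illustration only (update_snowstem.pl itself is Perl).  It assumes the
sample data comes as two parallel line lists, one vocabulary word per line and
one expected stem per line; the names read_pairs and find_mismatches and the
stem_func callback are hypothetical, not part of Lucy or Snowball:

```python
from itertools import zip_longest


def read_pairs(voc_lines, output_lines):
    """Pair each vocabulary word with its expected stem, line by line."""
    pairs = []
    for word, stem in zip_longest(voc_lines, output_lines):
        # A missing entry on either side means the two files are out of sync.
        if word is None or stem is None:
            raise ValueError("vocabulary and expected output differ in length")
        pairs.append((word.strip(), stem.strip()))
    return pairs


def find_mismatches(stem_func, pairs):
    """Return (word, expected, actual) for every word stemmed incorrectly."""
    mismatches = []
    for word, expected in pairs:
        actual = stem_func(word)  # stem_func stands in for the real stemmer
        if actual != expected:
            mismatches.append((word, expected, actual))
    return mismatches
```

An empty list from find_mismatches would mean the bundled stemmer agrees with
the sample data for that language.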

Thanks for bringing it up!

Marvin Humphrey
