incubator-lucy-dev mailing list archives

From: Marvin Humphrey <mar...@rectangular.com>
Subject: Re: [lucy-dev] Bundling Snowball
Date: Tue, 09 Nov 2010 20:53:20 GMT
On Tue, Nov 09, 2010 at 04:51:33AM -0500, Robert Muir wrote:
> Some quick notes, from lucene-java:

Thanks, Robert!  The Lucene analysis components have really tightened up since
you got involved, and I'm pleased that Lucy will get to benefit from your
hard-won knowledge as well.

> * are you going to do svn checkouts for bundling snowball? 

Good plan.  Debian does something similar:

    http://thread.gmane.org/gmane.comp.search.snowball/1191

I'd been working off of libstemmer_c.tgz, but to document my actions and make
them repeatable, I've written a script that transforms the contents of a
source dir (currently the expanded libstemmer_c dir) into the form that we
need.

That script should probably be changed to operate off of an svn checkout from
the Snowball repository -- or perhaps multiple svn checkouts.  That way we can
1) document exactly what revision of the Snowball code we've imported, and 2)
get the most up-to-date and complete complement of languages.
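
For the sake of discussion, the svn-based version of that script might look
something like the sketch below.  The repository URL, revision, and
destination path are placeholders, not decisions.

    #!/usr/bin/env python
    # Hypothetical sketch of the svn-based import: export a pinned revision
    # of the Snowball sources and write down exactly what was imported.
    # The repository URL, revision, and destination path are placeholders.
    import subprocess

    SNOWBALL_URL = "svn://svn.tartarus.org/snowball/trunk"   # assumed location
    REVISION     = "502"                                      # example pin only
    DEST         = "snowball_source"                          # placeholder path

    # "svn export" fetches the files without any .svn metadata.
    subprocess.check_call(["svn", "export", "-r", REVISION, SNOWBALL_URL, DEST])

    # Record the revision so the import is documented and repeatable.
    with open(DEST + "/IMPORT_INFO", "w") as fh:
        fh.write("Imported from %s at r%s\n" % (SNOWBALL_URL, REVISION))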

> I don't think they are really releasing anymore, but there are in fact new
> languages, etc in svn.

It's seriously a pain that the Snowball folks don't do numbered releases.  :(

Back in 2007, Richard Boulton discussed adding revision info to the
libstemmer.h interface, which would let you track the stemmer version, but it
doesn't look like they ever got around to it.

> * every so often snowball makes changes to the rules for the
> languages.. this can be tricky depending on how you handle backwards
> compatibility. In lucene java we have a checkout of revision 502, but
> then with the newer languages added (Armenian, Catalan, Basque)... if
> we fully 'svn updated' to the latest rev it would change things about
> german stemming from our previous release, for example, and be a
> hassle for people who created indexes with those older versions.

The easy answer is to freeze the stemmer version for each language, as you've
done: import once, get it right the first time, and never update again.  Or
at least understand that you are potentially breaking backwards compatibility
if you ever update a language.
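
If we go the svn route, one way to make that freeze explicit would be a small
per-language manifest along these lines -- just a sketch, with r502 borrowed
from your Lucene checkout and everything else left as a placeholder.

    # Hypothetical per-language freeze manifest.  r502 echoes the Lucene
    # checkout mentioned above; the rest would get filled in at import time.
    FROZEN_REVISIONS = {
        "german":   502,    # frozen -- bumping it risks breaking old indexes
        "english":  502,
        "armenian": None,   # newer language: pin at whatever rev it lands at
        "catalan":  None,
        "basque":   None,
    }

    def revision_for(language):
        """Return the pinned Snowball revision, or None if not yet chosen."""
        return FROZEN_REVISIONS.get(language)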

> * when bundling the stoplists: there are some languages, even
> "released" ones (Turkish, Romanian, etc) that don't have
> snowball-included stoplists. if you want, you could use the ones we
> have in lucene to provide stoplists for these languages...
> 
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt
> 
> these are of variable quality: the ones with source information in the
> header mean that I found one clearly marked as BSD or Apache.
> If they have no header, it means I made them myself... it might seem
> absurd to worry about "licensing" for stopwords, but you never know :)

That would be handy, as it would allow us to build a
tokenizer/stopalizer/stemmer stack for each supported Snowball language. 
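
Roughly -- and this is only a sketch -- pulling those lists in and
normalizing them for bundling might look like the following; I'm guessing at
the comment conventions used in those files.

    # Hypothetical helper: fetch the Lucene stopword lists above for the
    # languages that lack Snowball-provided ones, and normalize them for
    # bundling.  The comment-stripping rules ('#' and '|') are assumptions.
    import os
    try:
        from urllib.request import urlopen    # Python 3
    except ImportError:
        from urllib2 import urlopen           # Python 2

    BASE = ("http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/"
            "analysis/common/src/resources/org/apache/lucene/analysis")
    LANGS = ["tr", "ro", "hy", "eu", "ca"]

    if not os.path.isdir("stoplists"):
        os.makedirs("stoplists")

    for lang in LANGS:
        raw = urlopen("%s/%s/stopwords.txt" % (BASE, lang)).read().decode("utf-8")
        words = []
        for line in raw.splitlines():
            # Assume one word per line, with '#' or '|' starting a comment.
            word = line.split("#")[0].split("|")[0].strip()
            if word:
                words.append(word)
        with open("stoplists/%s.txt" % lang, "wb") as fh:
            fh.write(("\n".join(words) + "\n").encode("utf-8"))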

Marvin Humphrey

