incubator-lucy-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: [lucy-dev] Bundling Snowball
Date Tue, 09 Nov 2010 09:51:33 GMT
On Mon, Nov 8, 2010 at 10:01 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> The "semiclean" build target has been added.  I opened
> <https://issues.apache.org/jira/browse/LUCY-125> for bundling the Snowball
> stemming library.  A separate issue will follow for bundling the stoplists.
>

Some quick notes, from lucene-java:
* Are you going to do svn checkouts for bundling Snowball? I don't
think they are really releasing anymore, but there are in fact new
languages, etc. in svn.
* Every so often Snowball changes the rules for the
languages. This can be tricky depending on how you handle backwards
compatibility. In lucene-java we have a checkout of revision 502, plus
the newer languages added (Armenian, Catalan, Basque). If
we fully 'svn updated' to the latest rev, it would change German
stemming relative to our previous release, for example, and be a
hassle for people who created indexes with those older versions.
* When bundling the stoplists: there are some languages, even
"released" ones (Turkish, Romanian, etc.), that don't have
Snowball-included stoplists. If you want, you could use the ones we
have in Lucene to provide stoplists for these languages:

http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt

These are of variable quality. If a file has source information in its
header, it means I found a list clearly marked as BSD or Apache licensed.
If it has no header, it means I made it myself. It might seem
absurd to worry about "licensing" for stopwords, but you never know :)
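
One possible way to handle the backwards-compatibility concern in the second
note is to record, next to each index, which Snowball revision its stemmer
came from, and to check that record before reusing the index. The Python
below is only a minimal sketch of that idea. The file name
stemmer_meta.json, the function names, and the JSON layout are all
hypothetical; nothing here reflects how Lucy or Lucene actually store this.

```python
import json
from pathlib import Path

# Assumption for illustration: the Snowball svn revision the bundled
# stemmers were taken from (502 is the revision mentioned above).
BUNDLED_SNOWBALL_REVISION = 502

# Hypothetical metadata file stored alongside the index.
META_FILE = "stemmer_meta.json"


def write_stemmer_meta(index_dir, language, revision=BUNDLED_SNOWBALL_REVISION):
    """Record which stemmer rules were used when the index was created."""
    meta = {"language": language, "snowball_revision": revision}
    Path(index_dir, META_FILE).write_text(json.dumps(meta))


def check_stemmer_meta(index_dir, language, revision=BUNDLED_SNOWBALL_REVISION):
    """Return True only if the index was built with the same stemmer rules."""
    path = Path(index_dir, META_FILE)
    if not path.exists():
        # No record at all: treat the index as incompatible.
        return False
    meta = json.loads(path.read_text())
    return (
        meta.get("language") == language
        and meta.get("snowball_revision") == revision
    )
```

With something like this, an updated bundle could detect an index built
against older rules and ask for a reindex, instead of silently stemming
queries differently from the indexed terms.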
