lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: snowball discussion on LUCENE-2285
Date Sat, 27 Feb 2010 18:48:33 GMT
Can you open an issue for the new object[]? It's sad about the Hungarian
issue. I'm inclined to think we should add Savoy's and default to it
instead. I don't see this as code duplication, as it's a different algorithm.
Normally I just don't spend a lot of effort on adding alternative
stemmers, but here it makes sense.

It sounds really exciting if you are able to merge in what you have done in
the future!

On Feb 27, 2010 1:16 PM, "Shai Erera" <serera@gmail.com> wrote:

Hi Robert, the EMPTY_ARGS stuff is just in SnowballProgram. I didn't touch
the generated code, apart from handling calls to deprecated API.

We've actually taken the same approach I think :). In my Analyzer, the user
passes a Locale to create the proper Analyzer. The analyzer comes
pre-configured w/ a whole bunch of filters, like those that handle email tokens
produced by the tokenizer (or hosts, acronyms and more), character
normalization, ngram/stemmer filters etc. The StemmerFilter creates the
proper stemmer based on the language code, and for that I created a
SnowballWrapper - that allows me to instantiate Arabic/Hebrew or Snowball
ones. The wrapper is only needed for the stemmer filter instance ...
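
A minimal sketch of the idea of picking a stemmer by language code might look
like the following. Everything here is hypothetical and for illustration only:
the Stemmer interface, StemmerRegistry, StemmerRegistryDemo and the toy English
stemmer are made-up names, not my actual SnowballWrapper and not Lucene or
Snowball API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical common interface that a wrapper could expose to a stemmer filter. */
interface Stemmer {
    String stem(String term);
}

/** Hypothetical registry mapping an ISO language code to a stemmer factory. */
final class StemmerRegistry {
    private final Map<String, Supplier<Stemmer>> factories = new HashMap<>();

    /** Registers a factory for a language code such as "en" or "ar". */
    void register(String languageCode, Supplier<Stemmer> factory) {
        factories.put(languageCode, factory);
    }

    /** Returns a stemmer for the locale's language, or a pass-through if none is registered. */
    Stemmer forLocale(Locale locale) {
        Supplier<Stemmer> factory = factories.get(locale.getLanguage());
        if (factory == null) {
            return term -> term; // unknown language: leave terms unchanged
        }
        return factory.get();
    }
}

final class StemmerRegistryDemo {
    /** Builds a registry with one toy stemmer; a real setup would wrap actual stemmers. */
    static StemmerRegistry buildRegistry() {
        StemmerRegistry registry = new StemmerRegistry();
        // Toy English "stemmer" (assumption): only strips a single trailing "s".
        registry.register("en", () -> term ->
                term.endsWith("s") ? term.substring(0, term.length() - 1) : term);
        return registry;
    }

    public static void main(String[] args) {
        StemmerRegistry registry = buildRegistry();
        System.out.println(registry.forLocale(Locale.ENGLISH).stem("filters"));
        System.out.println(registry.forLocale(Locale.FRENCH).stem("filtres"));
    }
}
```

The point of the indirection is that the filter only ever sees one interface,
so Snowball-based and non-Snowball stemmers can be swapped per language.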

Checking contrib/analyzers is on my TODO list. Unfortunately our legal
department is very suspicious of everything (guess they wouldn't make good
legal folks otherwise ;)). If I want to use the contrib/analyzers,
they'll need to scan the code and identify the owners of the various
analyzers ... That's what's on my TODO - going through the process w/ them
:).

I personally think that the work you're doing on the analyzers is
extraordinary, and since I don't have much time maintaining my own package,
it has fallen a bit behind in terms of Unicode differences and such. I've
come to appreciate the power of open source long ago - for me it'd be best
to join forces on this analysis package. I'm sure that will happen one day
:).

About the Hungarian stemmer - Martin Porter told us that the original (12?)
stemmers were written by him and so there are no IP issues. The rest were
contributed by other people. All but the Hungarian contributor responded w/
their rights to contribute the code. It's just the Hungarian one that never
responded, even though we've sent a couple of emails. That is problematic.
When someone contributes code to Lucene, he grants the ASF a license (I forgot
the wording that's used). That's very reassuring to lawyers, because it
doesn't leave them too exposed. But there isn't any similar process in
Snowball ... I can look up the correspondence we've had with Martin Porter to
refresh my memory on the details.
 Shai

On Sat, Feb 27, 2010 at 5:35 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
> i wanted to continue this...
