lucene-dev mailing list archives

From Shai Erera
Subject Re: snowball discussion on LUCENE-2285
Date Sat, 27 Feb 2010 18:16:17 GMT
Hi Robert, the EMPTY_ARGS stuff is just in SnowballProgram. I didn't touch
the generated code, besides handling calling deprecated API.
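
A minimal sketch of the EMPTY_ARGS idea, not the actual patch: a class that
invokes no-argument methods reflectively can share one static empty array
instead of allocating 'new Object[0]' on every call. ReflectiveCaller and
invokeNoArg are hypothetical names used only for illustration; only
EMPTY_ARGS comes from the message above.

```java
import java.lang.reflect.Method;

// Hypothetical stand-in for a class that calls no-arg methods via reflection.
// This is NOT the real SnowballProgram source, only an illustration.
class ReflectiveCaller {
    // Allocated once and shared; a zero-length array has no elements to
    // mutate, so sharing it between calls is safe.
    private static final Object[] EMPTY_ARGS = new Object[0];

    // Looks up a public no-arg method by name and invokes it on the target.
    static Object invokeNoArg(Object target, String methodName) {
        try {
            Method m = target.getClass().getMethod(methodName);
            // Previously: m.invoke(target, new Object[0]) on every call.
            return m.invoke(target, EMPTY_ARGS);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot invoke " + methodName, e);
        }
    }
}
```

For example, ReflectiveCaller.invokeNoArg("abc", "length") calls
String.length() reflectively without allocating a new argument array.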

We've actually taken the same approach I think :). In my Analyzer, the user
passes a Locale to create the proper Analyzer. The analyzer comes
pre-configured w/ a whole bunch of filters, like those that handle email tokens
produced by the tokenizer (or hosts, acronyms and more), character
normalization, ngram/stemmer filters etc. The StemmerFilter creates the
proper stemmer based on the language code, and for that I created a
SnowballWrapper - that allows me to instantiate Arabic/Hebrew or Snowball
ones. The wrapper is only needed for the stemmer filter instance ...
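
The wrapper arrangement described above could look roughly like the sketch
below. It is an assumption-laden illustration, not the real code: the Stemmer
interface, StemmerFactory, IdentityStemmer and ToySnowballStemmer are
hypothetical names, the toy class only mimics the setCurrent/stem/getCurrent
calling style of a Snowball-generated stemmer, and the stemming logic itself
is a placeholder.

```java
import java.util.Locale;

// Hypothetical common interface so callers need not know which stemmer runs.
interface Stemmer {
    String stem(String term);
}

// Stand-in for a Snowball-generated stemmer (assumption: it mimics the
// setCurrent/stem/getCurrent style). The "algorithm" is a toy that strips
// one trailing 's' from words longer than three characters.
class ToySnowballStemmer {
    private String current = "";

    void setCurrent(String value) { current = value; }

    String getCurrent() { return current; }

    boolean stem() {
        if (current.length() > 3 && current.endsWith("s")) {
            current = current.substring(0, current.length() - 1);
            return true;
        }
        return false;
    }
}

// Adapter that exposes a Snowball-style stemmer through the Stemmer interface.
class SnowballWrapper implements Stemmer {
    private final ToySnowballStemmer delegate;

    SnowballWrapper(ToySnowballStemmer delegate) { this.delegate = delegate; }

    @Override
    public String stem(String term) {
        delegate.setCurrent(term);
        delegate.stem();
        return delegate.getCurrent();
    }
}

// Placeholder for a hand-written (non-Snowball) stemmer, e.g. Arabic or
// Hebrew. It returns its input unchanged; a real one would do actual work.
class IdentityStemmer implements Stemmer {
    @Override
    public String stem(String term) { return term; }
}

// Picks a stemmer from the language code of the Locale the user passes in.
class StemmerFactory {
    static Stemmer forLocale(Locale locale) {
        switch (locale.getLanguage()) {
            case "ar":
            case "he":
            case "iw": // legacy code some Java versions report for Hebrew
                return new IdentityStemmer();
            default:
                return new SnowballWrapper(new ToySnowballStemmer());
        }
    }
}
```

A stemmer filter would then hold a single Stemmer obtained from
StemmerFactory.forLocale(locale) and call stem() per token, without caring
whether a Snowball or a hand-written implementation sits behind it.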

Checking contrib/analyzers is on my TODO list. Unfortunately our legal
department is very suspicious of everything (guess they wouldn't make good
legal folks otherwise ;)). If I want to use the contrib/analyzers,
they'll need to scan the code and identify the owners of the various
analyzers ... That's what's on my TODO - going through the process w/ them.

I personally think that the work you're doing on the analyzers is
extraordinary, and since I don't have much time maintaining my own package,
it has fallen a bit behind in terms of Unicode differences and such. I've
come to appreciate the power of open source long ago - for me it'd be best
to join forces on this analysis package. I'm sure that will happen one day.

About the Hungarian stemmer - Martin Porter told us that the original (12?)
stemmers were written by him and so there are no IP issues. The rest were
contributed by other people. All but the Hun contributor responded w/ their
rights to contribute the code. It's just the Hun that never responded, even
though we've sent a couple of emails. That is problematic. When someone
contributes code to Lucene, he grants the ASF a license (forgot the wording
that's used). That's very reassuring to lawyers, because it doesn't leave
them too exposed. But there isn't any similar process in Snowball ... I can
look up the correspondence we've had with Martin Porter to refresh my memory
on the details.
On Sat, Feb 27, 2010 at 5:35 PM, Robert Muir wrote:

> i wanted to continue this here to not clog up the issue!
> Shai Erera commented on LUCENE-2285:
>> bq. I'd be curious to know what you did
>> Ok, now you've made me compare the two :). I'm happy to see we both did
>> the same thing, only you call your buffer 'current' while I call it 'buf'.
>> Besides that I've included a static final EMPTY_ARGS instead of all the
>> places where 'new Object[0]' is passed. Nothing too fancy.
> hmm, i didnt think of this second optimization, does it affect generated
> code or is it in Among/SnowballProgram?
>> Another thing is that I wrote an Arabic and Hebrew stemmer, and combined
>> them w/ the Snowball ones by introducing a stemmer class which can be either
>> Snowball or anything else. I'll check if we're allowed to contribute the
>> Hebrew stemmer to Lucene ...
> please do.  as far as integration goes, i guess we took a different
> approach with LUCENE-2055 (from the Analyzer perspective, the user does not
> care if it uses snowball or something else behind the scenes, etc).
>> BTW FYI - our legal department forbid us to use the Hungarian stemmer
>> because of licensing/legal issues. Besides the stemmers that were originally
>> provided, the Snowball project accepted additional ones like the Hungarian
>> stemmer. However, for that one we weren't able to get a confirmation from
>> the contributor that his University indeed gave him permission to contribute the
>> code. Don't know if it matters to anyone here (I've notified Martin Porter
>> as well), but FYI. Our legal department does not permit us to use it (which
>> is not surprising - they are legal ...). I don't want to derail this issue
>> into Snowball discussion, so if you want to talk about it, I suggest we move
>> it to the list.
> this is concerning to me, i mean the thing is sitting there on the
> universities website: :)
> but if apache is concerned about this situation too, someone let me know
> and i can take savoy's (clearly marked BSD) and we can add that instead, and
> remove the ambiguous snowball one, even if its temporary:
> --
> Robert Muir
