lucene-dev mailing list archives

From Robert Muir <>
Subject Re: Fix for Japanese SEN morphological analyzer, and moving into Contrib
Date Mon, 12 Oct 2009 19:11:18 GMT
Mark, I agree with what you said; it would be great if there were a way to
easily enable this Japanese support.

I will let someone else comment on the licensing, but since you mentioned
source dictionaries, I thought Sen only used IPA dic for its data? I could be
wrong on this.

I think it's a BSD-like license; here you can read the license as Google
Chrome prints it... (in a separate really interesting dictionary for CJ

On Mon, Oct 12, 2009 at 2:58 PM, Mark Bennett <> wrote:

> Hello Robert,
> That's a good question.  The core SEN is under LGPL, yes.  However, I
> didn't need to make changes to that, though given that there are 2 versions
> floating around, I think it needs a good home.
> But the glue-layer is under "Apache 2.0" license, and that's the part that
> needed fixing.  I think that means it's ASF / contrib compatible?
> There are also 2 other ancillary libraries and some source dictionaries
> which I need to research.
> Working from the other direction, which might give you some ideas:
> The goal is to get this more accessible.  It'd be really nice if, in a
> Lucene distribution, the SEN library could be switched on as easily as CJK.
> Or at the most you'd run an ant script to fetch all the parts and assemble
> it.  As it stands now I think it's not used much because it's a bit complex
> to set up, even prior to May '09's change, and most of the users of it
> discuss it in Japanese.  So that's the goal, I'm very open to ideas on the
> "how".
> Mark
> --
> Mark Bennett / New Idea Engineering, Inc. /
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> On Mon, Oct 12, 2009 at 11:10 AM, Robert Muir <> wrote:
>> Mark, does this mean Sen will be under the Apache license? (it is
>> currently LGPL)
>> On Mon, Oct 12, 2009 at 1:46 PM, Mark Bennett <> wrote:
>>> Hi folks,
>>> I've been working to fix the Japanese SEN morphological analyzer, which
>>> is currently hosted at:
>>> To review, Japanese doesn't use whitespace for word breaks.  The
>>> traditional approach to CJK (Chinese, Japanese, Korean) is to use bigram
>>> character pairs in the index.  While this works to a point, some believe
>>> that using proper word breaks provides better results.
>>> The "lucene-ja" glue layer between Lucene and the core SEN library broke
>>> in May of '09 when a fix was made in Lucene:
>>> Uwe S. had a very good insight for a quick fix, and I have been cleaning
>>> up some other issues with the code.  I have also spoken with the author Takashi
>>> Okamoto and he is fine to have this moved from to ASF; I think
>>> it will be easier for folks to find and use it if it's in ASF.
>>> I'm not quite ready to submit a patch, but the Wiki suggests emailing the
>>> list with the idea in advance.  There are some packaging questions I'll
>>> have; there are actually quite a few parts.  Also, the wiki didn't quite spell
>>> out the process to get things into contrib, beyond emailing and submitting a
>>> patch.  I also plan to eventually submit a Solr-specific wrapper to the solr
>>> dev list, to allow for dynamic config changes to be made from Solr's
>>> schema.  But since the original code was Lucene based, and it provides the
>>> broadest reach, I think having it in core Lucene would be a good start.
>>> Any comments, suggestions, or mentor volunteers?  :-)
>>> Mark
>>> --
>>> Mark Bennett / New Idea Engineering, Inc. /
>>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>> --
>> Robert Muir

Robert Muir
