lucene-dev mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Re: Fix for Japanese SEN morphological analyzer, and moving into Contrib
Date Mon, 12 Oct 2009 18:58:54 GMT
Hello Robert,

That's a good question.  Yes, the core SEN library is under the LGPL.
However, I didn't need to make changes to it, though given that there are
2 versions floating around, I think it needs a good home.

But the glue-layer is under "Apache 2.0" license, and that's the part that
needed fixing.  I think that means it's ASF / contrib compatible?

There are also 2 other ancillary libraries and some source dictionaries
which I need to research.

Working from the other direction, which might give you some ideas:
The goal is to get this more accessible.  It'd be really nice if, in a
Lucene distribution, the SEN library could be switched on as easily as CJK.
Or at the most you'd run an ant script to fetch all the parts and assemble
it.  As it stands now, I think it's not used much because it's a bit complex
to set up, even prior to the May '09 change, and most of its users discuss
it in Japanese.  So that's the goal; I'm very open to ideas on the "how".

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Mon, Oct 12, 2009 at 11:10 AM, Robert Muir <rcmuir@gmail.com> wrote:

> Mark, does this mean Sen will be under the Apache license? (it is currently
> LGPL)
>
>
> On Mon, Oct 12, 2009 at 1:46 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
>
>> Hi folks,
>>
>> I've been working to fix the Japanese SEN morphological analyzer, which is
>> currently hosted at:
>> https://sen.dev.java.net
>>
>> To review, Japanese doesn't use whitespace for word breaks.  The
>> traditional approach to CJK (Chinese, Japanese, Korean) is to use bigram
>> character pairs in the index.  While this works to a point, some believe
>> that using proper word breaks provides better results.
>>
>> The "lucene-ja" glue layer between Lucene and the core SEN library broke
>> in May of '09 when a fix was made in Lucene:
>> http://issues.apache.org/jira/browse/LUCENE-1636
>>
>> Uwe S. had a very good insight for a quick fix, and I have been cleaning
>> up some other issues with the code.  I have also spoken with the author,
>> Takashi Okamoto, and he is fine with having this moved from java.net to
>> ASF; I think it will be easier for folks to find and use it if it's in ASF.
>>
>> I'm not quite ready to submit a patch, but the Wiki suggests emailing the
>> list with the idea in advance.  I'll have some packaging questions, since
>> there are actually quite a few parts.  Also, the Wiki didn't quite spell
>> out the process to get things into contrib, beyond emailing and submitting a
>> patch.  I also plan to eventually submit a Solr-specific wrapper to the Solr
>> dev list, to allow for dynamic config changes to be made from Solr's
>> schema.  But since the original code was Lucene based, and it provides the
>> broadest reach, I think having it in core Lucene would be a good start.
>>
>> Any comments, suggestions, or mentor volunteers?  :-)
>>
>> Mark
>>
>> --
>> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
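
For readers unfamiliar with the bigram approach mentioned in the quoted message, here is a minimal sketch of what "bigram character pairs" means for unsegmented Japanese text. It is only an illustration, not the code of Lucene's CJK analyzer or of SEN; the class name BigramSketch and the method bigrams are hypothetical, and it assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Hypothetical helper (not a Lucene or SEN API): splits text into
    // overlapping two-character tokens, working on Unicode code points.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        if (cps.length == 1) {
            // A lone character has no pair, so emit it as its own token.
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            // Each token is the character at i plus the one that follows it.
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [日本, 本語, 語形, 形態, 態素]
        System.out.println(bigrams("日本語形態素"));
    }
}
```

None of the tokens is chosen with any knowledge of where the words actually begin and end; every adjacent pair is indexed. A morphological analyzer such as SEN instead tries to emit real words, which is the difference the quoted message describes.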
