lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Fix for Japanese SEN morphological analyzer, and moving into Contrib
Date Mon, 12 Oct 2009 19:11:18 GMT
Mark, I agree with what you said; it would be great if there were a way to
easily enable this Japanese support.

I will let someone else comment on the licensing, but since you mentioned
source dictionaries: I thought Sen only used IPA dic for its data? I could be
wrong on this.

I think it's a BSD-like license; you can read the license here as Google
Chrome prints it (in a separate, really interesting dictionary for CJ
segmentation):

http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/icu38/source/data/brkitr/cjdict.txt

On Mon, Oct 12, 2009 at 2:58 PM, Mark Bennett <mbennett@ideaeng.com> wrote:

> Hello Robert,
>
> That's a good question.  The core SEN is under LGPL, yes.  However, I
> didn't need to make changes to that, though given that there are 2 versions
> floating around, I think it needs a good home.
>
> But the glue-layer is under "Apache 2.0" license, and that's the part that
> needed fixing.  I think that means it's ASF / contrib compatible?
>
> There are also 2 other ancillary libraries and some source dictionaries
> which I need to research.
>
> Working from the other direction, which might give you some ideas:
> The goal is to get this more accessible.  It'd be really nice if, in a
> Lucene distribution, the SEN library could be switched on as easily as CJK.
> Or at the most you'd run an ant script to fetch all the parts and assemble
> it.  As it stands now I think it's not used much because it's a bit complex
> to set up, even prior to May '09's change, and most of the users of it
> discuss it in Japanese.  So that's the goal, I'm very open to ideas on the
> "how".
>
> Mark
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Mon, Oct 12, 2009 at 11:10 AM, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Mark, does this mean Sen will be under the Apache license? (it is
>> currently LGPL)
>>
>>
>> On Mon, Oct 12, 2009 at 1:46 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
>>
>>> Hi folks,
>>>
>>> I've been working to fix the Japanese SEN morphological analyzer, which
>>> is currently hosted at:
>>> https://sen.dev.java.net
>>>
>>> To review, Japanese doesn't use whitespace for word breaks.  The
>>> traditional approach to CJK (Chinese, Japanese, Korean) is to use bigram
>>> character pairs in the index.  While this works to a point, some believe
>>> that using proper word breaks provides better results.
>>>
>>> The "lucene-ja" glue layer between Lucene and the core SEN library broke
>>> in May of '09 when a fix was made in Lucene:
>>> http://issues.apache.org/jira/browse/LUCENE-1636
>>>
>>> Uwe S. had a very good insight for a quick fix, and I have been cleaning
>>> up some other issues with the code.  I have also spoken with the author,
>>> Takashi Okamoto, and he is fine with having this moved from java.net to ASF;
>>> I think it will be easier for folks to find and use it if it's in ASF.
>>>
>>> I'm not quite ready to submit a patch, but the Wiki suggests emailing the
>>> list with the idea in advance.  I'll have some packaging questions, since
>>> there are actually quite a few parts.  Also, the wiki didn't quite spell
>>> out the process to get things into contrib, beyond emailing and submitting a
>>> patch.  I also plan to eventually submit a Solr-specific wrapper to the solr
>>> dev list, to allow for dynamic config changes to be made from Solr's
>>> schema.  But since the original code was Lucene based, and it provides the
>>> broadest reach, I think having it in core Lucene would be a good start.
>>>
>>> Any comments, suggestions, or mentor volunteers?  :-)
>>>
>>> Mark
>>>
>>> --
>>> Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
>>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>


-- 
Robert Muir
rcmuir@gmail.com
