Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Wed, 1 Feb 2012 10:08:59 +0000 (UTC)
From: "Christian Moen (Commented) (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: 
 <1144800633.2198.1328090939130.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1002573680.84136.1327636060589.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3726) Default KuromojiAnalyzer to use
 search mode
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197730#comment-13197730 ] 

Christian Moen commented on LUCENE-3726:
----------------------------------------

I split some Japanese Wikipedia text into sentences with a naive sentence segmenter, then tokenized each sentence in both normal and search mode using the Kuromoji version on GitHub that has LUCENE-3730 applied.  Segmentation with Kuromoji in Lucene should be similar overall, modulo some differences in punctuation handling.

The search mode and normal mode segmentations are identical for 90.7% of the sentences, and 99.6% of tokens match at the token level (measured against the normal mode tokens).
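
A minimal sketch of how match rates like these could be computed, assuming each sentence has already been tokenized in both modes into lists of surface forms that cover exactly the same text (so character offsets line up). The class and method names are hypothetical, and this is not the script that produced the numbers above. A normal mode token is counted as matched when search mode produces a token with the same start and end offsets.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Kuromoji or Lucene.
class SegmentationMatch {

  // Converts a token sequence into a set of "start:end" character spans.
  static Set<String> spans(List<String> tokens) {
    Set<String> result = new HashSet<>();
    int offset = 0;
    for (String token : tokens) {
      int end = offset + token.length();
      result.add(offset + ":" + end);
      offset = end;
    }
    return result;
  }

  // Counts normal mode tokens whose span also occurs in the search mode output.
  static int matchingTokens(List<String> normal, List<String> search) {
    Set<String> searchSpans = spans(search);
    int matched = 0;
    int offset = 0;
    for (String token : normal) {
      int end = offset + token.length();
      if (searchSpans.contains(offset + ":" + end)) {
        matched++;
      }
      offset = end;
    }
    return matched;
  }

  // Returns {sentence-level match rate, token-level match rate}.
  // Assumes both lists are non-empty and hold the same sentences in the same order.
  static double[] matchRates(List<List<String>> normal, List<List<String>> search) {
    int identicalSentences = 0;
    int matchedTokens = 0;
    int normalTokens = 0;
    for (int i = 0; i < normal.size(); i++) {
      if (normal.get(i).equals(search.get(i))) {
        identicalSentences++;
      }
      matchedTokens += matchingTokens(normal.get(i), search.get(i));
      normalTokens += normal.get(i).size();
    }
    return new double[] {
      (double) identicalSentences / normal.size(),
      (double) matchedTokens / normalTokens
    };
  }
}
```

Dividing by the number of normal mode tokens mirrors the "when counting normal mode tokens" convention: a compound that search mode splits counts as one mismatched token, not several.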

Please find attached some HTML files with a total of 10,000 sentences that demonstrate the differences in segmentation.

Overall, I think search mode does a decent job.  I've written to someone else doing Japanese NLP to get a second opinion, in particular on whether the kanji splitting should be made somewhat less eager to split three-character words.
                
> Default KuromojiAnalyzer to use search mode
> -------------------------------------------
>
>                 Key: LUCENE-3726
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3726
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.6, 4.0
>            Reporter: Robert Muir
>         Attachments: kuromojieval.tar.gz
>
>
> Kuromoji supports an option to segment text in a way more suitable for search,
> by preventing long compound nouns from being used as indexing terms.
> In general, 'how you segment' can be important depending on the application
> (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this in Chinese).
> The current algorithm adds a penalty to the cost of long runs of kanji, based on some
> parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc).
> Some questions (these can be separate future issues if any useful ideas come out):
> * Should these parameters continue to be static final, or configurable?
> * Should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
> * Is the Tokenizer the best place to do this, or should we do it in a TokenFilter? Or both?
>   With a TokenFilter, one idea would be to also preserve the original indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0).
>   From my understanding, this tends to help with noun compounds in other languages, because the IDF of the original term boosts 'exact' compound matches.
>   But does a TokenFilter provide the segmenter enough 'context' to do this properly?
> Either way, I think as a start we should turn on what we have by default: it's likely a very easy win.
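
As a rough illustration of the search mode penalty mentioned in the issue description, here is a minimal sketch. It assumes the penalty grows linearly with the number of characters beyond a length threshold and applies only to all-kanji candidates. The constant names come from the description, but the values below are illustrative assumptions, the class is hypothetical, and the Han script check is a stand-in for Kuromoji's own character classification. This is not the actual Kuromoji code.

```java
// Hypothetical sketch for illustration; not the actual Kuromoji implementation.
class SearchModePenalty {

  // Illustrative values only (assumptions), chosen to show the shape of the rule.
  static final int SEARCH_MODE_LENGTH = 3;
  static final int SEARCH_MODE_PENALTY = 10000;

  // True if every code point in the surface form is a Han (kanji) character.
  static boolean isAllKanji(String surface) {
    if (surface.isEmpty()) {
      return false;
    }
    for (int i = 0; i < surface.length(); ) {
      int cp = surface.codePointAt(i);
      if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
        return false;
      }
      i += Character.charCount(cp);
    }
    return true;
  }

  // Extra cost added to a candidate token: zero for short or non-kanji tokens,
  // otherwise proportional to how far the token exceeds the length threshold.
  static int penalty(String surface) {
    int length = surface.codePointCount(0, surface.length());
    if (length > SEARCH_MODE_LENGTH && isAllKanji(surface)) {
      return (length - SEARCH_MODE_LENGTH) * SEARCH_MODE_PENALTY;
    }
    return 0;
  }
}
```

With a rule of this shape, a lower SEARCH_MODE_LENGTH makes the segmenter more eager to split kanji compounds, which is the knob the question about three-character words is concerned with.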
