lucene-dev mailing list archives

From "Christian Moen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3726) Default KuromojiAnalyzer to use search mode
Date Wed, 01 Feb 2012 10:08:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197730#comment-13197730 ]

Christian Moen commented on LUCENE-3726:
----------------------------------------

I've segmented some Japanese Wikipedia text into sentences (using a naive sentence segmenter)
and then segmented each sentence using both normal and search mode with the Kuromoji on GitHub
that has LUCENE-3730 applied.  Segmentation with Kuromoji in Lucene should be similar overall
(modulo some differences in punctuation handling).

Search mode and normal mode segmentation match completely in 90.7% of the sentences segmented
and there's a 99.6% match at the token level (when counting normal mode tokens).

Find attached some HTML files with a total of 10,000 sentences that demonstrate the differences
in segmentation.

Overall, I think search mode does a decent job.  I've written to someone else doing Japanese
NLP to get a second opinion, in particular on whether the kanji splitting should be made somewhat
less eager to split three-letter words.
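For what it's worth, agreement figures like the two above can be computed along these lines (a hypothetical sketch; the class name, method names, and the exact token-level metric are my own guesses, not code from this evaluation — I read "counting normal mode tokens" as LCS over the two token streams divided by the normal-mode token count):

```java
import java.util.List;

// Hypothetical sketch of the two agreement metrics: a sentence "matches"
// when normal and search mode produce identical token lists; token-level
// agreement is the longest common subsequence of the two token streams
// divided by the normal-mode token count.
public class SegmentationAgreement {

    // Fraction of sentences whose token lists are identical in both modes.
    public static double sentenceAgreement(List<List<String>> normal,
                                           List<List<String>> search) {
        int matches = 0;
        for (int i = 0; i < normal.size(); i++) {
            if (normal.get(i).equals(search.get(i))) {
                matches++;
            }
        }
        return (double) matches / normal.size();
    }

    // Token-level agreement for one sentence, via LCS over the token streams.
    public static double tokenAgreement(List<String> normal, List<String> search) {
        int[][] lcs = new int[normal.size() + 1][search.size() + 1];
        for (int i = 1; i <= normal.size(); i++) {
            for (int j = 1; j <= search.size(); j++) {
                lcs[i][j] = normal.get(i - 1).equals(search.get(j - 1))
                        ? lcs[i - 1][j - 1] + 1
                        : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
            }
        }
        return (double) lcs[normal.size()][search.size()] / normal.size();
    }
}
```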
                
> Default KuromojiAnalyzer to use search mode
> -------------------------------------------
>
>                 Key: LUCENE-3726
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3726
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.6, 4.0
>            Reporter: Robert Muir
>         Attachments: kuromojieval.tar.gz
>
>
> Kuromoji supports an option to segment text in a way more suitable for search,
> by avoiding long compound nouns as indexing terms.
> In general 'how you segment' can be important depending on the application 
> (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this in Chinese)
> The current algorithm penalizes the cost for long runs of kanji, based on some parameters
> (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc).
> Some questions (these can be separate future issues if any useful ideas come out):
> * should these parameters continue to be static-final, or configurable?
> * should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
> * is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?
>   with a tokenfilter, one idea would be to also preserve the original indexing term,
>   overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
>   from my understanding this tends to help with noun compounds in other languages, because
>   IDF of the original term boosts 'exact' compound matches.
>   but does a tokenfilter provide the segmenter enough 'context' to do this properly?
> Either way, I think as a start we should turn on what we have by default: it's likely
> a very easy win.
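
The overlap idea from the issue description (ABCD -> AB, CD, ABCD with posInc=0) can be sketched outside the Lucene TokenStream API like this (illustrative only; the Token record and decompoundWithOriginal are made-up names for this sketch, not Lucene classes):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the overlapping decompound idea: emit the parts
// the segmenter produced, then the original compound at the same position
// as the last part (position increment 0), so exact compound matches can
// still benefit from the original term's IDF.
public class OverlapDecompound {

    // Minimal stand-in for a Lucene token: surface form plus position increment.
    public record Token(String text, int posInc) {}

    // 'parts' stands in for whatever split the segmenter chose for 'compound'.
    public static List<Token> decompoundWithOriginal(String compound, List<String> parts) {
        List<Token> out = new ArrayList<>();
        for (String part : parts) {
            out.add(new Token(part, 1));  // each part advances the position
        }
        out.add(new Token(compound, 0)); // original overlaps the last part
        return out;
    }
}
```

With the overlap in place, an exact query for the full compound still matches the original term, while queries for the parts match the decompounded tokens.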

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

