Date: Wed, 1 Feb 2012 10:08:59 +0000 (UTC)
From: "Christian Moen (Commented) (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <1144800633.2198.1328090939130.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1002573680.84136.1327636060589.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3726) Default KuromojiAnalyzer to use search mode

[ 
https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197730#comment-13197730 ]

Christian Moen commented on LUCENE-3726:
----------------------------------------

I've segmented some Japanese Wikipedia text into sentences (using a naive sentence segmenter) and then segmented each sentence in both normal and search mode, using the Kuromoji on GitHub that has LUCENE-3730 applied. Segmentation with Kuromoji in Lucene should be similar overall (modulo some differences in punctuation handling).

Search mode and normal mode segmentation match completely in 90.7% of the sentences, and there's a 99.6% match at the token level (when counting normal mode tokens).

Find attached some HTML files with a total of 10,000 sentences that demonstrate the differences in segmentation.

Overall, I think search mode does a decent job. I've written to someone else doing Japanese NLP to get a second opinion, in particular on whether the kanji splitting should be made somewhat less eager to split three-letter words.

> Default KuromojiAnalyzer to use search mode
> -------------------------------------------
>
>                 Key: LUCENE-3726
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3726
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.6, 4.0
>            Reporter: Robert Muir
>         Attachments: kuromojieval.tar.gz
>
>
> Kuromoji supports an option to segment text in a way more suitable for search,
> by preventing long compound nouns as indexing terms.
> In general 'how you segment' can be important depending on the application
> (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this in Chinese)
> The current algorithm punishes the cost based on some parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc)
> for long runs of kanji.
> Some questions (these can be separate future issues if any useful ideas come out):
> * should these parameters continue to be static-final, or configurable?
> * should POS also play a role in the algorithm (can/should we refine exactly what we decompound)?
> * is the Tokenizer the best place to do this, or should we do it in a tokenfilter? or both?
>   with a tokenfilter, one idea would be to also preserve the original indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0)
>   from my understanding this tends to help with noun compounds in other languages, because IDF of the original term boosts 'exact' compound matches.
>   but does a tokenfilter provide the segmenter enough 'context' to do this properly?
> Either way, I think as a start we should turn on what we have by default: it's likely a very easy win.
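
The sentence-level and token-level match figures in the comment can be computed along the following lines. This is a minimal sketch, not the script that produced the 90.7% and 99.6% numbers: the function names (token_spans, match_rates) are hypothetical, and it assumes a normal mode token counts as matched when search mode emits a token with the same character span, which is only one plausible reading of "when counting normal mode tokens". The sample sentences are made up for illustration.

```python
def token_spans(tokens):
    """Convert a token list into (start, end) character offsets."""
    spans = []
    offset = 0
    for token in tokens:
        spans.append((offset, offset + len(token)))
        offset += len(token)
    return spans


def match_rates(normal, search):
    """Compare two segmentations of the same sentences.

    normal, search: lists of token lists, one entry per sentence, same order.
    Returns (sentence_rate, token_rate), where token_rate is counted over
    normal mode tokens (an assumption, see the note above).
    """
    if len(normal) != len(search):
        raise ValueError("both modes must cover the same sentences")
    identical = 0  # sentences segmented identically in both modes
    total = 0      # normal mode tokens seen
    matched = 0    # normal mode tokens whose span also appears in search mode
    for n_tokens, s_tokens in zip(normal, search):
        if n_tokens == s_tokens:
            identical += 1
        search_spans = set(token_spans(s_tokens))
        for span in token_spans(n_tokens):
            total += 1
            if span in search_spans:
                matched += 1
    sentence_rate = identical / len(normal) if normal else 0.0
    token_rate = matched / total if total else 0.0
    return sentence_rate, token_rate


# Illustrative input only; not taken from the attached evaluation files.
normal = [["関西国際空港", "に", "行く"], ["東京", "は", "晴れ"]]
search = [["関西", "国際", "空港", "に", "行く"], ["東京", "は", "晴れ"]]
sentence_rate, token_rate = match_rates(normal, search)
print(f"sentences identical: {sentence_rate:.1%}, tokens matched: {token_rate:.1%}")
```

Comparing character spans rather than token strings keeps a repeated token from being counted as a match at the wrong place in the sentence.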
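
The tokenfilter idea from the quoted issue (ABCD -> AB, CD, ABCD(posInc=0)) can be illustrated with plain (term, position increment) pairs. This is a sketch only: it does not use the Lucene TokenFilter API, the helper names are hypothetical, and it follows the emission order given in the example, so the original compound shares its position with the last part.

```python
def decompound_with_original(parts):
    """Emit each part with posInc=1, then the original compound with posInc=0.

    parts: the pieces a segmenter split the compound into, e.g. ["AB", "CD"].
    A term that was not split is emitted once, with no overlapping duplicate.
    """
    tokens = [(part, 1) for part in parts]
    if len(parts) > 1:
        tokens.append(("".join(parts), 0))
    return tokens


def absolute_positions(tokens):
    """Turn position increments into absolute positions, starting from 0."""
    position = -1
    placed = []
    for term, pos_inc in tokens:
        position += pos_inc
        placed.append((term, position))
    return placed


tokens = decompound_with_original(["AB", "CD"])
positions = absolute_positions(tokens)
print(tokens)
print(positions)
```

With posInc=0 the compound ABCD lands on the same position as CD, so the index holds both the parts and the original term without shifting the positions of the tokens that follow.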