Date: Mon, 2 Apr 2012 10:33:22 +0000 (UTC)
From: "Christian Moen (Commented) (JIRA)"
To: dev@lucene.apache.org
Message-ID: <64707800.159.1333362802310.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1589632802.40780.1333152453105.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

[
https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244111#comment-13244111 ]

Christian Moen commented on LUCENE-3940:
----------------------------------------

I'm not familiar with the various considerations that were made with StandardTokenizer, but please allow me to share some comments anyway.

Perhaps it's useful to distinguish between _analysis for information retrieval_ and _analysis for information extraction_ here?

I like Michael's and Steven's idea of doing tokenization that doesn't discard any information. This is certainly useful in the case of _information extraction_. For example, if we'd like to extract noun phrases based on part-of-speech tags, we don't want to conjoin tokens when there's a punctuation character between two nouns (unless that punctuation character is a middle dot).

Robert is of course correct that we generally don't want to index punctuation characters that occur in every document, so from an _information retrieval_ point of view, we'd like punctuation characters removed.

If there's an established convention that Tokenizer variants discard punctuation and produce the terms that are meant to be directly searchable, it sounds like a good idea that we stick to the convention here as well.

If there's no established convention, it seems useful that a Tokenizer would provide as much detail as possible about the input text and leave downstream Filters/Analyzers to remove whatever is suitable for a particular processing purpose. We can provide common ready-to-use Analyzers with reasonable defaults that users can look to, e.g. to process a specific language or do another common high-level task with text.

Hence, perhaps each Tokenizer can decide what makes the most sense to do based on that particular tokenizer's scope of processing?
To Robert's point, this would leave processing totally arbitrary and inconsistent, but this would be _by design_, as it wouldn't be the Tokenizer's role to enforce any overall consistency -- i.e. with regard to punctuation -- higher-level Analyzers would provide that.

Thoughts?

> When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3940
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch, LUCENE-3940.patch
>
>
> I modified BaseTokenStreamTestCase to assert that the start/end
> offsets match for graph (posLen > 1) tokens, and this caught a bug in
> Kuromoji when the decompounding of a compound token has a punctuation
> token that's dropped.
> In this case we should leave hole(s) so that the graph is intact, ie,
> the graph should look the same as if the punctuation tokens were not
> initially removed, but then a StopFilter had removed them.
> This also affects tokens that have no compound over them, ie we fail
> to leave a hole today when we remove the punctuation tokens.
> I'm not sure this is serious enough to warrant fixing in 3.6 at the
> last minute...
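
To make the "leave a hole" idea concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's API: the `Token` class, `isPunctuation`, `dropPunctuation` and `positions` are hypothetical names invented for illustration, and treating any token without a letter or digit as punctuation is an assumption. The sketch imagines a tokenizer that emits every token, punctuation included, and a downstream step that drops the punctuation but folds its position increment into the next kept token, so that positions look the same as if a stop filter had removed it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HoleSketch {

    /** Hypothetical token: its text plus a position increment (not Lucene's API). */
    static final class Token {
        final String text;
        final int posInc;

        Token(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    /** Assumption: a token with no letter or digit in it counts as punctuation. */
    static boolean isPunctuation(String text) {
        return !text.isEmpty()
                && text.codePoints().noneMatch(Character::isLetterOrDigit);
    }

    /**
     * Drops punctuation tokens but carries their position increments over to
     * the next kept token, which leaves a hole at the removed position.
     * Increments of trailing punctuation are simply discarded in this sketch.
     */
    static List<Token> dropPunctuation(List<Token> in) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (isPunctuation(t.text)) {
                skipped += t.posInc;
            } else {
                out.add(new Token(t.text, t.posInc + skipped));
                skipped = 0;
            }
        }
        return out;
    }

    /** Absolute positions implied by the increments, starting at position 0. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        // A tokenizer that discards nothing would emit all three tokens.
        List<Token> all = Arrays.asList(
                new Token("Tokyo", 1), new Token(",", 1), new Token("Osaka", 1));

        List<Token> kept = dropPunctuation(all);
        for (Token t : kept) {
            System.out.println(t.text + " posInc=" + t.posInc);
        }
        System.out.println(positions(kept)); // prints [0, 2]
    }
}
```

With the hole in place, "Tokyo" and "Osaka" sit at positions 0 and 2 rather than 0 and 1, so a consumer that works on adjacency would not treat the two nouns as neighbours.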