From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Date: Tue, 9 Nov 2010 06:12:07 -0500 (EST)
Subject: [jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
Message-ID: <9006480.100781289301127509.JavaMail.jira@thor>
In-Reply-To: <17755171.80791289246946346.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930090#action_12930090 ]

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq.
I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer.

DM, can you elaborate here? Are you speaking of the existing StandardAnalyzer in previous releases, which doesn't properly deal with tokenizing diacritics, etc.? This is the reason these "special" tokenizers exist: to work around those bugs. But StandardTokenizer now handles this stuff fine, and they are obsolete.

I'm confused, though, about how SimpleAnalyzer would ever have been any better in previous releases, since it would barf on these diacritics too: it only emits tokens that are runs of Character.isLetter.

Or is there something else I'm missing here?

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by the UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1.
> Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer.
> For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary.
> Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
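
The char filter idea from the issue description (convert ZWNJ to spaces before the tokenizer sees the text) can be sketched in plain Java. This is only an illustration under stated assumptions: it uses java.io.FilterReader rather than Lucene's own char filter classes, and the names ZwnjToSpaceDemo, ZwnjToSpaceReader, and readAll are hypothetical, not taken from the patch. Because each ZWNJ is replaced by exactly one space, character offsets in the filtered text still line up with the original input.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ZwnjToSpaceDemo {
    // Zero-width non-joiner, U+200C.
    static final char ZWNJ = '\u200C';

    /** Hypothetical stand-in for a char filter: maps each ZWNJ to a space. */
    static class ZwnjToSpaceReader extends FilterReader {
        ZwnjToSpaceReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            return c == ZWNJ ? ' ' : c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // n is -1 at end of stream, in which case the loop body never runs.
            for (int i = off; i < off + n; i++) {
                if (buf[i] == ZWNJ) {
                    buf[i] = ' ';
                }
            }
            return n;
        }
    }

    /** Drains a reader into a String (helper for the demo only). */
    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A Persian word containing one ZWNJ between its two parts.
        String input = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";
        String filtered = readAll(new ZwnjToSpaceReader(new StringReader(input)));
        // After filtering, any whitespace-aware tokenizer sees two pieces.
        System.out.println(filtered.split(" ").length); // prints 2
    }
}
```

In a converted PersianAnalyzer the same mapping would sit in front of StandardTokenizer, so the tokenizer breaks at the inserted space where ArabicLetterTokenizer used to break at the ZWNJ itself.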