Date: Tue, 9 Nov 2010 08:56:08 -0500 (EST)
From: "DM Smith (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119 ]

DM Smith commented on LUCENE-2747:
----------------------------------

bq. DM, can you elaborate here?

I was a bit trigger-happy with the comment. I should have looked at the code rather than the JIRA comments alone.

The old StandardAnalyzer took a kitchen-sink approach to tokenization, trying to do too much with *modern* constructs, e.g. URLs, email addresses, acronyms, and so on. It and SimpleAnalyzer would produce about the same token stream on "old" English and some other texts, but StandardAnalyzer was much slower. (I don't remember how much slower, but it was obvious.) Both were weak when it came to non-English/non-Western texts. Thus I could take the language-specific tokenizers, stop word lists, and stemmers and create variations of SimpleAnalyzer that properly handled a particular language. (I created my own analyzers because I wanted to make stop words and stemming optional; a sketch of that pattern follows below.)

Looking at the code in trunk (I should have done that before commenting), I see that the UAX29Tokenizer grammar is duplicated in StandardAnalyzer's JFlex, and that ClassicAnalyzer uses the old JFlex grammar. Also, the new StandardAnalyzer does a lot less. If I understand the suggestion of this and the other two issues, StandardAnalyzer will no longer handle modern constructs. As I see it, this is what SimpleAnalyzer should be: based on UAX#29, doing little else. Thus my confusion: is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)
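For what it's worth, here is a minimal sketch of that pattern against the Lucene 3.1 analysis API. The class name and the optional stop-word/stemming knobs are my own invention, purely for illustration, not code from any Lucene module:

{code:java}
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter; // contrib/snowball
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

/** A SimpleAnalyzer-like analyzer where stop words and stemming are optional. */
public class OptionalFilteringAnalyzer extends Analyzer {
  private final Set<?> stopWords;   // null means "no stop word removal"
  private final String stemmerName; // Snowball stemmer name, e.g. "English"; null means "no stemming"

  public OptionalFilteringAnalyzer(Set<?> stopWords, String stemmerName) {
    this.stopWords = stopWords;
    this.stemmerName = stemmerName;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // UAX#29-based tokenization, then lowercasing, then the optional filters.
    TokenStream ts = new StandardTokenizer(Version.LUCENE_31, reader);
    ts = new LowerCaseFilter(Version.LUCENE_31, ts);
    if (stopWords != null) {
      ts = new StopFilter(Version.LUCENE_31, ts, stopWords);
    }
    if (stemmerName != null) {
      ts = new SnowballFilter(ts, stemmerName);
    }
    return ts;
  }
}
{code}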
And if I understand where this is going: would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer inside StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation?

> Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-2747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2747.patch, LUCENE-2747.patch
>
>
> As of Lucene 3.1, StandardTokenizer implements the UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by the UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1.
>
> Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules ZWNJ is not a word boundary. Robert Muir has suggested using a char filter that converts ZWNJ to spaces ahead of StandardTokenizer in the converted PersianAnalyzer.
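As an aside, the char filter idea mentioned above for the converted PersianAnalyzer could look roughly like the following against the Lucene 3.x API. This is my own illustrative sketch built on MappingCharFilter, not the actual PersianAnalyzer code:

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class ZwnjToSpaceExample {
  /** Maps ZWNJ (U+200C) to a space so UAX#29 sees a word boundary there. */
  static TokenStream tokenize(Reader reader) {
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " "); // ZWNJ -> space, applied before tokenization
    Reader filtered = new MappingCharFilter(map, CharReader.get(reader));
    return new StandardTokenizer(Version.LUCENE_31, filtered);
  }
}
{code}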