Message-ID: <9686967.2041273519593488.JavaMail.jira@thor>
Date: Mon, 10 May 2010 15:26:33 -0400 (EDT)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865879#action_12865879 ]

Robert Muir commented on LUCENE-2167:
-------------------------------------

bq.
A filter that breaks URL type tokens into their components, and then adds them as overlapping tokens, or replaces the full URL with the components, should be easy to write, though.

I'm not sure. For this to really work for non-English text, the filter would have to recognize and normalize Punycode representations of internationalized domain names, etc. So while it's a good idea, maybe it is a can of worms and better left alone for now?

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org