From: "Michael McCandless (JIRA)"
To: dev@lucene.apache.org
Date: Sun, 7 Nov 2010 11:52:09 -0500 (EST)
Subject: [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929368#action_12929368 ]

Michael McCandless commented on LUCENE-2167:
--------------------------------------------

Would it somehow be possible to allow multiple Tokenizers to work together? Today we allow only one (followed by any number of TokenFilters) in the chain, so if your Tokenizer destroys information (e.g. erases the "." from a host name), it's hard for subsequent TokenFilters to put that information back.

But if, say, we had a Tokenizer that recognizes hostnames/URLs, one that recognizes email addresses, one for proper names/places/dates/times, plus other app-dependent pieces like detecting part numbers and whatnot, then ideally you could simply cascade/compose these tokenizers at will to build up whatever "initial" tokenizer you need for your chain. I think our current lack of composability of the initial tokenizer ("there can be only one") makes cases like this hard.
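Purely to illustrate the cascading idea, here is a minimal, self-contained sketch. The Recognizer/CascadingTokenizer classes are hypothetical, not Lucene's actual Tokenizer API, and the regexes are deliberately naive; the point is only that the first matching recognizer wins at each position:

{code:java}
// Hypothetical sketch of a "cascade of recognizers" -- not Lucene's Tokenizer API.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

interface Recognizer {
  /** Returns the end offset of a token starting at pos, or -1 if nothing matches there. */
  int match(String text, int pos);
}

class PatternRecognizer implements Recognizer {
  private final Pattern pattern;
  PatternRecognizer(String regex) { this.pattern = Pattern.compile(regex); }
  public int match(String text, int pos) {
    Matcher m = pattern.matcher(text).region(pos, text.length());
    return m.lookingAt() ? m.end() : -1;   // must match right at pos
  }
}

class CascadingTokenizer {
  private final List<Recognizer> recognizers;
  CascadingTokenizer(List<Recognizer> recognizers) { this.recognizers = recognizers; }

  List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<String>();
    int pos = 0;
    while (pos < text.length()) {
      if (Character.isWhitespace(text.charAt(pos))) { pos++; continue; }
      int end = -1;
      for (Recognizer r : recognizers) {   // first recognizer that matches wins
        end = r.match(text, pos);
        if (end > pos) break;
      }
      if (end <= pos) { pos++; continue; } // nothing matched; skip this character
      tokens.add(text.substring(pos, end));
      pos = end;
    }
    return tokens;
  }
}

public class CascadeDemo {
  public static void main(String[] args) {
    List<Recognizer> chain = new ArrayList<Recognizer>();
    chain.add(new PatternRecognizer("https?://\\S+"));       // URLs/hostnames
    chain.add(new PatternRecognizer("[\\w.+-]+@[\\w.-]+"));   // email addresses
    chain.add(new PatternRecognizer("\\w+"));                 // fallback: plain words
    CascadingTokenizer t = new CascadingTokenizer(chain);
    System.out.println(t.tokenize(
        "Mail dev@lucene.apache.org or see http://lucene.apache.org today"));
    // -> [Mail, dev@lucene.apache.org, or, see, http://lucene.apache.org, today]
  }
}
{code}

Order matters: the more specific recognizers (URLs, emails) have to sit in front of the plain-word fallback, which is roughly the composability knob being asked for above.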
> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1, 4.0
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as closely as we can with JFlex. Then its name would actually make sense.
>
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
>
> bq. This should be a good tokenizer for most European-language documents
>
> The new StandardTokenizer could then say:
>
> bq. This should be a good tokenizer for most languages.
>
> All the English/euro-centric handling, like the acronym/company/apostrophe rules, can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.