Message-ID: <2702663.63371289147045738.JavaMail.jira@thor>
Date: Sun, 7 Nov 2010 11:24:05 -0500 (EST)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929363#action_12929363 ]

Robert Muir commented on LUCENE-2167:
-------------------------------------

bq.
because when people want full URLs, they can't be reassembled after the separator chars are thrown away by the tokenizer.

Well, I don't much like this argument, because it's true of anything. Indexing text for search is lossy by definition: when you tokenize, you lose paragraphs, sentences, all kinds of structure. Should we output whole paragraphs as tokens so that isn't lost either?

bq. Robert, when I mentioned the decomposition filter, you said you didn't like that idea. Do you still feel the same?

Well, I said it was a can of worms, and I still feel it is complicated, yes. But we do already have a crude decomposition filter (WordDelimiterFilter). Someone can chain it after the UAX#29+URL-recognizing tokenizer to index these URLs in a variety of ways, including preserving the original full URL too.

bq. Would a URL decomposition filter, with full URL emission turned off by default, work here?

It works in theory, but it's confusing that the filter is 'required' just to avoid abysmal tokens. I would prefer we switch the situation around: make plain UAX#29 the 'StandardTokenizer' and give the UAX#29+URL+email+IP+... variant a different name. To me, plain UAX#29 already handles URLs in nice ways, e.g. my user types 'facebook' and they get back facebook.com. It's certainly simple and won't blow up terms dictionaries...
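The contrast being argued here can be sketched with a toy example. This is a crude Python simplification, not Lucene's actual jflex grammar: the regexes below only roughly approximate UAX#29 word boundaries (in particular, allowing interior dots between word characters, which is roughly what the MidNumLet rule does), and the whole-URL tokenizer is a stand-in for a hypothetical URL-recognizing tokenizer.

```python
import re

def word_tokens(text):
    """Rough stand-in for UAX#29-style word segmentation: keep runs of
    letters/digits, allowing single interior dots, so 'www.facebook.com'
    survives as one token while '/', ':', '?' and '=' split the URL."""
    return [t.lower() for t in re.findall(r"\w+(?:\.\w+)*", text)]

def url_as_single_token(text):
    """Stand-in for a URL-recognizing tokenizer that emits each full URL
    (here: any whitespace-delimited run) as one long, unique token."""
    return [t.lower() for t in re.findall(r"\S+", text)]

text = "Visit http://www.facebook.com/profile?id=123 today"

# Word segmentation yields short, reusable terms...
print(word_tokens(text))
# ...while whole-URL tokenization yields one long term that only a
# verbatim query can ever match, and that bloats the terms dictionary.
print(url_as_single_token(text))
```

The point about the terms dictionary follows directly: every distinct URL under the second scheme becomes a distinct term, whereas under word segmentation the same handful of short terms is shared across many URLs.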
Otherwise, creating lots of long, unique tokens (URLs) by default is a serious performance trap, particularly for Lucene 3.x.

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1, 4.0
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere to the standard as closely as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say:
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric handling (the acronym/company/apostrophe rules) can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org