Message-ID: <19268409.63361276356194055.JavaMail.jira@thor>
Date: Sat, 12 Jun 2010 11:23:14 -0400 (EDT)
From: "Steven Rowe (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878274#action_12878274 ]

Steven Rowe commented on LUCENE-2167:
-------------------------------------

{quote}
bq. Interesting paper.
With syllable n-grams (in Tibetan anyway), you trade off (quadrupled) index size for word segmentation, but otherwise, these work equally well.

Careful, the way they did the measurement only tells us that neither one is absolute shit, but I don't think it's clear yet they are equal. Either way, the argument in the paper is for bigrams (n=2)...
{quote}

Yes, you're right - fine-grained performance comparisons are inappropriate here. You've said that for other language(s?) a unigram/bigram combination works best - too bad they didn't test that here.

bq. how is this quadrupled index size? it's just like CJKTokenizer...

From the paper:

{quote}
As has been observed in other languages [Miller et al., 2000], ngram indexing resulted in explosive growth in the number of terms with increasing n. The index size for word-based indexing was less than one quarter of that of syllable bigrams.
{quote}

bq. In general I'd like to think that UAX#29 sentence segmentation, implemented nicely, would be a cool feature that could help with some of these problems, and maybe other problems too.

You mentioned it would be useful to eliminate phrase matches across sentence boundaries. What other problems would it solve?
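To make the term-growth point from the paper concrete, here is a minimal sketch that counts distinct terms under unigram and overlapping-bigram indexing. It is not the paper's method and not Lucene code: the corpus is a made-up toy, the function names `ngrams` and `distinct_terms` are hypothetical, and it counts distinct terms only, which is a rough proxy for index size rather than a measurement of it.

```python
def ngrams(units, n):
    """Return the overlapping n-grams of a sequence of units, as tuples."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def distinct_terms(docs, n):
    """Collect the set of distinct n-gram terms across all documents."""
    terms = set()
    for units in docs:
        terms.update(ngrams(units, n))
    return terms


# Hypothetical toy corpus: each document is a list of syllable-like units.
docs = [
    ["ka", "ba", "ka", "da"],
    ["ba", "da", "ka", "ba"],
    ["da", "ka", "da", "ba"],
]

unigrams = distinct_terms(docs, 1)
bigrams = distinct_terms(docs, 2)

# Distinct term counts for n=1 and n=2 on the toy corpus.
print(len(unigrams), len(bigrams))
```

With only three distinct units, the bigram vocabulary is already bounded by nine possible pairs; on a real syllable inventory the gap between the two counts would be expected to widen much further, which is the effect the quoted passage describes.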
> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with jflex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay with that EuropeanTokenizer, and it could be used by the european analyzers.