Message-ID: <16342584.90111289262606925.JavaMail.jira@thor>
Date: Mon, 8 Nov 2010 19:30:06 -0500 (EST)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
In-Reply-To: <6255994.176671288632744834.JavaMail.jira@thor>

[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929849#action_12929849 ]

Robert Muir commented on SOLR-2211:
-----------------------------------

Great, I look forward to the results.

By the way, on SOLR-2210 I also added the ICU filters; you could consider replacing LowerCaseFilterFactory with ICUNormalizer2Factory (just use the defaults).
In addition to better lowercasing (e.g. ß -> ss), this would also bring the advantages described in http://unicode.org/reports/tr15/

Alternatively, if you are already using both LowerCaseFilterFactory and ASCIIFoldingFilterFactory, you can replace both with ICUFoldingFilterFactory, which goes further and also incorporates http://www.unicode.org/reports/tr30/tr30-4.html

> Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
> --------------------------------------------------------------------------
>
>                 Key: SOLR-2211
>                 URL: https://issues.apache.org/jira/browse/SOLR-2211
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 3.1
>            Reporter: Tom Burton-West
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: SOLR-2211.patch
>
>
> The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved Unicode processing without necessarily including the IP address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful.
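
To illustrate the "better lowercasing (e.g. ß -> ss)" point in the comment above, here is a minimal sketch that contrasts plain lowercasing with case folding plus NFKC normalization. It uses only the Python standard library as a stand-in; it is not the ICU implementation and not what the Solr factories run internally. The function names `simple_lowercase` and `nfkc_casefold_approx` are hypothetical, chosen for this example. ICU's NFKC_Casefold mode also removes default-ignorable code points, which this approximation does not attempt.

```python
import unicodedata


def simple_lowercase(token: str) -> str:
    """Plain per-character lowercasing, roughly what a lowercase filter does."""
    return token.lower()


def nfkc_casefold_approx(token: str) -> str:
    """Approximate NFKC case folding with the standard library (assumption:
    this is only a stand-in for ICU's NFKC_Casefold, not an exact equivalent).

    Steps: normalize to NFKC, apply full Unicode case folding, then
    normalize again because folding can produce non-normalized sequences.
    """
    normalized = unicodedata.normalize("NFKC", token)
    folded = normalized.casefold()
    return unicodedata.normalize("NFKC", folded)


if __name__ == "__main__":
    samples = ["Straße", "\uFF33\uFF2F\uFF2C\uFF32", "\uFB01lter"]
    for sample in samples:
        print(repr(sample), simple_lowercase(sample), nfkc_casefold_approx(sample))
```

With plain lowercasing, "Straße" keeps its ß, so it would not match a query for "strasse"; case folding maps ß to ss, and the NFKC step additionally maps compatibility forms such as fullwidth letters and the "fi" ligature to their ordinary equivalents.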