From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1689) supplementary character handling
Date: Sat, 13 Jun 2009 02:47:07 -0700 (PDT)
Message-ID: <1335420876.1244886427446.JavaMail.jira@brutus>
In-Reply-To: <899275286.1244831767426.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719104#action_12719104 ]

Michael McCandless commented on LUCENE-1689:
--------------------------------------------

Robert, could you flesh this patch out to a committable point? I.e., handle unpaired surrogates, and add a test case that first shows that LowercaseFilter incorrectly breaks up surrogates?

Thanks!

bq. it depends upon the knowledge that no surrogate pairs lowercase to BMP codepoints

Is it invalid to make this assumption? I.e., does the Unicode standard not guarantee it?

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> For Java 5. Java 5 is based on Unicode 4, which means variable-width encoding.
> Supplementary character support should be fixed for code that works with char/char[].
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc. should at least be changed so they don't actually remove suppl. characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> In all of these cases code should remain optimized for the BMP case, and suppl. characters should be the exception, but still work.
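For readers following the thread, here is a minimal sketch of what code-point-aware lowercasing over a char[] could look like. It is not the attached patch and not Lucene's LowercaseFilter; the class and method names (SupplementaryLowercase, lowerByChar, lowerByCodePoint) are made up for illustration. It uses only java.lang.Character methods available since Java 5, keeps the BMP path as a plain per-char call, passes unpaired surrogates through unchanged, and explicitly guards the assumption questioned above (that a supplementary code point never lowercases to a BMP code point) rather than relying on it.

```java
// Illustrative sketch only; names are hypothetical, not Lucene API.
final class SupplementaryLowercase {
    private SupplementaryLowercase() {}

    /** Per-char lowercasing: each UTF-16 code unit is looked at in isolation. */
    public static void lowerByChar(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
    }

    /** Code-point-aware lowercasing, in place, with a fast path for BMP chars. */
    public static void lowerByCodePoint(char[] buf, int len) {
        int i = 0;
        while (i < len) {
            char c = buf[i];
            boolean validPair = Character.isHighSurrogate(c)
                    && i + 1 < len
                    && Character.isLowSurrogate(buf[i + 1]);
            if (!validPair) {
                // BMP char or unpaired surrogate; surrogates have no case
                // mapping, so an unpaired one is left untouched.
                buf[i] = Character.toLowerCase(c);
                i++;
                continue;
            }
            int lower = Character.toLowerCase(Character.toCodePoint(c, buf[i + 1]));
            // Assumption under discussion: the lowercase form is also
            // supplementary, so it still fits in the same two slots.
            // If that ever fails, leave the pair unchanged instead of
            // writing one char and orphaning the low surrogate.
            if (Character.isSupplementaryCodePoint(lower)) {
                Character.toChars(lower, buf, i);
            }
            i += 2;
        }
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I, encoded as D801 DC00.
        char[] buf = "A\uD801\uDC00B".toCharArray();
        lowerByCodePoint(buf, buf.length);
        // Prints 10428 (DESERET SMALL LETTER LONG I).
        System.out.println(Integer.toHexString(Character.codePointAt(buf, 1)));
    }
}
```

Running lowerByChar over the same input leaves the Deseret letter as U+10400, because Character.toLowerCase(char) returns surrogates unchanged. A test case along those lines would show the per-char behaviour first and the code-point behaviour second.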