Message-ID: <697676306.1244902987563.JavaMail.jira@brutus>
Date: Sat, 13 Jun 2009 07:23:07 -0700 (PDT)
From: "Simon Willnauer (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1689) supplementary character handling
In-Reply-To: <899275286.1244831767426.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719125#action_12719125 ]

Simon Willnauer commented on
LUCENE-1689:
-----------------------------------------

The scary thing is that this already happens if you run Lucene on a 1.5 VM, even without introducing 1.5 code.
I think we need to act on this issue ASAP and release the fix together with 3.0. -> full support for Unicode 4.0 in Lucene 3.0

I have also thought about the implementation a little. The need to detect chars > BMP and operate on them might be spread out across Lucene (quite a few analyzers, filters, etc.). Performance could truly suffer if this is done "wrong" or done more than once. It might be worth considering making the detection pluggable, with an initial filter that only checks whether surrogates are present in a token and sets an indicator on the token representation, so that subsequent TokenStreams can operate on it without rechecking. This would also preserve performance for those who do not need chars > BMP (which could be quite a large number of people). They could then simply not supply such an initial filter.

Just a couple of random thoughts.

> supplementary character handling
> --------------------------------
>
>                 Key: LUCENE-1689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1689
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1689_lowercase_example.txt
>
>
> for Java 5. Java 5 is based on unicode 4, which means variable-width encoding.
> supplementary character support should be fixed for code that works with char/char[]
> For example:
> StandardAnalyzer, SimpleAnalyzer, StopAnalyzer, etc should at least be changed so they don't actually remove suppl characters, or modified to look for surrogates and behave correctly.
> LowercaseFilter should be modified to lowercase suppl. characters correctly.
> CharTokenizer should either be deprecated or changed so that isTokenChar() and normalize() use int.
> in all of these cases code should remain optimized for the BMP case, and suppl characters should be the exception, but still work.
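
To make the "detect once, then reuse the indicator" idea from the comment more concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class name SupplementaryChars and both method names are hypothetical, and the boolean returned by containsSurrogates() merely stands in for whatever indicator a token representation might carry. It uses only java.lang.Character methods available since Java 5, and it assumes that lowercasing a code point never changes how many chars it occupies.

```java
public final class SupplementaryChars {
    private SupplementaryChars() {}

    /** Returns true if buf[off, off+len) contains any surrogate code unit. */
    public static boolean containsSurrogates(char[] buf, int off, int len) {
        for (int i = off; i < off + len; i++) {
            char c = buf[i];
            if (c >= Character.MIN_SURROGATE && c <= Character.MAX_SURROGATE) {
                return true;
            }
        }
        return false;
    }

    /**
     * Lowercases buf[off, off+len) in place.
     * hasSurrogates is the precomputed indicator: when it is false, the
     * cheap char-by-char loop is used (the BMP-only fast path).
     */
    public static void toLowerCase(char[] buf, int off, int len, boolean hasSurrogates) {
        int end = off + len;
        if (!hasSurrogates) {
            for (int i = off; i < end; i++) {
                buf[i] = Character.toLowerCase(buf[i]);
            }
            return;
        }
        // Slow path: walk by code point so surrogate pairs are mapped as a unit.
        // Assumption: the lowercased code point needs the same number of chars.
        for (int i = off; i < end; ) {
            int cp = Character.codePointAt(buf, i, end);
            i += Character.toChars(Character.toLowerCase(cp), buf, i);
        }
    }
}
```

A caller would run containsSurrogates() once per token and pass the result on to every later stage, so that BMP-only text never pays for code point decoding.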