Date: Thu, 29 Oct 2009 18:07:59 +0000 (UTC)
From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing

[ https://issues.apache.org/jira/browse/LUCENE-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771525#action_12771525 ]

Michael McCandless commented on 
LUCENE-2016:
--------------------------------------------

bq. Finally, I completely disagree with the nontrivial performance comment. The trick is to make sure the execution branch / checks for the process-internal characters outside the BMP only occur for surrogate pairs. They are statistically very rare and, if done right, it will not affect performance of BMP content.

OK, I agree, you're right: we could in fact do this with negligible impact on performance.

bq. It's my understanding Lucene indexes should be portable to different programming languages: perhaps my implementation in C/perl/python decides to use a different process-internal character, this is allowed by Unicode and I think we should adhere to it, I don't think it's being anal.

But if we forcefully map all invalid-for-interchange Unicode characters to the replacement character (I think that's what's being proposed, right?), then your app no longer has any characters it can use for its own "internal" purposes?

Can you open a new issue to track this? This is a wider discussion than preventing index corruption :)

> replace invalid U+FFFF character during indexing
> ------------------------------------------------
>
>                 Key: LUCENE-2016
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2016
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4, 2.4.1, 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.9.1, 3.0
>
>         Attachments: LUCENE-2016.patch
>
>
> If the invalid U+FFFF character is embedded in a token, it actually causes indexing to silently corrupt the index by writing duplicate terms into the terms dict. CheckIndex will catch the error, and merging will hit exceptions (I think).
> We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll just do the same with U+FFFF.
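
For illustration only, here is a minimal sketch of the replacement described in the issue: U+FFFF and unpaired surrogates are mapped to U+FFFD, and the extra checks are reached only for chars at or above U+D800, so ordinary BMP text takes a single comparison per char. The class and method names (TokenSanitizer, sanitize) are hypothetical; this is not the code in LUCENE-2016.patch and it uses no Lucene API.

```java
// Hypothetical helper, written for illustration; not part of Lucene.
final class TokenSanitizer {
    // U+FFFD, the Unicode replacement character.
    private static final char REPLACEMENT = '\uFFFD';

    private TokenSanitizer() {}

    /** Returns a copy of text with U+FFFF and unpaired surrogates replaced by U+FFFD. */
    static String sanitize(String text) {
        char[] chars = text.toCharArray();
        int len = chars.length;
        for (int i = 0; i < len; i++) {
            char c = chars[i];
            if (c < 0xD800) {
                continue; // fast path: no surrogate or U+FFFF check needed
            }
            if (c <= 0xDBFF) {
                // High surrogate: valid only if a low surrogate follows.
                if (i + 1 < len && Character.isLowSurrogate(chars[i + 1])) {
                    i++; // skip the low half of a valid pair
                } else {
                    chars[i] = REPLACEMENT;
                }
            } else if (c <= 0xDFFF) {
                // Low surrogate with no preceding high surrogate.
                chars[i] = REPLACEMENT;
            } else if (c == 0xFFFF) {
                // The invalid character this issue is about.
                chars[i] = REPLACEMENT;
            }
        }
        return new String(chars);
    }
}
```

Any per-pair check for non-BMP characters, as suggested in the quoted comment, would sit inside the valid-pair branch, which is why it should not affect BMP-only content.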