Date: Wed, 12 Oct 2011 10:47:02 -0700
From: Marvin Humphrey
To: lucy-user@incubator.apache.org
Cc: peter@peknet.com
Message-ID: <20111012174702.GA9586@rectangular.com>
References: <4E95A611.3050100@peknet.com>
Subject: Re: [lucy-user] Different UTF-8 behaviour between perl 5.8.8 (indexes ok) and 5.10.1 (indexing fails)

On Wed, Oct 12, 2011 at 05:57:33PM +0200, goran kent wrote:
> > It sounds like, without seeing a reproduce-able test case, that Lucy is
> > choking appropriately on malformed UTF-8.
>
> Absolutely.  What's interesting is that the same Lucy code does not
> choke on the other machines with the older Perl.

Lucy trusts that incoming data it has received from Perl is well-formed.
(Technically, it assumes that string data obtained via the XS routine
SvPVutf8() is well-formed UTF-8, notwithstanding the difference between
Perl's loose internal representation and the Unicode standard for UTF-8.)
We could add an index-time validity check, but that would slow down
indexing.

At search-time, though, Lucy is reading from the file system rather than
receiving data from Perl -- and data from the file system cannot be trusted.
Therefore, Lucy always performs validity checks when reading what is
ostensibly UTF-8 data out of an existing index.

I don't know of a mechanism whereby Lucy's behavior would change between
different versions of Perl.

In any case, having invalid UTF-8 in your Perl scalars is bad news -- it can
do things like crash the regex engine.  It will also lead to corrupt Lucy
indexes that fail the search-time UTF-8 validity check.

How are you getting raw data into Perl?

> Anyway, I like the idea of rolling my own perl to be absolutely sure
> of coherence across my machines.

+1

Marvin Humphrey
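
For readers wondering what a strict UTF-8 validity check of the kind
described above involves, here is a minimal sketch in C. It is not Lucy's
actual code: the function name utf8_is_valid and its signature are
hypothetical, chosen only for illustration. It follows the Unicode
standard's definition of well-formed UTF-8, so it rejects overlong forms,
encoded surrogates, and code points above U+10FFFF, some of which Perl's
looser internal representation may tolerate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static bool is_cont(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper, NOT part of Lucy's API.
 * Returns true if the len bytes starting at s are well-formed UTF-8
 * according to the Unicode standard's table of valid byte sequences. */
bool utf8_is_valid(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b0 = s[i];
        size_t need;                  /* continuation bytes expected */
        unsigned char lo = 0x80;      /* allowed range for the 2nd byte */
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F) {             /* ASCII */
            i++;
            continue;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;    /* reject overlong 3-byte forms */
            if (b0 == 0xED) hi = 0x9F;    /* reject surrogates D800..DFFF */
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;    /* reject overlong 4-byte forms */
            if (b0 == 0xF4) hi = 0x8F;    /* reject values above U+10FFFF */
        } else {
            return false;             /* 80..C1 and F5..FF never lead */
        }

        if (len - i - 1 < need) {
            return false;             /* sequence truncated at end of data */
        }
        if (s[i + 1] < lo || s[i + 1] > hi) {
            return false;
        }
        for (size_t k = 2; k <= need; k++) {
            if (!is_cont(s[i + k])) {
                return false;
            }
        }
        i += need + 1;
    }
    return true;
}
```

A check like this is a single linear pass over the data. That is cheap for
one string, but it adds up if it runs on every field of every document at
index time, which is the trade-off mentioned above.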