Message-ID: <4E95A611.3050100@peknet.com>
Date: Wed, 12 Oct 2011 09:37:05 -0500
From: Peter Karman
Reply-To: peter@peknet.com
To: lucy-user@incubator.apache.org
Subject: Re: [lucy-user] Different UTF-8 behaviour between perl 5.8.8 (indexes ok) and 5.10.1 (indexing fails)

goran kent wrote on 10/12/2011 08:13 AM:
> Hi,
>
> This is probably not a Lucy issue, but something I first noticed while
> using Lucy on machines with different Perl versions (using CentOS 5.x
> and CentOS 6).

Aside from any UTF-8 bugs, I highly recommend *not* using the stock
package-based Perl on any system, especially Red Hat, for your
applications. For application development it is far safer and more
predictable to compile your own version of Perl in /opt or /usr/local or
wherever. That lets you control the Perl core version and compile-time
options, not to mention CPAN module versions, without interfering with
anything that depends on the system's packaged Perl. Red Hat especially
relies on its packaged Perl and specific CPAN modules for sysadmin tasks.
I have been bitten by using the system perl, as have many others I have
talked with. Same goes for ports perl on FreeBSD, Mac OS X, and other
Linux flavors.

>
> On the machines with Perl 5.8.8 the indexer works as expected - ie, I
> have no idea what it's doing when encountering UTF-8 text (which is
> fine in my case since we don't really have to deal with UTF-8).

except it seems you *are* dealing with it...

>
> However, on machines where Perl 5.10.1 is installed (CentOS 6),
> indexing fails when bad UTF-8 (in this case some nice Japanese fare)
> is encountered:
>
> ...Malformed UTF-8 character... these are ignored OK.
>
> but then:
>
> ...Invalid UTF-8, aborting:
> lucy_ViewCB_assign_str at
> .../projects/lucy/perl/../core/Lucy/Object/CharBuf.c line 848
> at /usr/local/.../myscript line 2201
> eval {...} called at ...
>
> followed by
>
> ...Expected doc id 4 but got 5
> lucy_DocWriter_add_inverted_doc at
> .../projects/lucy/perl/../core/Lucy/Index/DocWriter.c line 97
> ...
>
> and it never recovers.
>
> Any ideas what I should be looking for?

The string that causes the problem.

> Ideally, it would be great if
> I could get perl 5.10 to behave like 5.8. I'm tempted to just strip
> out invalid crap with "iconv -c --from UTF-8 --to UTF-8", unless I can
> find a nice non-regex (for performance) cpan module to either strip
> out bad utf8 or to filter out all utf8 unconditionally.
>

If you really don't need to preserve your UTF-8 text, look at
Search::Tools::Transliterate. Search::Tools::UTF8 is also helpful for
debugging these kinds of issues.

Without seeing a reproducible test case, it sounds like Lucy is
choking appropriately on malformed UTF-8.

-- 
Peter Karman . http://peknet.com/ . peter@peknet.com