incubator-lucy-user mailing list archives

From Peter Karman <pe...@peknet.com>
Subject Re: [lucy-user] Different UTF-8 behaviour between perl 5.8.8 (indexes ok) and 5.10.1 (indexing fails)
Date Wed, 12 Oct 2011 14:37:05 GMT
goran kent wrote on 10/12/2011 08:13 AM:
> Hi,
> 
> This is probably not a Lucy issue, but something I first noticed while
> using Lucy on machines with different Perl versions (using CentOS 5.x
> and CentOS 6).

Aside from any UTF-8 bugs, I highly recommend *not* using the stock
package-based Perl on any system, especially Red Hat, for your
applications. For application development it is far safer and more
predictable to compile your own Perl in /opt or /usr/local or
wherever. That lets you control the Perl core version, the
compile-time options and the CPAN module versions without interfering
with anything that depends on the system's packaged Perl. Red Hat
especially relies on its packaged Perl and specific CPAN modules for
sysadmin tasks. I have been bitten by using the system perl, as have
many others I have talked with. The same goes for ports perl on
FreeBSD, Mac OS X, and other Linux flavors.

</soapbox>

> 
> On the machines with Perl 5.8.8 the indexer works as expected - ie, I
> have no idea what it's doing when encountering UTF-8 text (which is
> fine in my case since we don't really have to deal with UTF-8).

except it seems you *are* dealing with it...


> 
> However, on machines where Perl 5.10.1 is installed (CentOS 6),
> indexing fails when bad UTF-8 (in this case some nice Japanese fare)
> is encountered:
> 
> ...Malformed UTF-8 character... these are ignored OK.
> 
> but then:
> 
> ...Invalid UTF-8, aborting:
> lucy_ViewCB_assign_str at
> .../projects/lucy/perl/../core/Lucy/Object/CharBuf.c line 848
> at /usr/local/.../myscript line 2201
> eval {...} called at ...
> 
> followed by
> 
> ...Expected doc id 4 but got 5
> lucy_DocWriter_add_inverted_doc at
> .../projects/lucy/perl/../core/Lucy/Index/DocWriter.c line 97
> ...
> 
> and it never recovers.
> 
> Any ideas what I should be looking for?  

The string that causes the problem.
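
One way to track it down, as a minimal sketch: run each field through a
strict decode before handing it to the indexer and log the ones that fail.
This uses only the core Encode module. The helper name utf8_problem and the
%doc hash are hypothetical, and it assumes the fields are still raw bytes
(not already decoded character strings).

```perl
use strict;
use warnings;
use Encode ();

# Return undef if $bytes is well-formed UTF-8, otherwise the error
# message Encode produced for the first problem it hit.
# (utf8_problem is a hypothetical helper name, not part of Lucy.)
sub utf8_problem {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    my $ok = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 };
    return undef if $ok;
    my $err = $@;
    chomp $err;
    return $err;
}

# Hypothetical input: field name => raw bytes as read from the source.
my %doc = (
    title => "plain ascii",
    body  => "stray byte: \xFF in the middle",
);

for my $field ( sort keys %doc ) {
    my $err = utf8_problem( $doc{$field} );
    next unless defined $err;
    # Hex-dump the value so the bad bytes are visible whatever the
    # terminal encoding is.
    printf "%s: %s [%s]\n", $field, $err, unpack( 'H*', $doc{$field} );
}
```

Logging the document identifier alongside the field name should make it
easier to turn the offending input into a small test case.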

> Ideally, it would be great if
> I could get perl 5.10 to behave like 5.8.  I'm tempted to just strip
> out invalid crap with "iconv -c --from UTF-8 --to UTF-8", unless I can
> find a nice non-regex (for performance) cpan module to either strip
> out bad utf8 or to filter out all utf8 unconditionally.
> 

If you really don't need to preserve your UTF-8 text, look at
Search::Tools::Transliterate. Search::Tools::UTF8 is also helpful for
debugging these kinds of issues.
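
If pulling in another CPAN dependency is not an option, something similar
can be sketched with core Encode and tr/// (a transliteration, not a regex).
This is not the Search::Tools API, just an illustration of the two options
mentioned in the question; the function names drop_invalid_utf8 and
ascii_only are made up here, and both assume raw bytes as input.

```perl
use strict;
use warnings;
use Encode ();

# Decode raw bytes as UTF-8 and drop anything malformed, roughly what
# `iconv -c --from UTF-8 --to UTF-8` does. Returns a character string.
sub drop_invalid_utf8 {
    my ($bytes) = @_;
    # FB_DEFAULT substitutes U+FFFD for each malformed sequence
    # instead of dying.
    my $text = Encode::decode( 'UTF-8', $bytes, Encode::FB_DEFAULT );
    # Delete the substitution characters.
    # Caveat: this also removes any U+FFFD that was really in the input.
    $text =~ tr/\x{FFFD}//d;
    return $text;
}

# Drop every non-ASCII byte unconditionally, valid UTF-8 or not.
sub ascii_only {
    my ($bytes) = @_;
    ( my $ascii = $bytes ) =~ tr/\x00-\x7F//cd;
    return $ascii;
}
```

The first keeps valid non-ASCII text searchable; the second throws it away,
which is only reasonable if the index really never needs it.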

Without seeing a reproducible test case, it sounds like Lucy is
choking appropriately on malformed UTF-8.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
