incubator-lucy-user mailing list archives

From goran kent <gorank...@gmail.com>
Subject [lucy-user] Different UTF-8 behaviour between perl 5.8.8 (indexes ok) and 5.10.1 (indexing fails)
Date Wed, 12 Oct 2011 13:13:36 GMT
Hi,

This is probably not a Lucy issue, but something I first noticed while
using Lucy on machines with different Perl versions (using CentOS 5.x
and CentOS 6).

On the machines with Perl 5.8.8 the indexer works as expected, i.e. it
indexes OK, although I have no idea what it's doing when it encounters
UTF-8 text (which is fine in my case since we don't really have to deal
with UTF-8).

However, on machines where Perl 5.10.1 is installed (CentOS 6),
indexing fails when bad UTF-8 (in this case some nice Japanese fare)
is encountered:

...Malformed UTF-8 character... these are ignored OK.

but then:

...Invalid UTF-8, aborting:
lucy_ViewCB_assign_str at
.../projects/lucy/perl/../core/Lucy/Object/CharBuf.c line 848
at /usr/local/.../myscript line 2201
eval {...} called at ...

followed by

...Expected doc id 4 but got 5
lucy_DocWriter_add_inverted_doc at
.../projects/lucy/perl/../core/Lucy/Index/DocWriter.c line 97
...

and it never recovers.

Any ideas what I should be looking for?  Ideally, it would be great if
I could get Perl 5.10 to behave like 5.8.  I'm tempted to just strip
out invalid crap with "iconv -c --from UTF-8 --to UTF-8", unless I can
find a nice non-regex (for performance) CPAN module to either strip
out bad UTF-8 or to filter out all UTF-8 unconditionally.
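
For reference, the "strip out bad UTF-8" option can probably be done in
Perl itself with the core Encode module, without shelling out to iconv.
This is only a minimal sketch: strip_invalid_utf8 is a made-up name, it
assumes the text arrives as raw bytes, and it relies on Encode accepting
a code reference as the CHECK argument (documented as of Encode 2.12).
Whether the result is what Lucy wants to be handed is an assumption I
have not verified.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Lucy or Encode): takes raw bytes and
# returns a Perl character string with malformed UTF-8 bytes dropped.
sub strip_invalid_utf8 {
    my ($bytes) = @_;
    # With a code reference as CHECK, Encode calls it for each byte it
    # cannot decode and splices in whatever it returns. Returning the
    # empty string drops the byte instead of dying or warning.
    return Encode::decode('UTF-8', $bytes, sub { '' });
}

# Usage: two stray bytes (0xFF, 0xFE) in the middle of ASCII text.
my $clean = strip_invalid_utf8("abc\xFF\xFEdef");
```

Valid multi-byte sequences (the Japanese text, for instance) should pass
through untouched, so this is the "strip bad UTF-8" variant rather than
the "filter out all UTF-8" one.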

sigh
