incubator-lucy-dev mailing list archives

From Peter Karman <>
Subject Invalid UTF-8
Date Mon, 25 Jan 2010 16:07:52 GMT
Now I'm seeing this error against the latest svn trunk:

Invalid UTF-8 sequence in
'/opt/pij/search/sources.index.ks/seg_1/lextemp-12464-to-1353267' at
byte 12466, kino_TextTermStepper_read_delta at
../core/KinoSearch/FieldType/TextType.c line 145

The frustrating thing is that I just spent 2 weeks making sure my files
are all valid UTF-8 (same old story -- legacy db with mix of latin1,
cp1252, and UTF-8, sometimes all in the same string!), and they all pass
my Search::Tools::UTF8 checks.

What's odd is that the 'Invalid UTF-8 sequence' error is thrown during
commit() rather than when I add_doc(), which makes me think that perhaps
this isn't necessarily an encoding problem with my docs. I see that all
text strings are forced to UTF-8 in add_doc() via invert_doc() and the
SvPVutf8 call, so presumably they should all be UTF-8 by the time they
reach the commit()?
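
One way to cross-check the input independently of the Perl-level checks is to run the raw bytes of each document through a strict validator that reports the offset of the first bad sequence. What follows is only a minimal sketch in C under stated assumptions: `first_invalid_utf8` is a hypothetical helper name, it is not part of KinoSearch or Lucy, and it is not the check performed in TextType.c.

```c
#include <stddef.h>

/* Hypothetical helper (not a KinoSearch/Lucy function).
 * Returns the offset of the first byte that begins an invalid UTF-8
 * sequence, or -1 if the whole buffer is well-formed.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * encodings, surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
long first_invalid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after c */
        size_t k;
        unsigned long min;  /* smallest code point legal at this length */
        unsigned long cp;   /* decoded code point */

        if (c < 0x80) {                         /* plain ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {    /* 2-byte lead */
            need = 1; min = 0x80;    cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {        /* 3-byte lead */
            need = 2; min = 0x800;   cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {    /* 4-byte lead */
            need = 3; min = 0x10000; cp = c & 0x07;
        } else {
            return (long)i;  /* stray continuation, 0xC0/0xC1, or > 0xF4 */
        }

        if (len - i <= need) {
            return (long)i;  /* sequence runs past the end of the buffer */
        }

        for (k = 1; k <= need; k++) {
            unsigned char t = buf[i + k];
            if ((t & 0xC0) != 0x80) {
                return (long)i;  /* expected a 10xxxxxx continuation byte */
            }
            cp = (cp << 6) | (unsigned long)(t & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return (long)i;  /* overlong, out of range, or surrogate */
        }

        i += need + 1;
    }

    return -1;
}
```

If a check like this passes on every source document, that would support the suspicion that the bad bytes are introduced later, for example while the temporary lexicon file is written or read back, rather than in the docs themselves. Note also that, to the best of my understanding, SvPVutf8 only upgrades a string's internal representation and does not re-validate a string already flagged as UTF-8, so a malformed but flagged string could in principle still get through add_doc().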

Peter Karman