incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject Re: Invalid UTF-8
Date Mon, 25 Jan 2010 17:48:13 GMT
On Mon, Jan 25, 2010 at 10:07:52AM -0600, Peter Karman wrote:

> Invalid UTF-8 sequence in
> '/opt/pij/search/sources.index.ks/seg_1/lextemp-12464-to-1353267' at
> byte 12466, kino_TextTermStepper_read_delta at
> ../core/KinoSearch/FieldType/TextType.c line 145

That's most likely an internal error.  For anyone else it would almost
certainly be one, but you're a special case: you're doing XS programming, so
there's a possibility, however slight, that you've supplied KS with a field
value that has the SVf_UTF8 flag set but contains an invalid UTF-8 sequence.
When scalars arrive with that flag set, KS trusts the source and skips the
validity check.
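
To illustrate the gap between trusting a flag and checking the bytes, here is
a minimal Python sketch.  It is not KS code: `first_invalid_utf8_offset` is a
hypothetical helper, and it assumes Python's strict decoder as the definition
of "valid UTF-8".

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid sequence, or None if valid."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected.
        return err.start
    return None


# A properly encoded string: U+00E9 becomes the two bytes C3 A9.
good = "caf\u00e9".encode("utf-8")

# 0xA9 is the copyright sign in latin1, but a bare continuation byte in UTF-8.
# Labeling these bytes as UTF-8 does not make them valid.
bad = b"ab\xa9c"

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
```

A check like this only helps if it runs on the bytes themselves; a string
that merely carries a "this is UTF-8" label passes through unexamined.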

I had a look around Search::Tools::UTF8, but got a little lost, so for what
it's worth, here's a double check using KS to back up Search::Tools::UTF8:

    sub force_valid_utf8 {
            or confess "Invalid UTF-8 byte sequence";

The UTF-8 validity check within TermStepper is new; I believe it's a
theoretically sound design, but maybe I've missed something.  It was put in
there to guard against corrupt lexicon data, due to disk error, maliciously
crafted index files, etc.  It would be bad to have Lexicons spewing corrupt

> The frustrating thing is that I just spent 2 weeks making sure my files
> are all valid UTF-8 (same old story -- legacy db with mix of latin1,
> cp1252, and UTF-8, sometimes all in the same string!), and they all pass
> my Search::Tools::UTF8 checks.
> What's odd is that the 'Invalid UTF-8 sequence' error is thrown during
> commit() rather than when I add_doc(), which makes me think that perhaps
> this isn't necessarily an encoding problem with my docs.

When I was working on this, the validity check was originally performed only
on the string difference.  However, KS uses a difference algorithm that pays
attention to bytes only, so it can split in the middle of a UTF-8 character.
Therefore, we have no choice but to scan the whole term each time.
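
The mid-character split can be reproduced with a small Python sketch.  This
assumes a plain shared-prefix delta scheme; the function names are
hypothetical and this is not the actual TermStepper code.

```python
from typing import Tuple


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def encode_delta(prev: bytes, term: bytes) -> Tuple[int, bytes]:
    """Return (overlap, suffix): the shared byte prefix length and the rest."""
    overlap = 0
    for a, b in zip(prev, term):
        if a != b:
            break
        overlap += 1
    return overlap, term[overlap:]


def read_delta(prev: bytes, overlap: int, suffix: bytes) -> bytes:
    """Rebuild the term, then validate the whole thing, not just the suffix."""
    term = prev[:overlap] + suffix
    if not is_valid_utf8(term):
        raise ValueError("Invalid UTF-8 sequence in reconstructed term")
    return term


prev = "caf\u00e9".encode("utf-8")  # 63 61 66 C3 A9
term = "caf\u00e8".encode("utf-8")  # 63 61 66 C3 A8

overlap, suffix = encode_delta(prev, term)
print(overlap, suffix)                # the prefix includes the lead byte C3
print(is_valid_utf8(suffix))          # the suffix alone is a bare continuation byte
print(read_delta(prev, overlap, suffix) == term)
```

Because the two terms share the lead byte C3, the stored suffix is just the
continuation byte A8.  Validating the suffix in isolation would reject a
perfectly good term, which is why the whole reconstructed term gets scanned.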

Does that give you an idea about the kind of things that can go awry?
Nevertheless, I can't see anything wrong in the TermStepper_read_delta C code

> I see that all text strings are forced to UTF-8 in add_doc() via
> invert_doc() and the SvPVutf8 call, so presumably they should all be UTF-8
> by the time they reach the commit()?

Yes, that's the intent, although SvPVutf8 does not perform a validity check if
the SVf_UTF8 flag is set.

It would be interesting to see a hexdump of "lextemp" starting at byte 12464.
That's where the PostingPool run starts.  The combining sequence that triggers
the exception starts two bytes later, at 12466.
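
If no hexdump tool is handy, something along these lines would do.  It is a
rough Python sketch: `hexdump` and `dump_from` are hypothetical helper names,
offsets are printed in decimal to match the error message, and the 64-byte
default length is an arbitrary choice.

```python
from typing import List


def hexdump(data: bytes, base_offset: int = 0, width: int = 16) -> List[str]:
    """Format data as lines of: decimal offset, hex bytes, printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "xx xx xx ..." columns
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%08d  %s  |%s|" % (base_offset + i, hex_part.ljust(pad), ascii_part))
    return lines


def dump_from(path: str, offset: int, length: int = 64) -> List[str]:
    """Read `length` bytes of `path` starting at `offset` and hexdump them."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return hexdump(fh.read(length), base_offset=offset)


# Example call, using the path and offset from the error message:
# for line in dump_from(
#     "/opt/pij/search/sources.index.ks/seg_1/lextemp-12464-to-1353267", 12464
# ):
#     print(line)
```

With the dump starting at 12464, the bytes that trip the check would be the
third byte on the first row and whatever follows it.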

Marvin Humphrey
