lucy-user mailing list archives

From Nathan Kurz <>
Subject Re: [lucy-user] Index state during merges
Date Wed, 02 Nov 2011 18:59:34 GMT
On Wed, Nov 2, 2011 at 11:29 AM, Marvin Humphrey <> wrote:
> What do you mean by "broken source index"?  Corrupt because bad UTF-8 snuck
> in, and now it refuses to be read?
> Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
> I don't like making everybody pay this penalty -- small though it is --
> because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> On the other hand, I don't like that once a single bad UTF-8 sequence makes it
> through a commit, the index is irretrievably corrupt -- and you only discover
> that after the damage is done.

This seems like good practice.  I don't know the exact routine, but
the performance impact has to be minimal.  If the string is already in
processor cache, a single extra pass through it will be almost free,
and I can't believe this step is CPU limited.  If you want, you could
make it Safe by default and Risky by explicit option, but you might
test first to be sure you even need the option.
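
To make the "Safe by default, Risky by explicit option" idea concrete, here is a rough sketch in Python rather than Lucy's own C or Perl. The names is_valid_utf8 and check_field are made up for illustration and are not part of any Lucy API; the only real call is Python's strict UTF-8 decoder, which rejects malformed sequences in one pass over the bytes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    Relies on Python's strict decoder, which makes a single pass over
    the bytes and raises on the first malformed sequence.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def check_field(data: bytes, validate: bool = True) -> bytes:
    """Hypothetical pre-commit check for one incoming field value.

    validate=True is the Safe default; passing validate=False is the
    explicit Risky opt-out that skips the scan entirely.
    """
    if validate and not is_valid_utf8(data):
        raise ValueError("field contains malformed UTF-8; refusing to index it")
    return data
```

With this shape, a bad sequence is rejected before it can reach a commit, and a caller who trusts their input can opt out with check_field(value, validate=False).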


ps.  I came across this possibly relevant discussion of a Perl
'feature' I wasn't aware of:
