lucy-user mailing list archives

From Marvin Humphrey <>
Subject Re: [lucy-user] Index state during merges
Date Fri, 04 Nov 2011 01:15:18 GMT
On Wed, Nov 02, 2011 at 11:59:34AM -0700, Nathan Kurz wrote:
> On Wed, Nov 2, 2011 at 11:29 AM, Marvin Humphrey <> wrote:
> > What do you mean by "broken source index"?  Corrupt because bad UTF-8 snuck
> > in, and now it refuses to be read?
> >
> > Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
> > I don't like making everybody pay this penalty -- small though it is --
> > because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> > On the other hand, I don't like that once a single bad UTF-8 sequence makes it
> > through a commit, the index is irretrievably corrupt -- and you only discover
> > that after the damage is done.
> This seems like good practice.  I don't know the exact routine, but
> the performance impact has to be minimal.  

It turns out that the UTF-8 validity checking has been enabled after all -- for
several years now. :P

For the record, I benchmarked disabling it, and the indexing benchmark sped up
by about half a percent.  That's pretty dang small, especially since the
indexing benchmarker uses an unrealistically simple Analyzer.

> ps.  I came across this possibly relevant discussion of a Perl
> 'feature' I wasn't aware of:

The patch to disable the sanity checking, pasted below my sig, involves
changing a method call from "Assign_Str" (which performs a validity check) to
"Assign_Trusted_Str" (which trusts that the string is valid and skips the
check).  I deliberately gave the unsafe method a more cumbersome and
unambiguous name so that anybody invoking the "wrong" method would
make their error in the "safe" direction -- think of it as "fail-safe"
interface design applied to method naming.
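
To make the naming idea concrete, here is a minimal C sketch.  The names
(StrView, StrView_assign_str, StrView_assign_trusted_str, utf8_valid) are made
up for illustration and are not Lucy's actual code; the only point is that the
short, obvious name validates and the long, explicit one does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view type: points at bytes it does not own. */
typedef struct {
    const char *ptr;
    size_t      len;
} StrView;

/* Return true if buf[0..len) is well-formed UTF-8: no overlong forms,
 * no surrogates, nothing above U+10FFFF, no truncated sequences. */
static bool
utf8_valid(const char *buf, size_t len) {
    const unsigned char *s = (const unsigned char*)buf;
    size_t i = 0;
    while (i < len) {
        unsigned char c  = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range for 1st continuation */
        size_t need;                        /* continuation bytes expected */
        if (c <= 0x7F)                   { i++; continue; }
        else if (c >= 0xC2 && c <= 0xDF) { need = 1; }
        else if (c == 0xE0)              { need = 2; lo = 0xA0; }
        else if (c == 0xED)              { need = 2; hi = 0x9F; }
        else if (c >= 0xE1 && c <= 0xEF) { need = 2; }
        else if (c == 0xF0)              { need = 3; lo = 0x90; }
        else if (c >= 0xF1 && c <= 0xF3) { need = 3; }
        else if (c == 0xF4)              { need = 3; hi = 0x8F; }
        else                             { return false; }
        if (len - i - 1 < need)          { return false; } /* truncated */
        if (s[i + 1] < lo || s[i + 1] > hi) { return false; }
        for (size_t k = 2; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) { return false; }
        }
        i += need + 1;
    }
    return true;
}

/* Short, obvious name: validates, and leaves the view untouched on
 * bad input. */
static bool
StrView_assign_str(StrView *self, const char *ptr, size_t len) {
    if (!utf8_valid(ptr, len)) { return false; }
    self->ptr = ptr;
    self->len = len;
    return true;
}

/* Long, explicit name: the caller vouches for the bytes, so no UTF-8
 * check is performed. */
static void
StrView_assign_trusted_str(StrView *self, const char *ptr, size_t len) {
    assert(ptr != NULL || len == 0);
    self->ptr = ptr;
    self->len = len;
}
```

Somebody who reaches for the shorter name by reflex gets the check; skipping
it requires typing out "trusted" and, presumably, thinking about whether that
trust is warranted.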

The primary influence on this design was the negative example set by Perl's
lousy UTF-8 input interface, as detailed in that Jeremy Zawodny blog post
(which I've read before).  I wanted to do the opposite of this:

    # Short, obvious name is unsafe -- no sanity checking.
    open( my $fh, '<:utf8', $path ) or die $!;
    # Long, obscure incantation is safe -- sanity checking is enabled.
    open( my $fh, '<:encoding(UTF-8)', $path ) or die $!;

Marvin Humphrey

Index: ../perl/xs/Lucy/Index/Inverter.c
--- ../perl/xs/Lucy/Index/Inverter.c    (revision 1196798)
+++ ../perl/xs/Lucy/Index/Inverter.c    (working copy)
@@ -102,7 +102,7 @@
                     char *val_ptr = SvPVutf8(value_sv, val_len);
                     lucy_ViewCharBuf *value
                         = (lucy_ViewCharBuf*)inv_entry->value;
-                    Lucy_ViewCB_Assign_Str(value, val_ptr, val_len);
+                    Lucy_ViewCB_Assign_Trusted_Str(value, val_ptr, val_len);
             case lucy_FType_BLOB: {
