incubator-lucy-dev mailing list archives

From Marvin Humphrey
Subject String handling
Date Fri, 05 Sep 2008 00:09:36 GMT

Problem #1: C-style NUL-terminated strings are hateful.  They present
innumerable opportunities for buffer-overrun security holes and
segfaults.  A bare char* tells you neither the encoding of the data
nor the allocated capacity of the buffer.

The solution is to pass around string data using a struct-based string  
class that tightly associates character data with the object's length  
and allocated capacity.  This presents some inconveniences in terms of  
initializing and argument passing, but in a library as large as Lucy,  
it is justified on the security rationale alone.
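
To make the idea concrete, here is a minimal sketch of such a struct, with an append routine that checks capacity before it writes.  The names StrBuf and strbuf_* are hypothetical, chosen for illustration only; this is not the KinoSearch or Lucy API, and the doubling growth policy is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration only: StrBuf and strbuf_* are not Lucy's API. */
typedef struct StrBuf {
    char   *ptr;   /* character data; not required to be NUL-terminated */
    size_t  size;  /* bytes currently in use */
    size_t  cap;   /* bytes allocated */
} StrBuf;

/* Allocate an empty buffer with room for `cap` bytes (at least 1).
 * Returns NULL if allocation fails. */
static StrBuf*
strbuf_new(size_t cap)
{
    StrBuf *self = (StrBuf*)malloc(sizeof(StrBuf));
    if (self == NULL) { return NULL; }
    if (cap == 0) { cap = 1; }
    self->ptr = (char*)malloc(cap);
    if (self->ptr == NULL) { free(self); return NULL; }
    self->size = 0;
    self->cap  = cap;
    return self;
}

/* Append `len` bytes from `src`, growing the buffer first if needed.
 * Returns 0 on success, -1 on failure (the buffer is left unchanged). */
static int
strbuf_cat(StrBuf *self, const char *src, size_t len)
{
    assert(self != NULL && self->size <= self->cap);
    if (len > (size_t)-1 - self->size) { return -1; }  /* size_t overflow */
    size_t needed = self->size + len;
    if (needed > self->cap) {
        /* Assumed growth policy: double until the data fits. */
        size_t new_cap = self->cap;
        while (new_cap < needed) {
            if (new_cap > (size_t)-1 / 2) { new_cap = needed; break; }
            new_cap *= 2;
        }
        char *new_ptr = (char*)realloc(self->ptr, new_cap);
        if (new_ptr == NULL) { return -1; }
        self->ptr = new_ptr;
        self->cap = new_cap;
    }
    if (len > 0) { memcpy(self->ptr + self->size, src, len); }
    self->size = needed;
    return 0;
}

/* Release the buffer and the struct itself. */
static void
strbuf_destroy(StrBuf *self)
{
    if (self != NULL) { free(self->ptr); free(self); }
}
```

Because every write goes through a routine that can see both size and cap, the overrun check lives in one place instead of at every call site.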

Problem #2: The various host languages that Lucy will be bound to have
different ways of representing string data.  In Java, strings use
UTF-16; in Perl, strings are either UTF-8 or Latin-1, but we can force
them to be UTF-8; and so on.

I believe the solution to this is to create an abstract CharBuf base  
class that represents a Unicode string, plus CharBuf8 and CharBuf16  
subclasses.  As much as possible, we should have Lucy internals deal
with the opaque string parent class.
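
As a rough sketch of how that could look in C, the parent struct below carries only a method table, and each subclass supplies its own encoding-specific implementation.  All of the sk_ names are hypothetical and exist only for this illustration, and the single "length in code points" method stands in for whatever interface the real class ends up with.  Both implementations assume well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: the sk_ names are illustrative, not Lucy's API. */
typedef struct sk_CharBuf sk_CharBuf;

/* Method table shared by all instances of one subclass. */
typedef struct sk_CharBufVTable {
    size_t (*length)(const sk_CharBuf *self);  /* length in code points */
} sk_CharBufVTable;

/* Abstract parent: callers hold this type and never see the encoding. */
struct sk_CharBuf {
    const sk_CharBufVTable *vtable;
};

typedef struct sk_CharBuf8 {
    sk_CharBuf     base;  /* first member, so a cast to sk_CharBuf* is valid */
    const uint8_t *ptr;
    size_t         size;  /* bytes */
} sk_CharBuf8;

typedef struct sk_CharBuf16 {
    sk_CharBuf      base;
    const uint16_t *ptr;
    size_t          size;  /* 16-bit code units */
} sk_CharBuf16;

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx). */
static size_t
sk_cb8_length(const sk_CharBuf *self)
{
    const sk_CharBuf8 *cb = (const sk_CharBuf8*)self;
    size_t count = 0;
    for (size_t i = 0; i < cb->size; i++) {
        if ((cb->ptr[i] & 0xC0) != 0x80) { count++; }
    }
    return count;
}

/* UTF-16: count every code unit that is not a low surrogate. */
static size_t
sk_cb16_length(const sk_CharBuf *self)
{
    const sk_CharBuf16 *cb = (const sk_CharBuf16*)self;
    size_t count = 0;
    for (size_t i = 0; i < cb->size; i++) {
        if (cb->ptr[i] < 0xDC00 || cb->ptr[i] > 0xDFFF) { count++; }
    }
    return count;
}

static const sk_CharBufVTable SK_CHARBUF8  = { sk_cb8_length };
static const sk_CharBufVTable SK_CHARBUF16 = { sk_cb16_length };

/* Encoding-agnostic entry point: dispatches through the method table. */
static size_t
sk_CB_length(const sk_CharBuf *self)
{
    assert(self != NULL && self->vtable != NULL);
    return self->vtable->length(self);
}
```

Code that only ever calls sk_CB_length() compiles once and works for either subclass, which is the property we want for most of the internals.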

The KinoSearch code base currently contains a class named "CharBuf"  
which is basically CharBuf8.  I've been going through and modifying  
sections which accessed the char* internal string directly and  
changing them to use higher level abstraction; I haven't run into any  
showstoppers yet.

There are 5 main ways that strings will be used in Lucy:

   * Analysis/Tokenizing/Indexing.
   * File path manipulation.
   * Query parsing and construction.
   * Field name specification.
   * Error messages.

Of those list items, the only one that requires intimate interaction
with the low-level encoding of the string for efficiency reasons is
Analysis/Tokenizing/Indexing.

The others often involve a lot of interaction with the host.  We want
to avoid malloc'ing new temp copies of strings whenever possible in
the binding code; in the KS binding code, I've avoided that by
allocating a "ViewCharBuf" on the stack and then pointing it at the
Perl scalar's string buffer.  Here's the routine used to initialize a
ViewCharBuf:

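   /* Build a stack-allocated view over an existing buffer; no malloc. */
   static CHY_INLINE lucy_ViewCharBuf
   lucy_VCB_make_str(char *ptr, size_t size)
   {
       lucy_ViewCharBuf retval;
       retval.ref.count = 1;
       retval._     = (lucy_VirtualTable*)&LUCY_VIEWCHARBUF;
       retval.cap   = 0;     /* presumably 0 because the view owns no memory */
       retval.size  = size;  /* length in bytes */
       retval.ptr   = ptr;   /* borrowed, not copied */
       return retval;
   }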

It gets used like so:

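   /* Return type assumed: Lucy_Hash_Fetch is taken to return lucy_Obj*. */
   lucy_Obj*
   lucy_XSBind_hash_fetch(lucy_Hash *hash, SV *key_sv)
   {
       size_t size;
       char *ptr = SvPVutf8(key_sv, size);  /* UTF-8 bytes; sets size */
       lucy_ViewCharBuf key = lucy_VCB_make_str(ptr, size);
       return Lucy_Hash_Fetch(hash, (lucy_CharBuf*)&key);
   }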

I think this approach should work for other host languages as well.

Marvin Humphrey
Rectangular Research
