lucy-dev mailing list archives

From Andi Vajda <va...@osafoundation.org>
Subject Re: [lucy-dev] CharBuf functions taking char* arguments
Date Tue, 23 Apr 2013 06:02:01 GMT

On Apr 22, 2013, at 22:01, Marvin Humphrey <marvin@rectangular.com> wrote:

> On Mon, Apr 22, 2013 at 6:03 AM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
>> Maybe we should start to flesh out the design of immutable Strings.  Do you
>> have a concrete plan already?
> 
> To get started, we could simply duplicate CharBuf's implementation and strip
> out the mutability. :)
> 
> The unusual requirement of Clownfish Strings is that they have to wrap host
> strings when bridging the host/C border.  But it seems to me that this will
> always mean borrowing the host string's internal character array for use with
> a stack-allocated `const ZombieString*`.
> 
> Looking forward...
> 
> There are only so many ways to implement the "immutable String class" design
> pattern. :)  See Python, Ruby Symbol, various implementations of Java
> String, C#, etc.
> 
> The first question is how to handle the internal buffer.  CharBufs need to own
> their buffers; immutable Strings do not.
> 
> Multiple immutable String objects can share a single contiguous buffer, though
> the buffer must outlive all of them.  One possible implementation is to wrap
> the buffer in an object which has a refcount or otherwise fits into the GC
> regime.
> 
>    String*
>    Str_init_from_trusted_utf8_byte_buf(String *self, ByteBuf *buffer,
>                                        size_t offset, size_t size) {
>        self->buffer   = (ByteBuf*)INCREF(buffer);
>        self->content  = (char*)BB_Get_Buf(buffer) + offset;
>        self->size     = size;
>        self->hash_sum = -1;
>        return self;
>    }
> 
> This is similar to typical Java String implementations:
> 
>    public class String {
>        private char[] value;
>        private int offset;  // location in `value` where string starts
>        private int count;   // length
>        private int hash;
>        ...
>    }
> 
> There's extra memory cost to going that route, but it buys you some
> flexibility.
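
A rough sketch of the shared-buffer idea, for illustration only. `SharedBuf`, `ViewStr` and their functions are hypothetical names, not Clownfish API, and a plain counter stands in for INCREF/DECREF. Each string records which buffer it points into and keeps that buffer alive until the string is destroyed:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted byte buffer (stand-in for ByteBuf). */
typedef struct {
    char    *bytes;
    size_t   size;
    unsigned refcount;
} SharedBuf;

/* Hypothetical immutable string: a window into a SharedBuf. */
typedef struct {
    SharedBuf  *buffer;   /* kept alive by this string */
    const char *content;  /* points inside buffer->bytes */
    size_t      size;     /* length in bytes */
} ViewStr;

static SharedBuf*
SharedBuf_new(const char *bytes, size_t size) {
    SharedBuf *buf = (SharedBuf*)malloc(sizeof(SharedBuf));
    if (buf == NULL) { return NULL; }
    buf->bytes = (char*)malloc(size ? size : 1);
    if (buf->bytes == NULL) { free(buf); return NULL; }
    if (size) { memcpy(buf->bytes, bytes, size); }
    buf->size     = size;
    buf->refcount = 1;    /* the caller holds the first reference */
    return buf;
}

static void
SharedBuf_decref(SharedBuf *buf) {
    if (--buf->refcount == 0) {
        free(buf->bytes);
        free(buf);
    }
}

/* Create a string over buf[offset, offset + size) without copying. */
static ViewStr
ViewStr_new(SharedBuf *buf, size_t offset, size_t size) {
    ViewStr str;
    assert(offset <= buf->size && size <= buf->size - offset);
    buf->refcount++;
    str.buffer  = buf;
    str.content = buf->bytes + offset;
    str.size    = size;
    return str;
}

static void
ViewStr_destroy(ViewStr *self) {
    SharedBuf_decref(self->buffer);
    self->buffer  = NULL;
    self->content = NULL;
    self->size    = 0;
}
```

The extra pointer per string is the memory cost mentioned above; in exchange, substrings never copy character data.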
> 
> The second question is whether to NUL-terminate UTF-8 Strings -- and as a
> corollary, to guarantee that raw UTF-8 character data obtained from a String
> will be NUL-terminated.  This is hard.  Can we guarantee that every host
> string we wrap will be NUL-terminated?  I know Perl tries hard to keep string
> SVs NUL-terminated, but I don't imagine that every XS module everywhere
> succeeds.
> 
> The alternative is not to NUL-terminate, but to cache a NUL-terminated
> C-string representation on demand.  As an optimization, we could check to see
> whether the internal buffer is in fact NUL-terminated and use it if it
> is.
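
One way the on-demand approach might look, as a sketch under assumptions. `LazyStr` and its `readable` field are hypothetical: the field records how many bytes are known to be safe to read, so the check for an existing NUL never reads past the end of the borrowed buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string that does not promise NUL termination. */
typedef struct {
    const char *content;   /* borrowed UTF-8 bytes */
    size_t      size;      /* string length in bytes */
    size_t      readable;  /* bytes safe to read from content (>= size) */
    char       *c_str;     /* cached NUL-terminated copy, or NULL */
} LazyStr;

/* Return a NUL-terminated view of the string, or NULL on allocation
 * failure.  The copy is built at most once and then reused. */
static const char*
LazyStr_to_c_str(LazyStr *self) {
    assert(self != NULL && self->readable >= self->size);

    /* Optimization: the byte just past the string is in bounds and is
     * already NUL, so the internal buffer can be used directly. */
    if (self->readable > self->size && self->content[self->size] == '\0') {
        return self->content;
    }

    if (self->c_str == NULL) {
        char *copy = (char*)malloc(self->size + 1);
        if (copy == NULL) { return NULL; }
        if (self->size) { memcpy(copy, self->content, self->size); }
        copy[self->size] = '\0';
        self->c_str = copy;
    }
    return self->c_str;
}
```

Note that the cache is hidden mutable state inside an otherwise immutable object, so a threaded host would need to guard it.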
> 
>> How should CharBufs and Strings interact?
> 
> IMO... CharBuf's primary use case should be to build Strings: after you've
> manipulated the CharBuf to contain the desired character sequence, invoke
> To_String() to create a new String.
> 
> It probably also makes sense to add a Yield_String() method to CharBuf which
> spins off a String which steals the CharBuf's buffer and resets it to empty.
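
A minimal sketch of what "steal the buffer and reset to empty" could mean. `SketchCharBuf`, `OwnedStr` and the functions below are made-up names for illustration; the real Yield_String() would return a Clownfish object rather than a bare struct.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer. */
typedef struct {
    char  *buf;
    size_t size;
    size_t cap;
} SketchCharBuf;

/* Hypothetical immutable string that owns its bytes. */
typedef struct {
    char  *content;
    size_t size;
} OwnedStr;

/* Append len bytes; returns 1 on success, 0 on allocation failure. */
static int
SketchCB_cat(SketchCharBuf *self, const char *bytes, size_t len) {
    if (len == 0) { return 1; }
    if (self->size + len > self->cap) {
        size_t new_cap = (self->size + len) * 2;
        char  *grown   = (char*)realloc(self->buf, new_cap);
        if (grown == NULL) { return 0; }
        self->buf = grown;
        self->cap = new_cap;
    }
    memcpy(self->buf + self->size, bytes, len);
    self->size += len;
    return 1;
}

/* Hand the buffer over to a new string without copying, and leave the
 * CharBuf empty but still usable. */
static OwnedStr
SketchCB_yield_string(SketchCharBuf *self) {
    OwnedStr str;
    str.content = self->buf;
    str.size    = self->size;
    self->buf   = NULL;
    self->size  = 0;
    self->cap   = 0;
    return str;
}
```

To_String() would differ only in copying the bytes instead of transferring ownership, leaving the CharBuf's contents in place.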
> 
>> Isn't UTF-32 used in Python (among other encodings)?

Python is moving to a model where a string's internal width (1, 2, or 4 bytes per character) is chosen based on the characters it contains:
 http://www.python.org/dev/peps/pep-0393/
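
As I read the PEP, the width is picked from the largest code point in the string: 1 byte if everything fits in Latin-1, 2 bytes if everything fits in the BMP, 4 bytes otherwise. A small C sketch of that selection rule (`sketch_char_width` is an illustrative name, not CPython API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the bytes-per-character a PEP 393 style string would use for
 * the given code points: 1, 2, or 4. */
static size_t
sketch_char_width(const uint32_t *code_points, size_t len) {
    uint32_t max = 0;
    size_t   i;
    assert(code_points != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (code_points[i] > max) { max = code_points[i]; }
    }
    if (max <= 0xFF)   { return 1; }  /* Latin-1 range */
    if (max <= 0xFFFF) { return 2; }  /* Basic Multilingual Plane */
    return 4;                         /* anything beyond the BMP */
}
```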

Andi..

> 
> Yes, though it's not clear to me how often you'd encounter UTF-32 in the wild.
> 
>> I found only three users of the "steal" constructors:
>> 
>>    * S_unescape_text in Lucy::Util::Json could be changed to use
>>      a CharBuf and Cat_Char
>>    * SkipStepper_to_string could simply use CB_newf, no?
>>    * DefDocReader_fetch_doc in the C bindings could create an
>>      extra copy or we could add something like InStream#ReadString.
> 
> You're right about SkipStepper.  I'd suggest using a CharBuf with
> Yield_String() for the DefDocReader and Json use cases.
> 
> Marvin Humphrey
