lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: [lucy-dev] Proposal for implementation of immutable strings
Date Fri, 26 Apr 2013 22:44:57 GMT
On Fri, Apr 26, 2013 at 4:10 AM, Nick Wellnhofer <> wrote:
> On 26/04/2013 01:48, Marvin Humphrey wrote:
>> Substrings of zombie strings are dangerous, because the buffer belonging to
>> the parent object may not outlive the substring.
> Right. This even applies to zombie strings alone. Consider assigning a
> zombie string to a member var. This is currently done with
>     self->var = CB_Clone(zstr);
> With immutable strings, we don't have to create a copy and can simply write
>     self->var = (String*)INCREF(zstr);
> This will break with zombie strings.


>> User-defined procedures will encounter ZombieStrings via wrapped callbacks
>> -- if a parameter is `String*` they'll get a real String with copied
>> content from host argument, but if it's `const String*`, they'll get a
>> ZombieString* wrapping the host string content.
> That's a great solution. But this isn't implemented yet, right?

The current type mapping code differentiates between `const CharBuf*` and
`CharBuf*` (though I have discovered today that the implementation is buggy
and wraps host strings more often than it should).

However, as you note, INCREF still has to be fixed because it casts away
const-ness.  In addition, all of our dynamic method invocations, including
Inc_RefCount(), cast away const-ness.  Here's how we change that:

--- a/clownfish/compiler/src/CFCBindMethod.c
+++ b/clownfish/compiler/src/CFCBindMethod.c
@@ -114,7 +114,7 @@ S_virtual_method_def(CFCMethod *method, CFCClass *klass) {
     const char pattern[] =
         "extern %sVISIBLE size_t %s;\n"
         "static CHY_INLINE %s\n"
-        "%s(const %s *self%s) {\n"
+        "%s(%s *self%s) {\n"
         "    const %s method = (%s)cfish_obj_method(self, %s);\n"
         "    %smethod((%s*)self%s);\n"

Try that patch and then compile Lucy after setting the LUCY_DEBUG environment
variable in order to see warnings from methods being invoked on `const
CharBuf*` objects.

    export LUCY_DEBUG=""
    perl Build.PL
    ./Build code

> It would also require that String methods can be invoked on const Strings
> (like const member functions in C++). Would this work without further
> changes?

Syntactically, it's not that hard -- we just add `const` to the invocant for
const methods.

    public int32_t
    Hash_Sum(const Obj *self);  // <---- note `const Obj*` invocant

Const methods can be pretty annoying though.  If we were actually to make
Hash_Sum() const, that would prevent subclass implementations which cache hash
sums in the object lazily.

>> Unless we want to require that `SubString` operate on non-const String*
>> (like we will for `Inc_RefCount`), ZStr_SubString() will have to return a
>> fully independent String object which owns its own buffer.
> That shouldn't be a problem. BTW, the INCREF macro should be changed so it
> doesn't work with const objects, see example above.

I think something like this can work on systems which support GNU extensions:

#define CFISH_INCREF(_self) \
    ({ void *_vself = _self; lucy_Obj_incref((lucy_Obj*)_vself); })

The `void *_vself = _self;` construct causes a compiler warning under GCC and
an error under G++ if `_self` is a pointer to someting `const`.

>>> For zombie strings, it's assumed that they don't have to care about the
>>> lifetime of the character buffer. So there are two cases left out:
>>> stack-allocated strings that own a buffer
>> Can we make that an invalid state and avoid it?
> Yes, we'll simply make the assumption that zombie strings never own a
> buffer.


>> I think that's a good approach, but I have an ulterior motive -- I'm hoping
>> that ultimately we end up with one class handling all encodings, a la
>> <>.
> That would only require a member var to store the encoding. But I don't
> quite understand the rationale behind this. Does it have to do with the
> Python bindings?

Nothing to do with Python bindings.  It's that we have multiple behaviors
which we've been trying to account for via inheritance: encoding and
"zombie"-ness.  The fact that we'd need not only StringUTF8 and StringUTF16,
but also ZombieStringUTF8 and ZombieStringUTF16 suggests that we haven't got
the abstraction quite right.

Here's an excerpt from the GoF _Design Patterns_ book, in the "Motivation" for
the Bridge Pattern, which describes a similar situation:

    Consider the implementation of a portable Window abstraction in a user
    interface toolkit. This abstraction should enable us to write applications
    that work on both the X Window System and IBM's Presentation Manager (PM),
    for example. Using inheritance, we could define an abstract class Window
    and subclasses XWindow and PMWindow that implement the Window interface
    for the different platforms. But this approach has two drawbacks:

    1.  It's inconvenient to extend the Window abstraction to cover different
        kinds of windows or new platforms. Imagine an IconWindow subclass of
        Window that specializes the Window abstraction for icons. To support
        IconWindows for both platforms, we have to implement two new classes,
        XIconWindow and PMIconWindow. Worse, we'll have to define two classes
        for every kind of window. Supporting a third platform requires yet
        another new Window subclass for every kind of window.

>> PS: Is it now true that ZombieStrings can only ever be allocated on the
>>     stack, rather than in static memory?  Because if that's the case, I'd
>>     favor the name StackString instead.
> In ZombieKeyedHash, they're allocated from a MemPool. Otherwise, all
> allocations seem to be from the stack.

Ah, thank you for reminding me.

I think supporting arbitrary object allocators a la ZombieKeyedHash is
probably an ill-advised hack by someone who knows too much about the internals
rather than an important feature that Clownfish needs to support.

Marvin Humphrey

View raw message