lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] CharBuf functions taking char* arguments
Date Wed, 24 Apr 2013 01:04:56 GMT
On Tue, Apr 23, 2013 at 10:12 AM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
> I'm a bit worried about memory cost. If the underlying buffer is a full-blown
> object, it will use at least three words of memory (including the object
> header). The String object itself uses at least another five. More if we
> cache things like hash sum or the length in code points. Considering
> additional malloc overhead for two objects and the buffer itself, this can
> easily add up to ~100 bytes on a 64-bit system (unless we restrict string
> size to 4GB). And all that for strings which in many cases are only 10-15
> characters long!

All true.  Some Java implementations get acceptable results using this scheme,
but it's possible to make different tradeoffs.
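
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch.
The three-word and five-word object sizes come from Nick's estimate above; the
16 bytes of malloc bookkeeping per allocation is purely an assumption (it
varies by allocator), and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: per-allocation malloc bookkeeping in bytes.  Real allocators
 * differ; 16 is only a plausible figure for a 64-bit system. */
#define SKETCH_MALLOC_OVERHEAD 16

/* Estimate the total footprint of one string when the buffer is wrapped in
 * its own object: a 3-word buffer object, a 5-word String object, and the
 * raw character buffer, each allocated separately. */
size_t
sketch_string_footprint(size_t word_size, size_t payload_bytes) {
    assert(word_size > 0);
    size_t buffer_object = 3 * word_size + SKETCH_MALLOC_OVERHEAD;
    size_t string_object = 5 * word_size + SKETCH_MALLOC_OVERHEAD;
    size_t raw_buffer    = payload_bytes + SKETCH_MALLOC_OVERHEAD;
    return buffer_object + string_object + raw_buffer;
}
```

With 8-byte words and a 15-byte payload, this comes to 127 bytes under the
stated assumptions, which is in the same ballpark as the ~100 bytes mentioned
above.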

There are a bazillion Unicode string classes out there; what differentiates
this one is that it has to support Clownfish's mission to integrate with
multiple host languages.  IMO, our first priority should be to avoid creating
any public APIs which interfere with that mission.  Internal implementation
details -- such as whether buffers are wrapped in objects -- have fewer
consequences.

If we're convinced that immutable String is a good idea, would you be OK
starting with the current CharBuf implementation minus the mutability and
working from there?  Or are we not yet there?

>> The alternative is not to NUL-terminate, but to cache a NUL-terminated
>> C-string representation on demand.  As an optimization, we could check to see
>> whether the internal buffer is in fact NUL-terminated and use it if it
>> is.
>
> Or simply create a new string every time. Do we need NUL-terminated strings
> that often?

I'm persuaded that exporting a new NUL-terminated string on demand is the best
approach.

No matter how we implement String, we will continue to be able to support
NUL-terminated exports.  In contrast, supporting a cached UTF-8 representation
would paint us into a corner.

The practical consequences are:

*   String will not support Get_Ptr8().
*   Both String and CharBuf will need to support various flavors of
    Export_Raw_UTF*().
*   CharBuf should stop worrying about NUL-termination except at export time.
    (It's broken by CB_Set_Size() anyway.)
*   If CharBuf continues to provide support for Get_Ptr8(), NUL-termination
    will be the user's responsibility.
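
As a minimal sketch of what a NUL-terminated export could look like when the
internal buffer is not NUL-terminated: the struct and function names below are
hypothetical stand-ins, not the actual Clownfish API, and the layout of a real
String may well differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* HYPOTHETICAL stand-in for an immutable String: UTF-8 bytes plus a byte
 * length.  The buffer carries no NUL terminator. */
typedef struct {
    const char *ptr;   /* UTF-8 data, NOT NUL-terminated */
    size_t      size;  /* length in bytes */
} SketchString;

/* Export a freshly allocated, NUL-terminated copy of the string's content.
 * The caller owns the result and must free() it.  Returns NULL if the
 * allocation fails. */
char*
SketchStr_export_cstr(const SketchString *self) {
    assert(self != NULL);
    char *out = (char*)malloc(self->size + 1);
    if (out == NULL) { return NULL; }
    if (self->size > 0) {
        memcpy(out, self->ptr, self->size);
    }
    out[self->size] = '\0';  /* terminator is added only at export time */
    return out;
}
```

Because the terminator exists only in the exported copy, nothing about the
internal representation has to promise NUL-termination.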

>>> How should CharBufs and Strings interact?
>>
>> IMO... CharBuf's primary use case should be to build Strings: after you've
>> manipulated the CharBuf to contain the desired character sequence, invoke
>> To_String() to create a new String.
>>
>> It probably also makes sense to add a Yield_String() method to CharBuf which
>> spins off a String which steals the CharBuf's buffer and resets it to empty.
>
> It could be argued that strings should only be created via Yield_String().
> Otherwise, strings would be mutable through the underlying buffer.

CB_To_String() will have to copy the CharBuf's content into newly allocated
memory in order to avoid the mutability problem.
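
The difference between the two can be sketched roughly as follows.  All of the
type and function names here are hypothetical illustrations rather than the
real CharBuf or String code: the copying variant leaves the CharBuf usable,
while the yielding variant hands its buffer over and resets the CharBuf to
empty.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* HYPOTHETICAL simplified layouts, for illustration only. */
typedef struct { char *ptr; size_t size; size_t cap; } SketchCharBuf;
typedef struct { char *ptr; size_t size; } SketchString;

/* To_String-style: copy the content into newly allocated memory, so later
 * mutation of the CharBuf cannot affect the String. */
SketchString*
SketchCB_to_string(const SketchCharBuf *self) {
    assert(self != NULL);
    SketchString *str = (SketchString*)malloc(sizeof(SketchString));
    if (str == NULL) { return NULL; }
    /* Allocate at least one byte so an empty string still owns a buffer. */
    str->ptr = (char*)malloc(self->size > 0 ? self->size : 1);
    if (str->ptr == NULL) { free(str); return NULL; }
    if (self->size > 0) { memcpy(str->ptr, self->ptr, self->size); }
    str->size = self->size;
    return str;
}

/* Yield_String-style: the String steals the buffer; the CharBuf is reset to
 * empty, so no mutable alias to the String's memory remains. */
SketchString*
SketchCB_yield_string(SketchCharBuf *self) {
    assert(self != NULL);
    SketchString *str = (SketchString*)malloc(sizeof(SketchString));
    if (str == NULL) { return NULL; }
    str->ptr  = self->ptr;
    str->size = self->size;
    self->ptr  = NULL;
    self->size = 0;
    self->cap  = 0;
    return str;
}

/* Release a String created by either function above. */
void
SketchStr_destroy(SketchString *str) {
    if (str != NULL) { free(str->ptr); free(str); }
}
```

Either way, the resulting String never shares memory with a CharBuf that can
still be modified.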

Marvin Humphrey
