lucy-dev mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] CharBuf functions taking char* arguments
Date Tue, 23 Apr 2013 17:12:13 GMT
On Apr 23, 2013, at 07:01, Marvin Humphrey <marvin@rectangular.com> wrote:

> Multiple immutable String objects can share a single contiguous buffer, though
> the buffer must outlive all of them.  One possible implementation is to wrap
> the buffer in an object which has a refcount or otherwise fits into the GC
> regime.

...

> There's extra memory cost to going that route, but it buys you some
> flexibility.

I'm a bit worried about memory cost. If the underlying buffer is a full-blown object, it will
use at least three words of memory (including the object header). The String object itself
uses at least another five, and more if we cache things like the hash sum or the length in
code points. Considering the additional malloc overhead for two objects and the buffer itself,
this can easily add up to ~100 bytes on a 64-bit system (unless we restrict string size to 4 GB).
And all that for strings which in many cases are only 10-15 characters long!
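
To make the estimate above concrete, here is a rough back-of-the-envelope sketch. The word
counts are the ones given above; the per-allocation malloc overhead is a parameter because it
depends on the allocator, and the 16 bytes used in the usage note is only an assumption. The
function and constant names are made up for illustration.

```c
#include <stddef.h>

/* Word counts taken from the estimate above; everything else is assumed. */
enum {
    BUF_OBJ_WORDS = 3,  /* buffer wrapper object, including object header */
    STRING_WORDS  = 5,  /* String object, without any cached fields */
    NUM_ALLOCS    = 3   /* String object, buffer object, character data */
};

/* Fixed per-string overhead in bytes, not counting the characters. */
static size_t
fixed_overhead(size_t word_size, size_t malloc_overhead) {
    return (BUF_OBJ_WORDS + STRING_WORDS) * word_size
           + NUM_ALLOCS * malloc_overhead;
}
```

With 8-byte words and an assumed 16 bytes of allocator overhead per allocation,
fixed_overhead(8, 16) gives 112 bytes before a single character is stored.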

> The second question is whether to NUL-terminate UTF-8 Strings -- and as a
> corollary, to guarantee that raw UTF-8 character data obtained from a String
> will be NUL-terminated.  This is hard.  Can we guarantee that every host
> string we wrap will be NUL-terminated?  I know Perl tries hard to keep string
> SVs NUL-terminated, but I don't imagine that every XS module everywhere
> succeeds.

Also, if we want to support substrings with a shared buffer, it's impossible to NUL-terminate
them.
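
A minimal sketch of why that is, assuming a substring is just a pointer and a size into a
buffer owned elsewhere. StrView, view_substring and byte_after are hypothetical names, not
existing Lucy or Clownfish API.

```c
#include <stddef.h>

/* Hypothetical substring view into a shared buffer. */
typedef struct {
    const char *ptr;   /* points into a buffer owned by someone else */
    size_t      size;  /* length in bytes */
} StrView;

static StrView
view_substring(const char *buf, size_t offset, size_t size) {
    StrView view = { buf + offset, size };
    return view;
}

/* Byte following the view. The caller must know that this position is
 * still inside the shared buffer, otherwise the read is out of bounds. */
static char
byte_after(StrView view) {
    return view.ptr[view.size];
}
```

For a view of "hello" inside "hello world", the byte after the view is the space, and we
can't overwrite it with NUL without corrupting every other string that shares the buffer.
Only a view that runs to the end of a NUL-terminated buffer happens to be terminated.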

BTW, this problem isn't restricted to UTF-8. UTF-16 strings also have to be NUL-terminated
if we want to pass them to the Windows file system API, for example.

> The alternative is not to NUL-terminate, but to cache a NUL-terminated
> C-string representation on demand.  As an optimization, we could check to see
> whether the internal buffer is in fact NUL-terminated and use it if it
> is.

Or simply create a new string every time. Do we need NUL-terminated strings that often?
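
Roughly what I have in mind, as a sketch only. to_c_string is a hypothetical helper that
takes raw bytes and a size, not an existing function, and the caller is assumed to free
the result.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return a freshly allocated, NUL-terminated copy of `size` bytes at
 * `data`. Returns NULL if the allocation fails. The caller owns the
 * result and must free() it. */
static char*
to_c_string(const char *data, size_t size) {
    if (size == SIZE_MAX) { return NULL; }  /* size + 1 would overflow */
    char *copy = (char*)malloc(size + 1);
    if (copy == NULL) { return NULL; }
    memcpy(copy, data, size);
    copy[size] = '\0';
    return copy;
}
```

This costs one allocation and one copy per call, but needs no cache field in the String and
works the same for substrings. Note that a string containing embedded NUL bytes would still
look truncated to C code.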

>> How should CharBufs and Strings interact?
> 
> IMO... CharBuf's primary use case should be to build Strings: after you've
> manipulated the CharBuf to contain the desired character sequence, invoke
> To_String() to create a new String.
> 
> It probably also makes sense to add a Yield_String() method to CharBuf which
> spins off a String which steals the CharBuf's buffer and resets it to empty.

It could be argued that strings should only be created via Yield_String(). Otherwise, strings
would be mutable through the underlying buffer.
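
A sketch of the ownership transfer, with made-up struct layouts that are much simpler than
the real objects would be. The concern only applies if To_String() were to share the buffer
instead of copying it; that is an assumption here.

```c
#include <stddef.h>

/* Hypothetical, simplified layouts for illustration only. */
typedef struct {
    char   *ptr;   /* mutable character data */
    size_t  size;  /* bytes in use */
    size_t  cap;   /* bytes allocated */
} CharBuf;

typedef struct {
    const char *ptr;   /* immutable from the String's point of view */
    size_t      size;
} String;

/* Hand the buffer over to a new String and reset the CharBuf to empty,
 * so no mutable reference to the String's data remains. */
static String
charbuf_yield_string(CharBuf *cb) {
    String str = { cb->ptr, cb->size };
    cb->ptr  = NULL;
    cb->size = 0;
    cb->cap  = 0;
    return str;
}
```

After the call the CharBuf no longer points at the data, so later appends can't change the
String behind its back. The String is then responsible for releasing the buffer.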

Nick

