Mailing-List: contact dev-help@lucy.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucy.apache.org
Received-SPF: pass (athena.apache.org: domain of vajda@osafoundation.org
 designates 149.20.54.96 as permitted sender)
References: <85DD0B91-7200-4490-AB5C-7CF888660C09@aevum.de>
 <CAAS6=7j2d7+S8CoW+u_j-q1X5NV9S_76jgZR3AJnb5t2ePvs2Q@mail.gmail.com>
 <5175352B.4080708@aevum.de>
 <CAAS6=7iCJ9yEPXjO30GH-z9myb1+61ZcDacB8zLN+UhE2N+5hA@mail.gmail.com>
From: Andi Vajda <vajda@osafoundation.org>
Content-Type: multipart/alternative;
	boundary=Apple-Mail-32773A91-D951-419E-A6D9-05D829992915
In-Reply-To: 
 <CAAS6=7iCJ9yEPXjO30GH-z9myb1+61ZcDacB8zLN+UhE2N+5hA@mail.gmail.com>
Message-Id: <F71DCC19-6483-4083-A019-69D0D8ECBFE2@osafoundation.org>
Date: Mon, 22 Apr 2013 23:02:01 -0700
To: "dev@lucy.apache.org" <dev@lucy.apache.org>
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (1.0)
Subject: Re: [lucy-dev] CharBuf functions taking char* arguments

--Apple-Mail-32773A91-D951-419E-A6D9-05D829992915
Content-Type: text/plain;
	charset=us-ascii
Content-Transfer-Encoding: quoted-printable


On Apr 22, 2013, at 22:01, Marvin Humphrey <marvin@rectangular.com> wrote:

> On Mon, Apr 22, 2013 at 6:03 AM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
>> Maybe we should start to flesh out the design of immutable Strings.  Do you
>> have a concrete plan already?
>
> To get started, we could simply duplicate CharBuf's implementation and strip
> out the mutability. :)
>
> The unusual requirements of Clownfish Strings are that they have to wrap host
> strings when bridging the host/C border.  But it seems to me that this will
> always mean borrowing the host string's internal character array for use with
> a stack-allocated `const ZombieString*`.
>
> Looking forward...
>
> There are only so many ways to implement the "immutable String class" design
> pattern. :)  See Python, Ruby Symbol, various implementations of Java
> String, C#, etc.
>
> The first question is how to handle the internal buffer.  CharBufs need to own
> their buffers; immutable Strings do not.
>
> Multiple immutable String objects can share a single contiguous buffer, though
> the buffer must outlive all of them.  One possible implementation is to wrap
> the buffer in an object which has a refcount or otherwise fits into the GC
> regime.
>
>    String*
>    Str_init_from_trusted_utf8_byte_buf(String *self, ByteBuf *buffer,
>                                        size_t offset, size_t size) {
>        self->buffer   = (ByteBuf*)INCREF(buffer);
>        self->content  = (char*)BB_Get_Buf(buffer) + offset;
>        self->size     = size;
>        self->hash_sum = -1;
>        return self;
>    }
>
> This is similar to typical Java String implementations:
>
>    public class String {
>        private char[] value;
>        private int offset;  // location in `value` where string starts
>        private int count;   // length
>        private int hash;
>        ...
>    }
>
> There's extra memory cost to going that route, but it buys you some
> flexibility.
>
> The second question is whether to NUL-terminate UTF-8 Strings -- and as a
> corollary, to guarantee that raw UTF-8 character data obtained from a String
> will be NUL-terminated.  This is hard.  Can we guarantee that every host
> string we wrap will be NUL-terminated?  I know Perl tries hard to keep string
> SVs NUL-terminated, but I don't imagine that every XS module everywhere
> succeeds.
>
> The alternative is not to NUL-terminate, but to cache a NUL-terminated
> C-string representation on demand.  As an optimization, we could check to see
> whether the internal buffer is in fact NUL-terminated and use it if it
> is.
>
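
For what it's worth, here is a rough sketch of how that on-demand cache might
look. This is only an illustration under my own assumptions: CachedStr, its
`capacity` field and CachedStr_to_c_str() are made-up names, not Clownfish API.
The `capacity` field is there because peeking at the byte after the content is
only safe if we know that byte is readable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an immutable String; not the real Clownfish struct. */
typedef struct {
    const char *content;   /* borrowed UTF-8 bytes, maybe not NUL-terminated */
    size_t      size;      /* length of the string in bytes */
    size_t      capacity;  /* readable bytes at `content`; may exceed `size` */
    char       *c_str;     /* lazily created NUL-terminated copy, or NULL */
} CachedStr;

/* Return a NUL-terminated view of the string, copying only when needed.
 * Returns NULL if the copy cannot be allocated. */
static const char*
CachedStr_to_c_str(CachedStr *self) {
    assert(self != NULL);
    /* Optimization: reuse the buffer if a NUL already follows the content. */
    if (self->capacity > self->size && self->content[self->size] == '\0') {
        return self->content;
    }
    /* Otherwise build the copy once and hand out the cached copy afterwards. */
    if (self->c_str == NULL) {
        char *copy = (char*)malloc(self->size + 1);
        if (copy == NULL) {
            return NULL;
        }
        memcpy(copy, self->content, self->size);
        copy[self->size] = '\0';
        self->c_str = copy;
    }
    return self->c_str;
}
```

Whoever destroys the String would also have to free `c_str`, which is part of
the extra bookkeeping this approach costs.
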
>> How should CharBufs and Strings interact?
>
> IMO... CharBuf's primary use case should be to build Strings: after you've
> manipulated the CharBuf to contain the desired character sequence, invoke
> To_String() to create a new String.
>
> It probably also makes sense to add a Yield_String() method to CharBuf which
> spins off a String which steals the CharBuf's buffer and resets it to empty.
>
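
To make the difference between the two concrete, here is a minimal sketch,
assuming toy SketchCharBuf and SketchString structs of my own invention (the
real Clownfish objects are heap-allocated and refcounted, which is left out
here). The copying variant leaves the builder untouched; the yielding variant
hands over the buffer and resets the builder to empty.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy types, returned by value to keep the sketch short. */
typedef struct { char *ptr; size_t size; size_t cap; } SketchCharBuf;
typedef struct { char *ptr; size_t size; } SketchString;

/* Append `len` bytes to the builder; returns 1 on success, 0 on failure. */
static int
SketchCB_cat(SketchCharBuf *self, const char *bytes, size_t len) {
    assert(self != NULL && bytes != NULL);
    if (self->size + len > self->cap) {
        size_t new_cap = (self->size + len) * 2;
        char *grown = (char*)realloc(self->ptr, new_cap);
        if (grown == NULL) {
            return 0;
        }
        self->ptr = grown;
        self->cap = new_cap;
    }
    if (len > 0) {
        memcpy(self->ptr + self->size, bytes, len);
    }
    self->size += len;
    return 1;
}

/* To_String flavor: copy the bytes, leave the builder as it was. */
static SketchString
SketchCB_to_string(const SketchCharBuf *self) {
    SketchString s = { NULL, 0 };
    assert(self != NULL);
    s.ptr = (char*)malloc(self->size > 0 ? self->size : 1);
    if (s.ptr != NULL && self->size > 0) {
        memcpy(s.ptr, self->ptr, self->size);
        s.size = self->size;
    }
    return s;
}

/* Yield_String flavor: steal the buffer, reset the builder to empty. */
static SketchString
SketchCB_yield_string(SketchCharBuf *self) {
    SketchString s;
    assert(self != NULL);
    s.ptr  = self->ptr;
    s.size = self->size;
    self->ptr  = NULL;
    self->size = 0;
    self->cap  = 0;
    return s;
}
```

The yield variant avoids the copy, at the price that the String may carry
unused capacity left over from building.
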
>> Isn't UTF-32 used in Python (among other encodings)?

Python is moving to a model where a string could be in any UTF width, based on its characters:
 http://www.python.org/dev/peps/pep-0393/
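
As I read the PEP, the width is picked from the largest code point in the
string: 1 byte per character if everything is below U+0100, 2 bytes if
everything is below U+10000, and 4 bytes otherwise. A small sketch of that
selection rule in C, where sketch_storage_width() is a name I made up and not
anything from CPython or Clownfish:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the bytes per character (1, 2 or 4) needed to store every code
 * point in the array at a fixed width. An empty string gets width 1. */
static int
sketch_storage_width(const uint32_t *code_points, size_t count) {
    uint32_t max = 0;
    size_t   i;
    assert(code_points != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (code_points[i] > max) {
            max = code_points[i];
        }
    }
    if (max < 0x100) {
        return 1;   /* fits in Latin-1 */
    }
    if (max < 0x10000) {
        return 2;   /* fits in UCS-2 */
    }
    return 4;       /* needs UCS-4 */
}
```

So a single character outside the Basic Multilingual Plane makes the whole
string 4 bytes wide.
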

Andi..

>
> Yes, though it's not clear to me how often you'd encounter UTF-32 in the wild.
>
>> I found only three users of the "steal" constructors:
>>
>>    * S_unescape_text in Lucy::Util::Json could be changed to use
>>      a CharBuf and Cat_Char
>>    * SkipStepper_to_string could simply use CB_newf, no?
>>    * DefDocReader_fetch_doc in the C bindings could create an
>>      extra copy or we could add something like InStream#ReadString.
>
> You're right about SkipStepper.  I'd suggest using a CharBuf with
> Yield_String() for the DefDocReader and Json use cases.
>
> Marvin Humphrey

--Apple-Mail-32773A91-D951-419E-A6D9-05D829992915--