From: Andi Vajda
Date: Mon, 22 Apr 2013 23:02:01 -0700
To: "dev@lucy.apache.org"
References: <85DD0B91-7200-4490-AB5C-7CF888660C09@aevum.de> <5175352B.4080708@aevum.de>
Subject: Re: [lucy-dev] CharBuf
 functions taking char* arguments

On Apr 22, 2013, at 22:01, Marvin Humphrey wrote:

> On Mon, Apr 22, 2013 at 6:03 AM, Nick Wellnhofer wrote:
>> Maybe we should start to flesh out the design of immutable Strings. Do you
>> have a concrete plan already?
>
> To get started, we could simply duplicate CharBuf's implementation and strip
> out the mutability. :)
>
> The unusual requirements of Clownfish Strings are that they have to wrap host
> strings when bridging the host/C border. But it seems to me that this will
> always mean borrowing the host string's internal character array for use with
> a stack-allocated `const ZombieString*`.
>
> Looking forward...
>
> There are only so many ways to implement the "immutable String class" design
> pattern. :) See Python, Ruby Symbol, various implementations of Java
> String, C#, etc.
>
> The first question is how to handle the internal buffer. CharBufs need to own
> their buffers; immutable Strings do not.
>
> Multiple immutable String objects can share a single contiguous buffer, though
> the buffer must outlive all of them. One possible implementation is to wrap
> the buffer in an object which has a refcount or otherwise fits into the GC
> regime.
>
>     String*
>     Str_init_from_trusted_utf8_byte_buf(String *self, ByteBuf *buffer,
>                                         size_t offset, size_t size) {
>         self->buffer   = (ByteBuf*)INCREF(buffer);
>         self->content  = (char*)BB_Get_Buf(buffer) + offset;
>         self->size     = size;
>         self->hash_sum = -1;
>         return self;
>     }
>
> This is similar to typical Java String implementations:
>
>     public class String {
>         private char[] value;
>         private int offset; // location in `value` where string starts
>         private int count;  // length
>         private int hash;
>         ...
>     }
>
> There's extra memory cost to going that route, but it buys you some
> flexibility.
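
For illustration only, here is a minimal standalone sketch of the shared-buffer idea quoted above: several immutable strings point into one refcounted buffer, and the buffer is freed when the last reference goes away. The names `SharedBuf`, `Str`, `sharedbuf_new`, `sharedbuf_decref`, `str_view` and `str_destroy` are hypothetical stand-ins, not the Clownfish `ByteBuf`/`String` API, and hash caching and UTF-8 validation are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted byte buffer; stands in for ByteBuf. */
typedef struct SharedBuf {
    size_t refcount;
    size_t size;
    char  *bytes;
} SharedBuf;

/* Hypothetical immutable string: a window into a shared buffer. */
typedef struct Str {
    SharedBuf  *buffer;   /* keeps the bytes alive */
    const char *content;  /* start of this string inside buffer->bytes */
    size_t      size;     /* length in bytes */
} Str;

/* Copy `size` bytes into a new buffer whose refcount starts at 1. */
SharedBuf*
sharedbuf_new(const char *bytes, size_t size) {
    SharedBuf *buf = malloc(sizeof(SharedBuf));
    if (buf == NULL) { abort(); }
    buf->bytes = malloc(size ? size : 1);
    if (buf->bytes == NULL) { abort(); }
    if (size > 0) { memcpy(buf->bytes, bytes, size); }
    buf->size     = size;
    buf->refcount = 1;
    return buf;
}

/* Drop one reference; free the bytes when the last owner lets go. */
void
sharedbuf_decref(SharedBuf *buf) {
    if (--buf->refcount == 0) {
        free(buf->bytes);
        free(buf);
    }
}

/* Create a string over bytes [offset, offset + size). The string takes
 * its own reference, so the buffer outlives every string viewing it. */
Str
str_view(SharedBuf *buf, size_t offset, size_t size) {
    assert(offset <= buf->size && size <= buf->size - offset);
    buf->refcount++;
    Str s = { buf, buf->bytes + offset, size };
    return s;
}

/* Release the string's reference to its buffer. */
void
str_destroy(Str *s) {
    sharedbuf_decref(s->buffer);
    s->buffer  = NULL;
    s->content = NULL;
    s->size    = 0;
}
```

With this layout, two `Str` values made by `str_view(buf, 0, 5)` and `str_view(buf, 6, 5)` over the bytes "hello world" share one allocation. The cost is the extra pointer per string, which is the memory trade-off the quoted message mentions.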
>
> The second question is whether to NUL-terminate UTF-8 Strings -- and as a
> corollary, to guarantee that raw UTF-8 character data obtained from a String
> will be NUL-terminated. This is hard. Can we guarantee that every host
> string we wrap will be NUL-terminated? I know Perl tries hard to keep string
> SVs NUL-terminated, but I don't imagine that every XS module everywhere
> succeeds.
>
> The alternative is not to NUL-terminate, but to cache a NUL-terminated
> C-string representation on demand. As an optimization, we could check to see
> whether the internal buffer is in fact NUL-terminated and use it if it
> is.
>
>> How should CharBufs and Strings interact?
>
> IMO... CharBuf's primary use case should be to build Strings: after you've
> manipulated the CharBuf to contain the desired character sequence, invoke
> To_String() to create a new String.
>
> It probably also makes sense to add a Yield_String() method to CharBuf which
> spins off a String which steals the CharBuf's buffer and resets it to empty.
>
>> Isn't UTF-32 used in Python (among other encodings)?

Python is moving to a model where a string could be in any UTF width, based on
its characters: http://www.python.org/dev/peps/pep-0393/

Andi..

> Yes, though it's not clear to me how often you'd encounter UTF-32 in the wild.
>
>> I found only three users of the "steal" constructors:
>>
>> * S_unescape_text in Lucy::Util::Json could be changed to use
>>   a CharBuf and Cat_Char
>> * SkipStepper_to_string could simply use CB_newf, no?
>> * DefDocReader_fetch_doc in the C bindings could create an
>>   extra copy or we could add something like InStream#ReadString.
>
> You're right about SkipStepper. I'd suggest using a CharBuf with
> Yield_String() for the DefDocReader and Json use cases.
>
> Marvin Humphrey
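
For illustration only, the difference between the To_String() and Yield_String() ideas in the quoted message can be sketched in standalone C: the first copies the builder's bytes, the second hands over the allocation and leaves the builder empty. All names here (`Builder`, `OwnedStr`, `builder_cat`, `builder_to_string`, `builder_yield_string`, `ownedstr_destroy`) are hypothetical and are not the Clownfish CharBuf API. The sketch also assumes the string owns its bytes outright, with no refcounting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mutable builder; stands in for CharBuf. */
typedef struct Builder {
    char  *ptr;
    size_t size;
    size_t cap;
} Builder;

/* Hypothetical immutable string that owns its bytes. */
typedef struct OwnedStr {
    char  *ptr;
    size_t size;
} OwnedStr;

/* Append `len` bytes, growing the allocation as needed. */
void
builder_cat(Builder *b, const char *bytes, size_t len) {
    assert(bytes != NULL || len == 0);
    if (len == 0) { return; }
    if (b->size + len > b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 16;
        while (new_cap < b->size + len) { new_cap *= 2; }
        char *grown = realloc(b->ptr, new_cap);
        if (grown == NULL) { abort(); }
        b->ptr = grown;
        b->cap = new_cap;
    }
    memcpy(b->ptr + b->size, bytes, len);
    b->size += len;
}

/* To_String-style: copy the bytes; the builder keeps its content. */
OwnedStr
builder_to_string(const Builder *b) {
    OwnedStr s = { malloc(b->size ? b->size : 1), b->size };
    if (s.ptr == NULL) { abort(); }
    if (b->size > 0) { memcpy(s.ptr, b->ptr, b->size); }
    return s;
}

/* Yield_String-style: hand the allocation to the string without copying
 * and reset the builder to empty so it can be reused. */
OwnedStr
builder_yield_string(Builder *b) {
    OwnedStr s = { b->ptr, b->size };
    b->ptr  = NULL;
    b->size = 0;
    b->cap  = 0;
    return s;
}

/* Free the string's bytes. */
void
ownedstr_destroy(OwnedStr *s) {
    free(s->ptr);
    s->ptr  = NULL;
    s->size = 0;
}
```

In this sketch the yielded string keeps whatever spare capacity the builder had allocated. A real implementation might shrink the allocation first, or append a NUL terminator at this point.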