From: Nick Wellnhofer
Date: Sat, 13 Apr 2013 17:15:20 +0200
To: dev@lucy.apache.org
Message-Id: <7D60D702-3184-4B00-8F1B-D1308490E180@aevum.de>
Subject: Re: [lucy-dev] Initial sketch of string iterators

On Apr 13, 2013, at 04:01 , Marvin Humphrey wrote:

> *   The next patch, "Convert CB_Trim_{Top|Tail} to use iterators", looks
>     perfect as well.  No need to implement in concrete subclasses, I see. :)

Yes, that's the idea. Once the switch to iterators is complete, adding a
new string encoding should be really easy:

* Implement the iterator class for the encoding.
* Derive a new string class from CharBuf which only has to implement
  the iterator factory methods and Cat_Char.

> *   Should CharBufIterator#Next return 0 when it reaches the end of a string
>     instead of throwing an exception?  That would make it possible to iterate
>     through text values which are not expected to contain NUL without having
>     to call Has_Next() before each call to Next().

This seems like a worthwhile optimization. Here's what both versions
would look like:

    CharBufIterator *iter = CB_Top(string);
    while (CBIter_Has_Next(iter)) {
        uint32_t code_point = CBIter_Next(iter);
        ...
    }

    CharBufIterator *iter = CB_Top(string);
    uint32_t code_point;
    while (0 != (code_point = CBIter_Next(iter))) {
        ...
    }

I find the Has_Next() version more readable, but the second variant isn't
too bad. The only thing I don't like is the choice of 0 as the sentinel
value. I'd prefer -1 or any positive value beyond the Unicode planes.

After reaching the end of the string, I can see two options, each with
its pros and cons:

* Change to a new iterator state ("beyond string boundary") and
  throw an exception on every subsequent access (like in your example
  below).
* Keep returning the sentinel value on calls to Next() but allow
  moving backward via Prev().
  This can result in an infinite loop in faulty code.

> *   I can imagine some possibilities for optimizing the internals, but those
>     are implementation details (which I imagine you left out on purpose).

Yes, the Prev() and Next() methods should decode characters directly
without calling StrHelp_decode_utf8_char.

> It seems to me that the factory methods Top() and Tail() should be public,
> while ZTop() and ZTail() should have parcel exposure.  More generally, I've
> moved towards the position that while we want to have the core allocate
> objects on the stack for certain internal tasks, it would be unwise at this
> point for Clownfish to support stack object allocation as a public API.
> I'm actually hoping that as we overhaul Clownfish string handling that
> ZombieCharBuf can go away.  But what you've presented makes sense and fits
> perfectly within the current context.

I think the string iterators are good candidates for "zombie" objects.
They're small and typically only used within a single function. But +1
for not exposing ZombieIterators publicly.

> How about something like this, which I think should have the same cost?
>
> -    if (self->byte_offset >= char_buf->size) {
> -        THROW(ERR, "Iteration past end of string");
> -    }
> +    if (self->byte_offset > char_buf->size) {
> +        THROW(ERR, "Iteration past end of string");
> +    }
> +    else if (self->byte_offset == char_buf->size) {

Something like ++self->byte_offset would be needed here.

> +        return 0;
> +    }

See discussion above.

>> +ZombieUTF8Iterator*
>> +ZUTF8Iter_new(void *allocation, CharBuf *char_buf, size_t byte_offset) {
>> +    ZombieUTF8Iterator *self = (ZombieUTF8Iterator*)allocation;
>> +    self->ref.count = 1;
>
> Might want to go through VTable_Init_Obj(), no? ;)

Sure ;)

Nick
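
For illustration, here is a minimal, self-contained sketch of the second
option discussed above: Next() keeps returning a sentinel at the end of the
string, and Prev() can still move backward afterwards. It is not Clownfish
code. All names (SketchIter, sketch_iter_top, sketch_iter_next,
sketch_iter_prev, sketch_count_code_points, SKETCH_ITER_DONE) are
hypothetical, the sentinel is assumed to be 0xFFFFFFFF (that is, -1 as a
uint32_t, which also lies beyond the Unicode planes), and the decoder
assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: above U+10FFFF, so it cannot collide with a real
 * code point (unlike 0, which is a valid character). */
#define SKETCH_ITER_DONE UINT32_C(0xFFFFFFFF)

/* Hypothetical stand-in for a string iterator; not the Clownfish struct. */
typedef struct {
    const uint8_t *buf;
    size_t         size;         /* string length in bytes */
    size_t         byte_offset;  /* current position, 0..size */
} SketchIter;

static SketchIter
sketch_iter_top(const char *utf8, size_t size) {
    assert(utf8 != NULL);
    SketchIter iter = { (const uint8_t*)utf8, size, 0 };
    return iter;
}

/* Decode the code point at the current position and advance. At the end of
 * the string, return the sentinel and leave the position unchanged. A
 * truncated trailing sequence is treated as end of string in this sketch. */
static uint32_t
sketch_iter_next(SketchIter *self) {
    if (self->byte_offset >= self->size) { return SKETCH_ITER_DONE; }
    const uint8_t *p = self->buf + self->byte_offset;
    uint32_t cp;
    size_t   len;
    if      (p[0] < 0x80) { cp = p[0];        len = 1; }
    else if (p[0] < 0xE0) { cp = p[0] & 0x1F; len = 2; }
    else if (p[0] < 0xF0) { cp = p[0] & 0x0F; len = 3; }
    else                  { cp = p[0] & 0x07; len = 4; }
    if (len > self->size - self->byte_offset) { return SKETCH_ITER_DONE; }
    for (size_t i = 1; i < len; i++) {
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    self->byte_offset += len;
    return cp;
}

/* Step back over one code point and return it. At the start of the string,
 * return the sentinel. This works after Next() has hit the end. */
static uint32_t
sketch_iter_prev(SketchIter *self) {
    if (self->byte_offset == 0) { return SKETCH_ITER_DONE; }
    size_t start = self->byte_offset - 1;
    while (start > 0 && (self->buf[start] & 0xC0) == 0x80) { start--; }
    SketchIter tmp = *self;
    tmp.byte_offset = start;
    uint32_t cp = sketch_iter_next(&tmp);
    self->byte_offset = start;
    return cp;
}

/* The sentinel-driven loop: no Has_Next() call is needed, and embedded NUL
 * characters do not terminate the iteration. */
static size_t
sketch_count_code_points(const char *utf8, size_t size) {
    SketchIter iter  = sketch_iter_top(utf8, size);
    size_t     count = 0;
    while (SKETCH_ITER_DONE != sketch_iter_next(&iter)) { count++; }
    return count;
}
```

With an out-of-range sentinel the loop also handles strings that contain
NUL, which a sentinel of 0 cannot do. The infinite-loop risk mentioned
above remains: code that alternates Prev() and Next() at the boundary
never sees an exception.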