Mailing-List: contact dev-help@lucy.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucy.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Content-Type: text/plain; charset=iso-8859-1
Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\))
From: Nick Wellnhofer <wellnhofer@aevum.de>
In-Reply-To: 
 <CAAS6=7jFr+XNcvi=VsWpOnK7_D9jXuh2eS6nf15uzObXB1BrxQ@mail.gmail.com>
Date: Sat, 13 Apr 2013 17:15:20 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <7D60D702-3184-4B00-8F1B-D1308490E180@aevum.de>
References: 
 <CAAS6=7jFr+XNcvi=VsWpOnK7_D9jXuh2eS6nf15uzObXB1BrxQ@mail.gmail.com>
To: dev@lucy.apache.org
Subject: Re: [lucy-dev] Initial sketch of string iterators

On Apr 13, 2013, at 04:01 , Marvin Humphrey <marvin@rectangular.com> =
wrote:

> *   The next patch, "Convert CB_Trim_{Top|Tail} to use iterators", =
looks
>    perfect as well.  No need to implement in concrete subclasses, I =
see. :)

Yes, that's the idea. If the switch to iterators is complete, adding a =
new string encoding should be really easy:

    * Implement the iterator class for the encoding.
    * Derive a new string class from CharBuf which only has to implement
      the iterator factory methods and Cat_Char.

> *   Should CharBufIterator#Next return 0 when it reaches the end of a =
string
>    instead of throwing an exception?  That would make it possible to =
iterate
>    through text values which are not expected to contain NUL without =
having
>    to call Has_Next() before each call to Next().

This seems like a worthwhile optimization. Here's how both versions =
would look like:

    CharBufIterator *iter =3D CB_Top(string);
    while (CBIter_Has_Next(iter)) {
        uint32_t code_point =3D CBIter_Next(iter);
        ...
    }

    CharBufIterator *iter =3D CB_Top(string);
    uint32_t code_point;
    while (0 !=3D (code_point =3D CBIter_Next(iter))) {
        ...
    }

I find the Has_Next() version more readable but the second variant isn't =
too bad. The only thing I don't like is the choice of 0 as sentinel =
value. I'd prefer -1 or any positive value beyond the Unicode planes.

After reaching the end of the string, I can see two options which both =
have their pros and cons:

    * Change to a new iterator state ("beyond string boundary") and =
throw
      an exception on every subsequent access (like in your example =
below).
    * Keep returning the sentinel value on calls to Next() but allow to
      move backward via Prev(). This can result in an infinite loop in
      faulty code.

> *   I can imagine some possibilities for optimizing the internals, but =
those
>    are implementation details (which I imagine you left out on =
purpose).

Yes, the Prev() and Next() methods should decode characters directly =
without calling StrHelp_decode_utf8_char.

> It seems to me that the factory methods Top() and Tail() should be =
public,
> while ZTop() and ZTail() should have parcel exposure.  More generally, =
I've
> moved towards the position that while we want to have the core =
allocate
> objects on the stack for certain internal tasks, it would be unwise at =
this
> point for Clownfish to support stack object allocation as a public =
API.
> I'm actually hoping that as we overhaul Clownfish string handling that
> ZombieCharBuf can go away.  But what you've presented makes sense and =
fits
> perfectly within the current context.

I think the string iterators are good candidates for "zombie" objects. =
They're small and typically only used within a single function. But +1 =
for not exposing ZombieIterators publicly.

> How about something like this, which I think should have the same =
cost?
>=20
> -    if (self->byte_offset >=3D char_buf->size) {
> -        THROW(ERR, "Iteration past end of string");
> -    }
> +    if (self->byte_offset > char_buf->size) {
> +        THROW(ERR, "Iteration past end of string");
> +    }
> +    else if (self->byte_offset =3D=3D char_buf->size) {

Something like ++self->byte_offset would be needed here.

> +        return 0;
> +    }

See discussion above.

>> +ZombieUTF8Iterator*
>> +ZUTF8Iter_new(void *allocation, CharBuf *char_buf, size_t =
byte_offset) {
>> +    ZombieUTF8Iterator *self =3D (ZombieUTF8Iterator*)allocation;
>> +    self->ref.count   =3D 1;
>=20
> Might want to go through VTable_Init_Obj(), no?  ;)

Sure ;)

Nick