lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] Initial sketch of string iterators
Date Sat, 13 Apr 2013 15:15:20 GMT
On Apr 13, 2013, at 04:01 , Marvin Humphrey <marvin@rectangular.com> wrote:

> *   The next patch, "Convert CB_Trim_{Top|Tail} to use iterators", looks
>    perfect as well.  No need to implement in concrete subclasses, I see. :)

Yes, that's the idea. If the switch to iterators is complete, adding a new string encoding
should be really easy:

    * Implement the iterator class for the encoding.
    * Derive a new string class from CharBuf which only has to implement
      the iterator factory methods and Cat_Char.

> *   Should CharBufIterator#Next return 0 when it reaches the end of a string
>    instead of throwing an exception?  That would make it possible to iterate
>    through text values which are not expected to contain NUL without having
>    to call Has_Next() before each call to Next().

This seems like a worthwhile optimization. Here's how both versions would look like:

    CharBufIterator *iter = CB_Top(string);
    while (CBIter_Has_Next(iter)) {
        uint32_t code_point = CBIter_Next(iter);
        ...
    }

    CharBufIterator *iter = CB_Top(string);
    uint32_t code_point;
    while (0 != (code_point = CBIter_Next(iter))) {
        ...
    }

I find the Has_Next() version more readable but the second variant isn't too bad. The only
thing I don't like is the choice of 0 as sentinel value. I'd prefer -1 or any positive value
beyond the Unicode planes.

After reaching the end of the string, I can see two options which both have their pros and
cons:

    * Change to a new iterator state ("beyond string boundary") and throw
      an exception on every subsequent access (like in your example below).
    * Keep returning the sentinel value on calls to Next() but allow to
      move backward via Prev(). This can result in an infinite loop in
      faulty code.

> *   I can imagine some possibilities for optimizing the internals, but those
>    are implementation details (which I imagine you left out on purpose).

Yes, the Prev() and Next() methods should decode characters directly without calling StrHelp_decode_utf8_char.

> It seems to me that the factory methods Top() and Tail() should be public,
> while ZTop() and ZTail() should have parcel exposure.  More generally, I've
> moved towards the position that while we want to have the core allocate
> objects on the stack for certain internal tasks, it would be unwise at this
> point for Clownfish to support stack object allocation as a public API.
> I'm actually hoping that as we overhaul Clownfish string handling that
> ZombieCharBuf can go away.  But what you've presented makes sense and fits
> perfectly within the current context.

I think the string iterators are good candidates for "zombie" objects. They're small and typically
only used within a single function. But +1 for not exposing ZombieIterators publicly.

> How about something like this, which I think should have the same cost?
> 
> -    if (self->byte_offset >= char_buf->size) {
> -        THROW(ERR, "Iteration past end of string");
> -    }
> +    if (self->byte_offset > char_buf->size) {
> +        THROW(ERR, "Iteration past end of string");
> +    }
> +    else if (self->byte_offset == char_buf->size) {

Something like ++self->byte_offset would be needed here.

> +        return 0;
> +    }

See discussion above.

>> +ZombieUTF8Iterator*
>> +ZUTF8Iter_new(void *allocation, CharBuf *char_buf, size_t byte_offset) {
>> +    ZombieUTF8Iterator *self = (ZombieUTF8Iterator*)allocation;
>> +    self->ref.count   = 1;
> 
> Might want to go through VTable_Init_Obj(), no?  ;)

Sure ;)

Nick


Mime
View raw message