lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] CharBuf functions taking char* arguments
Date Wed, 24 Apr 2013 18:47:11 GMT
On Mon, Apr 22, 2013 at 11:02 PM, Andi Vajda <vajda@osafoundation.org> wrote:

>>> Isn't UTF-32 used in Python (among other encodings)?
>
> Python is moving to a model where a string could be in any UTF width, based
> on its characters:
>
>  http://www.python.org/dev/peps/pep-0393/

Thanks, Andi.  Like strings in Python (and Perl -- but not Java), strings in
Clownfish have a requirement to support multiple encodings for the same
logical content.  Reviewing this PEP gave me the opportunity to rethink some
assumptions I'd made when CharBuf was written.

My expectation was that we'd ultimately support encoding variability through
subclassing: CharBufUTF8, CharBufUTF16, and so on -- but Python has everything
in one class.  That would have seemed unwieldy for a mutable type, but maybe
it's reasonable if our String type is immutable.

Our motivations for supporting multiple internal encodings differ from those
of Python.

*   In Python, the unfortunate idiom of treating strings as random-access
    character arrays has to be supported, so strings support multiple
    fixed-width representations (Latin-1, UCS-2, UCS-4) and the smallest
    width is chosen (according to the largest code point in the string) in
    order to minimize memory.
*   In Clownfish, we're driven by the need to interface with multiple host
    languages (though not at the same time, hmm).
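
As a rough illustration of the Python side of that comparison, the width
selection can be expressed in terms of the largest code point.  This is only
a sketch: `min_fixed_width` is a hypothetical helper, not something provided
by Python or Clownfish, and the thresholds assume the 1-, 2- and 4-byte
representations described in PEP 393.

```python
def min_fixed_width(text: str) -> int:
    """Return the smallest fixed width, in bytes per character (1, 2 or 4),
    that can hold every code point in `text`."""
    # An empty string has no code points; treat it as the narrowest case.
    largest = max(map(ord, text), default=0)
    if largest < 0x100:      # every code point fits in one byte
        return 1
    if largest < 0x10000:    # every code point fits in two bytes
        return 2
    return 4                 # at least one code point needs four bytes


# A single wide character forces the whole string to the wider width.
widths = [min_fixed_width(s) for s in ("abc", "\u20ac", "abc\U0001F600")]
```

Here `widths` comes out as one, two and four bytes per character
respectively, even though the last string is mostly ASCII.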

I suggested earlier that CharBuf might need only a single constructor, with an
initial capacity argument -- but once we start supporting multiple encodings,
that will have to be specified as well.  However, for the sake of simplicity,
robustness and speed, objects which are used to build up strings should
probably support only one encoding.
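
To make the single-encoding idea concrete, here is a minimal sketch of a
builder that takes an initial capacity and only ever stores UTF-8.  The class
name `Utf8Builder` and its methods are hypothetical and are not the CharBuf
API; converting to another encoding only at output time is my assumption
about how such a builder might be used.

```python
class Utf8Builder:
    """Hypothetical string builder whose internal storage is always UTF-8."""

    def __init__(self, capacity: int = 16) -> None:
        self._buf = bytearray(max(capacity, 1))  # preallocated storage
        self._size = 0                           # bytes actually in use

    def append(self, text: str) -> None:
        data = text.encode("utf-8")
        needed = self._size + len(data)
        if needed > len(self._buf):
            # Grow geometrically so repeated appends stay cheap on average.
            new_cap = max(needed, 2 * len(self._buf))
            self._buf.extend(bytes(new_cap - len(self._buf)))
        # Same-length slice assignment overwrites in place.
        self._buf[self._size:needed] = data
        self._size = needed

    def to_bytes(self, encoding: str = "utf-8") -> bytes:
        raw = bytes(self._buf[:self._size])
        if encoding == "utf-8":
            return raw                           # no conversion needed
        # Any other encoding is produced only at the very end.
        return raw.decode("utf-8").encode(encoding)


builder = Utf8Builder(capacity=4)
builder.append("h\u00e9llo")
builder.append(" w\u00f6rld")
utf8_bytes = builder.to_bytes()
utf16_bytes = builder.to_bytes("utf-16-le")
```

The append path never has to branch on encoding, which is where the
simplicity and speed argument comes from; the cost of supporting other
encodings is paid once, when the finished string is handed off.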

Nick, it seems to me that your iterators can work well with either a
single-class or a subclassing approach, for both CharBuf and String.
Thoughts?

I'd prefer not to commit one way or the other yet -- we can implement an
immutable String class while maintaining support for only UTF-8 right now, and
take stock later on.  There's going to be a lot of superficial churn in Lucy
as we change `CharBuf` to `String` everywhere.  The implementation changes
later won't have such large ripple effects.

Marvin Humphrey
