lucy-dev mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-dev] CharBuf functions taking char* arguments
Date Mon, 22 Apr 2013 13:03:39 GMT
On 22/04/2013 06:06, Marvin Humphrey wrote:
> The optimal answers to these questions may change after we introduce an
> immutable String class.  String is the use case that needs all the convenience
> constructors.  CharBuf probably needs only one constructor, which takes a size
> for its buffer.
>
>      public inert incremented CharBuf*
>      new(size_t size);

Maybe we should start to flesh out the design of immutable Strings. Do 
you have a concrete plan already? How should CharBufs and Strings interact?

> Another thing to consider is which encodings we might support via raw content
> constructors.  I suspect we'll never go beyond common Unicode: UTF-8,
> UTF-16LE, UTF-16BE, and possibly UTF-32LE and UTF-32BE.

Isn't UTF-32 used in Python (among other encodings)?
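
For illustration, a raw content constructor for UTF-16 would mostly come down to a decoding loop like the one below. This is only a sketch: DemoByteOrder, demo_read_unit and demo_decode_utf16 are hypothetical names invented for this example, not existing Lucy or Clownfish functions, and a real constructor would presumably write into the string's internal buffer rather than into an array of code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-order tag; not part of any existing API. */
typedef enum { DEMO_UTF16LE, DEMO_UTF16BE } DemoByteOrder;

/* Read one 16-bit code unit from two bytes in the given byte order. */
static uint16_t
demo_read_unit(const unsigned char *p, DemoByteOrder order) {
    return order == DEMO_UTF16LE
           ? (uint16_t)(p[0] | (p[1] << 8))
           : (uint16_t)((p[0] << 8) | p[1]);
}

/* Decode raw UTF-16 bytes into code points.
 * Returns the number of code points written, or -1 if the input is
 * malformed (odd byte count, unpaired surrogate) or `out` is too small. */
static long
demo_decode_utf16(const unsigned char *bytes, size_t len,
                  DemoByteOrder order, uint32_t *out, size_t out_cap) {
    assert(bytes != NULL && out != NULL);
    if (len % 2 != 0) { return -1; }

    size_t n = 0;
    for (size_t i = 0; i < len; i += 2) {
        uint32_t cp = demo_read_unit(bytes + i, order);
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i + 4 > len) { return -1; }
            uint32_t lo = demo_read_unit(bytes + i + 2, order);
            if (lo < 0xDC00 || lo > 0xDFFF) { return -1; }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2; /* skip the low surrogate as well */
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            /* Low surrogate without a preceding high surrogate. */
            return -1;
        }
        if (n == out_cap) { return -1; }
        out[n++] = cp;
    }
    return (long)n;
}
```

The LE and BE variants differ only in demo_read_unit, which suggests that supporting both byte orders costs little once one of them exists.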

> Anything else belongs in a library a la Perl's Encode.

+1

> It would be nice if we could eliminate the "steal" variants -- they tend to
> constrain how we implement String internally.

I found only three users of the "steal" constructors:

     * S_unescape_text in Lucy::Util::Json could be changed to use
       a CharBuf and Cat_Char
     * SkipStepper_to_string could simply use CB_newf, no?
     * DefDocReader_fetch_doc in the C bindings could create an
       extra copy or we could add something like InStream#ReadString.
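
To make the first suggestion concrete, here is a rough sketch of building text by appending code points to a growable buffer instead of handing a malloc'd char* to a "steal" constructor. DemoBuf, DemoBuf_new and DemoBuf_cat_char are hypothetical stand-ins for CharBuf, a size-taking constructor and Cat_Char; this is not the Clownfish implementation, only an assumption about what such code could look like.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CharBuf; not the real class. */
typedef struct {
    char   *ptr;   /* UTF-8 bytes, always NUL-terminated */
    size_t  size;  /* bytes in use, excluding the NUL */
    size_t  cap;   /* bytes allocated */
} DemoBuf;

/* Allocate a buffer with room for `size` bytes plus a trailing NUL. */
static DemoBuf*
DemoBuf_new(size_t size) {
    DemoBuf *self = (DemoBuf*)malloc(sizeof(DemoBuf));
    if (self == NULL) { abort(); }
    self->cap = size + 1;
    self->ptr = (char*)malloc(self->cap);
    if (self->ptr == NULL) { abort(); }
    self->ptr[0] = '\0';
    self->size   = 0;
    return self;
}

/* Make sure at least `needed` bytes are allocated. */
static void
DemoBuf_grow(DemoBuf *self, size_t needed) {
    if (needed <= self->cap) { return; }
    size_t new_cap = self->cap * 2;
    if (new_cap < needed) { new_cap = needed; }
    char *ptr = (char*)realloc(self->ptr, new_cap);
    if (ptr == NULL) { abort(); }
    self->ptr = ptr;
    self->cap = new_cap;
}

/* Append one Unicode code point, encoded as UTF-8. */
static void
DemoBuf_cat_char(DemoBuf *self, uint32_t cp) {
    /* Surrogates and values above U+10FFFF are not valid code points. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    DemoBuf_grow(self, self->size + 4 + 1); /* up to 4 bytes plus NUL */

    unsigned char *dest = (unsigned char*)self->ptr + self->size;
    size_t count;
    if (cp < 0x80) {
        dest[0] = (unsigned char)cp;
        count = 1;
    }
    else if (cp < 0x800) {
        dest[0] = (unsigned char)(0xC0 | (cp >> 6));
        dest[1] = (unsigned char)(0x80 | (cp & 0x3F));
        count = 2;
    }
    else if (cp < 0x10000) {
        dest[0] = (unsigned char)(0xE0 | (cp >> 12));
        dest[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dest[2] = (unsigned char)(0x80 | (cp & 0x3F));
        count = 3;
    }
    else {
        dest[0] = (unsigned char)(0xF0 | (cp >> 18));
        dest[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        dest[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dest[3] = (unsigned char)(0x80 | (cp & 0x3F));
        count = 4;
    }
    self->size += count;
    self->ptr[self->size] = '\0';
}

/* Release the buffer and its contents. */
static void
DemoBuf_destroy(DemoBuf *self) {
    free(self->ptr);
    free(self);
}
```

With something along these lines, the caller never owns the raw allocation, so nothing has to be "stolen" and the internal layout stays free to change.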

> Since Mimic_Str() changes content, it's only relevant for CharBuf, not String.
>
> Lucy currently uses Mimic_Str() in two places: FSDirHandle and PostingPool.
> If FSDirHandle continues to use CharBuf instead of switching to String it
> could use CB_setf() instead.  (FSDirHandle's usage is bogus anyway because
> it's assuming UTF-8 path names -- but it will at least throw an error rather
> than segfault.)  CB_setf() won't work for PostingPool, but there are still
> plenty of alternatives to Mimic_Str().  With a little work, I think we can
> eliminate Mimic_Str().

+1

>>      * Cat_Str
>>      * Cat_Trusted_Str
>>      * Starts_With_Str
>>      * Ends_With_Str
>>      * Find_Str
>>      * Equals_Str
>>
>> These should probably expect the char* to always be in UTF-8.
>
> Yes.  And I think we should rename them *_UTF8 instead of *_Str to reflect
> that fact.  That will both clear up what encoding they expect and eliminate
> potential confusion with String.

+1
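
As a sketch of what the renamed functions could look like when they take a UTF-8 pointer plus a byte length: the demo_* names and signatures below are made up for this example and are not the actual CharBuf methods. The sketch also assumes that both arguments are already valid UTF-8. Under that assumption a plain byte comparison is enough, because a valid needle always starts on a lead byte and so can only match on a code point boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical *_UTF8 helpers. All lengths are byte counts, and both
 * arguments are assumed to be valid UTF-8 (no validation is done here). */

/* True if `text` begins with the bytes of `prefix`. */
static bool
demo_starts_with_utf8(const char *text, size_t text_len,
                      const char *prefix, size_t prefix_len) {
    assert(text != NULL && prefix != NULL);
    return prefix_len <= text_len
           && memcmp(text, prefix, prefix_len) == 0;
}

/* True if `text` ends with the bytes of `suffix`. */
static bool
demo_ends_with_utf8(const char *text, size_t text_len,
                    const char *suffix, size_t suffix_len) {
    assert(text != NULL && suffix != NULL);
    return suffix_len <= text_len
           && memcmp(text + text_len - suffix_len, suffix, suffix_len) == 0;
}

/* True if both byte sequences are identical. */
static bool
demo_equals_utf8(const char *text, size_t text_len,
                 const char *other, size_t other_len) {
    assert(text != NULL && other != NULL);
    return text_len == other_len && memcmp(text, other, text_len) == 0;
}
```

Passing the length explicitly avoids a strlen call and makes it obvious that the argument is raw UTF-8 rather than a String object.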

Nick

