lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] CharBuf functions taking char* arguments
Date Mon, 22 Apr 2013 04:06:23 GMT
On Fri, Apr 19, 2013 at 10:14 AM, Nick Wellnhofer <wellnhofer@aevum.de> wrote:
> How should the CharBuf functions and methods that take a char* argument
> behave if we start to support multiple encodings?

The optimal answers to these questions may change after we introduce an
immutable String class.  String is the use case that needs all the convenience
constructors.  CharBuf probably needs only one constructor, which takes a size
for its buffer.

    public inert incremented CharBuf*
    new(size_t size);
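
To make the single-constructor idea concrete, here is a rough sketch in plain
C.  Everything in it is hypothetical: SketchBuf, SketchBuf_new(),
SketchBuf_cat_bytes() and SketchBuf_destroy() are illustration-only names, not
the actual CharBuf implementation or Clownfish API.  The point is only that a
mutable buffer can get by with "allocate N bytes" at construction time, with
all content arriving later through append-style methods.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CharBuf: a growable byte buffer. */
typedef struct SketchBuf {
    char   *ptr;   /* heap storage, always NUL-terminated */
    size_t  size;  /* bytes of content, not counting the NUL */
    size_t  cap;   /* bytes allocated */
} SketchBuf;

/* The only constructor: reserve room for `size` bytes of content. */
SketchBuf*
SketchBuf_new(size_t size) {
    SketchBuf *self = (SketchBuf*)malloc(sizeof(SketchBuf));
    if (self == NULL) { return NULL; }
    self->cap = size + 1;  /* one extra byte for the terminating NUL */
    self->ptr = (char*)malloc(self->cap);
    if (self->ptr == NULL) { free(self); return NULL; }
    self->ptr[0] = '\0';
    self->size   = 0;
    return self;
}

/* Append raw bytes, growing the buffer as needed.
 * Returns 0 on success, -1 if memory could not be allocated. */
int
SketchBuf_cat_bytes(SketchBuf *self, const char *bytes, size_t len) {
    assert(self != NULL && (bytes != NULL || len == 0));
    size_t needed = self->size + len + 1;
    if (needed > self->cap) {
        size_t new_cap = self->cap * 2 > needed ? self->cap * 2 : needed;
        char *grown = (char*)realloc(self->ptr, new_cap);
        if (grown == NULL) { return -1; }
        self->ptr = grown;
        self->cap = new_cap;
    }
    if (len > 0) { memcpy(self->ptr + self->size, bytes, len); }
    self->size += len;
    self->ptr[self->size] = '\0';
    return 0;
}

/* Release the buffer and its storage. */
void
SketchBuf_destroy(SketchBuf *self) {
    if (self != NULL) {
        free(self->ptr);
        free(self);
    }
}
```

A caller would create the buffer with SketchBuf_new(4), append with
SketchBuf_cat_bytes(), and read the result from buf->ptr.  The convenience
constructors (from UTF-8, from trusted UTF-8, and so on) would then live on
the immutable String side.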

Another thing to consider is which encodings we might support via raw content
constructors.  I suspect we'll never go beyond common Unicode: UTF-8,
UTF-16LE, UTF-16BE, and possibly UTF-32LE and UTF-32BE.  Anything else belongs
in a library a la Perl's Encode.

>     * new_from_utf8
>     * new_from_trusted_utf8
>     * new_steal_str
>     * new_steal_from_trusted_str
>
> These should obviously expect the char* to have the same encoding as the
> CharBuf to be created.

Agreed on the first two.

It would be nice if we could eliminate the "steal" variants -- they tend to
constrain how we implement String internally.

>     * Mimic_Str
>
> Not sure about that one.

Since Mimic_Str() changes content, it's only relevant for CharBuf, not String.

Lucy currently uses Mimic_Str() in two places: FSDirHandle and PostingPool.
If FSDirHandle continues to use CharBuf instead of switching to String it
could use CB_setf() instead.  (FSDirHandle's usage is bogus anyway because
it's assuming UTF-8 path names -- but it will at least throw an error rather
than segfault.)  CB_setf() won't work for PostingPool, but there are still
plenty of alternatives to Mimic_Str().  With a little work, I think we can
eliminate Mimic_Str() entirely.

But we might also treat it like these:

>     * Cat_Str
>     * Cat_Trusted_Str
>     * Starts_With_Str
>     * Ends_With_Str
>     * Find_Str
>     * Equals_Str
>
> These should probably expect the char* to always be in UTF-8.

Yes.  And I think we should rename them *_UTF8 instead of *_Str to reflect
that fact.  That will both clear up what encoding they expect and eliminate
potential confusion with String.
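
As an illustration of what the *_UTF8 naming promises, here is a small sketch
in plain C of a validating append and its "trusted" counterpart.  All names
(sketch_utf8_valid, sketch_cat_utf8, sketch_cat_trusted_utf8) and the
caller-supplied destination buffer are assumptions made for the example; this
is not the Lucy or Clownfish code.  The untrusted variant checks that the
bytes are well-formed UTF-8 before appending, while the trusted variant skips
the check because the caller vouches for the input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if the `len` bytes at `ptr` are well-formed
 * UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). */
bool
sketch_utf8_valid(const char *ptr, size_t len) {
    const uint8_t *s = (const uint8_t*)ptr;
    size_t i = 0;
    while (i < len) {
        uint8_t  lead = s[i];
        size_t   need;  /* number of continuation bytes expected */
        uint32_t min;   /* smallest code point allowed at this length */
        uint32_t cp;
        if (lead < 0x80)                { i++; continue; }
        else if ((lead & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp = lead & 0x07; }
        else                            { return false; }
        if (len - i <= need) { return false; }  /* sequence is truncated */
        for (size_t k = 1; k <= need; k++) {
            uint8_t c = s[i + k];
            if ((c & 0xC0) != 0x80) { return false; }
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) { return false; }
        if (cp >= 0xD800 && cp <= 0xDFFF) { return false; }
        i += need + 1;
    }
    return true;
}

/* Append without validation; the caller vouches for the bytes.
 * Returns false if the destination has no room. */
bool
sketch_cat_trusted_utf8(char *dest, size_t *dest_len, size_t dest_cap,
                        const char *utf8, size_t len) {
    assert(dest != NULL && dest_len != NULL && *dest_len <= dest_cap);
    if (len > dest_cap - *dest_len) { return false; }
    if (len > 0) { memcpy(dest + *dest_len, utf8, len); }
    *dest_len += len;
    return true;
}

/* Validate first, then defer to the trusted variant.
 * Returns false, leaving `dest` untouched, on invalid input. */
bool
sketch_cat_utf8(char *dest, size_t *dest_len, size_t dest_cap,
                const char *utf8, size_t len) {
    if (!sketch_utf8_valid(utf8, len)) { return false; }
    return sketch_cat_trusted_utf8(dest, dest_len, dest_cap, utf8, len);
}
```

With the encoding in the function name, a caller can see at the call site that
the char* must be UTF-8 regardless of how the receiving object stores its
content internally.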

For the record, I don't imagine any of these functions ever getting exposed
outside of a C context.

Marvin Humphrey
