couchdb-user mailing list archives

From: Paul Davis
Subject: Re: Locale and rule based view collation
Date: Mon, 27 Sep 2010 07:08:03 GMT
On Mon, Sep 27, 2010 at 2:59 AM, Noah Diewald wrote:
> On Sun, Sep 26, 2010 at 7:43 PM, Paul Davis wrote:
>> On Sun, Sep 26, 2010 at 8:37 PM, Noah Diewald wrote:
>>> On Sat, Sep 25, 2010 at 6:38 PM, Paul Davis wrote:
>>>> On Sat, Sep 25, 2010 at 7:21 PM, Chris Anderson wrote:
>>>>> On Sat, Sep 18, 2010 at 4:47 PM, Noah Diewald wrote:
>>>>>> I was wondering if there were any plans to make use of more of the
>>>>>> collation API in CouchDB.
>>>>>> I'm using CouchDB to make natural language documentation software, and
>>>>>> it seems like a shame that I might have to use ICU for creating sort
>>>>>> keys to get sort orders right for view keys in certain languages when
>>>>>> ICU is already used internally by CouchDB. It kind of looks like
>>>>>> something could be added in at about the same place as the option for
>>>>>> case or no case collations in couch_icu_driver.c but I feel under
>>>>>> qualified to play around with it. I think that having an option in a
>>>>>> view to specify collation customization would be really great and
>>>>>> must be something that even people working with less obscure languages
>>>>>> than I am could benefit from.
>>>>> we definitely plan to make this configurable, just a matter of writing
>>>>> code. for now there might be a way to set it on a per-server-instance
>>>>> basis with environment variables. I am no expert on the topic, but I
>>>>> vaguely recall someone mentioning this possibility.
>>>>> Chris
>>>>>> --
>>>>>> Noah Diewald
>>>>> --
>>>>> Chris Anderson
>>>> I'm pretty sure that Chris is right that there's a server wide
>>>> environment setting that affects ICU collation, but I can't say with
>>>> any certainty.
>>>> It's always been on the to-do list to provide the ability to have
>>>> language based sorts that are defined at the view or database level,
>>>> but as Chris points out, no one's gotten around to doing that.
>>>> Currently the major issues would revolve around recoding the
>>>> icu_driver to have smarts in how it's created, as well as refactoring
>>>> how we access the driver.
>>>> If we bumped our minimum Erlang VM version to R13, writing this as a
>>>> NIF would probably be orders of magnitude easier because of resource
>>>> types and what not.
>>>> Once those hard parts are figured out, exposing it to the outside
>>>> world should be as easy as going through the bike shedding motions on
>>>> what the _design/doc syntax would look like.
>>>> HTH,
>>>> Paul Davis
>>> It is great to know that this type of thing is on the todo list. If
>>> custom rules were supported and not just predefined locales, some of
>>> the questionable NIFs I'm writing to make sort keys in my application
>>> layer could be removed some day and life would be simpler.
>>> I don't think that the environment variables help me personally with
>>> supporting multiple languages with different sort orders, especially
>>> since the collation customizations for two of the languages that I'm
>>> focusing on require custom rules. It would be really awesome if
>>> CouchDB supported ICU custom collation rules in views right out of the
>>> box. It might go a long way to making CouchDB a favorite with
>>> linguists. (CouchDB should be a favorite with linguists anyway because
>>> it is such a pleasure to use but this could make it extra favorite.)
>>> Thank you both for the replies.
>>> --
>>> Noah Diewald
>> I'm not sure what you mean by custom rules. I'm not extremely familiar
>> with the collation API, but as I recall it had a thing that allowed a
>> user to pass a string based config to it that it would use to affect
>> the collation algorithm. Are you needing something beyond that?
>> Paul Davis
> I don't think I'm needing anything more if we're talking about the
> same thing but maybe we're not.
> Sorry about the "customization rule" stuff. Now that I look back, the
> ICU documentation consistently calls them tailoring rules, sorry to be
> unclear. I'm just learning this stuff.
> Here is my understanding of instantiating ICU collators just to see if
> we are on the same page.
> There are two ways of instantiating collators. The predefined
> collators are instantiated with locale strings like "en_US". Custom
> collators are instantiated using tailoring rules.[1]
> The ICU users guide says that a tailoring rule "overrides the default
> order of code points and the values of the ICU Collation Service
> attributes"[2], which seems like a strange definition because
> tailoring allows one to specify complex base letters that consist of
> more than one code point. UTS 10 says "Tailoring is any well-defined
> syntax that takes the Default Unicode Collation Element Table and
> produces another well-formed Unicode Collation Element Table."[3] In
> ICU a tailoring rule is a string that looks like this:
> "& C < č <<< Č < ć <<< Ć"
> So a string is used for configuration in both cases of collator
> instantiation but a different api function is used to instantiate the
> collator depending on whether one is using a predefined collator or a
> tailoring rule. Any way of instantiating an ICU collator other than
> passing in an empty string or "root" as the locale may or may not
> result in a custom UCET derived from the DUCET so it was not a good
> idea to just talk about customization since that is vague.
> I'm dealing with languages that require tailoring and it is likely
> that most people wouldn't need tailoring just to be able to use a
> specific language for a specific view and that specifying a locale
> would be just fine. On the other hand, tailoring is very powerful and
> could be used to customize collation for reasons other than matching
> the alphabet of a rare language.
> Another aspect of what I need is that I specifically need different
> collation algorithms for different views. In one case I'll want to
> sort by English, in another I'll want to sort by Potawatomi or
> Menominee or something else.
> 1.
> 2.
> 3.
> --
> Noah Diewald

Cool. I'm not concerned about a small difference in which API function
gets selected. I was just concerned for a bit that you were doing things
like passing function pointers to an API, which would increase the
overhead by a couple of orders of magnitude. My earlier characterization
of the level of difficulty is still about right.

Paul Davis
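
For readers following the thread, here is a rough sketch of what a
tailoring rule such as "& C < č <<< Č < ć <<< Ć" achieves, and of the
kind of application-layer sort key Noah mentions building. It is plain
Python and deliberately does not use ICU or implement the Unicode
Collation Algorithm. The ALPHABET list and the sort_key function are
hypothetical names invented for this illustration, and the ordering
(including "ch" as a single letter after "h") is an assumption chosen
only to show a base letter made of more than one code point.

```python
# Simplified illustration only: this is NOT ICU and NOT a UCA implementation.
# ALPHABET is a hypothetical tailored ordering made up for this example.
ALPHABET = ["a", "c", "č", "ć", "d", "e", "h", "ch", "i"]

# Primary weight: position of the base letter in the tailored alphabet.
PRIMARY = {letter: i + 1 for i, letter in enumerate(ALPHABET)}
MAX_LEN = max(len(letter) for letter in ALPHABET)


def sort_key(word):
    """Return (primary weights, case weights); compare the result as a tuple."""
    primary, tertiary = [], []
    i = 0
    while i < len(word):
        # Try the longest chunk first so "ch" is read as one letter.
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk.lower() in PRIMARY:
                primary.append(PRIMARY[chunk.lower()])
                # Lowercase before uppercase, in the spirit of "č <<< Č".
                tertiary.append(0 if chunk == chunk.lower() else 1)
                i += size
                break
        else:
            raise ValueError(f"letter not in alphabet: {word[i]!r}")
    return (primary, tertiary)


words = ["cha", "Ča", "ci", "ća", "ča", "ca", "ha"]
print(sorted(words, key=sort_key))
# ['ca', 'ci', 'ča', 'Ča', 'ća', 'ha', 'cha']
```

Case differences only break ties between words whose primary weights
are equal, which is why "ča" and "Ča" stay adjacent, and "cha" sorts
after "ha" because "ch" is treated as one letter. A real ICU collator
built from tailoring rules handles far more than this (normalization,
accents, expansions), so the sketch shows the idea rather than a
replacement for it.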
