couchdb-user mailing list archives

From Noah Diewald
Subject Re: Locale and rule based view collation
Date Mon, 27 Sep 2010 06:59:23 GMT
On Sun, Sep 26, 2010 at 7:43 PM, Paul Davis wrote:
> On Sun, Sep 26, 2010 at 8:37 PM, Noah Diewald wrote:
>> On Sat, Sep 25, 2010 at 6:38 PM, Paul Davis wrote:
>>> On Sat, Sep 25, 2010 at 7:21 PM, Chris Anderson wrote:
>>>> On Sat, Sep 18, 2010 at 4:47 PM, Noah Diewald wrote:
>>>>> I was wondering if there were any plans to make use of more of the ICU
>>>>> collation API in CouchDB.
>>>>> I'm using CouchDB to make natural language documentation software and
>>>>> it seems like a shame that I might have to use ICU for creating sort
>>>>> keys to get sort orders right for view keys in certain languages when
>>>>> ICU is already used internally by CouchDB. It kind of looks like
>>>>> something could be added in at about the same place as the option for
>>>>> case or no case collations in couch_icu_driver.c but I feel
>>>>> underqualified to play around with it. I think that having an option in the
>>>>> view to specify collation customization would be really great and it
>>>>> must be something that even people working with less obscure languages
>>>>> than I am could benefit from.
>>>> We definitely plan to make this configurable; it's just a matter of
>>>> writing code. For now there might be a way to set it on a
>>>> per-server-instance basis with environment variables. I am no expert
>>>> on the topic, but I vaguely recall someone mentioning this possibility.
>>>> Chris
>>>>> --
>>>>> Noah Diewald
>>>> --
>>>> Chris Anderson
>>> I'm pretty sure that Chris is right that there's a server wide
>>> environment setting that affects ICU collation, but I can't say with
>>> any certainty.
>>> It's always been on the to-do list to provide the ability to have
>>> language based sorts that are defined at the view or database level,
>>> but as Chris points out, no one's gotten around to doing that.
>>> Currently the major issues would revolve around recoding the
>>> icu_driver to have smarts in how it's created, as well as refactoring
>>> how we access the driver.
>>> If we bumped our minimum Erlang VM version to R13, writing this as a
>>> NIF would probably be orders of magnitude easier because of resource
>>> types and whatnot.
>>> Once those hard parts are figured out, exposing it to the outside
>>> world should be as easy as going through the bike shedding motions on
>>> what the _design/doc syntax would look like.
>>> HTH,
>>> Paul Davis
>> It is great to know that this type of thing is on the todo list. If
>> custom rules were supported and not just predefined locales, some of
>> the questionable NIFs I'm writing to make sort keys in my application
>> layer could be removed some day and life would be simpler.
>> I don't think that the environment variables help me personally with
>> supporting multiple languages with different sort orders, especially
>> since the collation customizations for two of the languages that I'm
>> focusing on require custom rules. It would be really awesome if
>> CouchDB supported ICU custom collation rules in views right out of the
>> box. It might go a long way to making CouchDB a favorite with
>> linguists. (CouchDB should be a favorite with linguists anyway because
>> it is such a pleasure to use but this could make it extra favorite.)
>> Thank you both for the replies.
>> --
>> Noah Diewald
> I'm not sure what you mean by custom rules. I'm not extremely familiar
> with the collation API, but as I recall it had a thing that allowed a
> user to pass a string based config to it that it would use to affect
> the collation algorithm. Are you needing something beyond that?
> Paul Davis

I don't think I need anything more if we're talking about the same
thing, but maybe we're not.

Sorry about the "customization rule" stuff. Now that I look back, the
ICU documentation consistently calls them tailoring rules. I'm just
learning this stuff.

Here is my understanding of instantiating ICU collators, just to see
if we are on the same page.

There are two ways of instantiating collators. The predefined
collators are instantiated with locale strings like "en_US". Custom
collators are instantiated using tailoring rules.[1]

The ICU User Guide says that a tailoring rule "overrides the default
order of code points and the values of the ICU Collation Service
attributes"[2], which seems like a strange definition because
tailoring also allows one to specify complex base letters that consist
of more than one code point. UTS 10 says "Tailoring is any well-defined
syntax that takes the Default Unicode Collation Element Table and
produces another well-formed Unicode Collation Element Table."[3] In
ICU a tailoring rule is a string that looks like this:

"& C < č <<< Č < ć <<< Ć"

So a string is used for configuration in both cases, but a different
API function instantiates the collator depending on whether one is
using a predefined locale or a tailoring rule (in the C API,
ucol_open and ucol_openRules respectively). Any way of instantiating
an ICU collator other than passing in an empty string or "root" as the
locale may or may not result in a custom UCET derived from the DUCET,
so it was not a good idea to talk about "customization" in general,
since that is vague.
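
To make the tailoring rule quoted above a little more concrete, here is
a rough Python sketch of what it expresses. It does not use ICU at all:
it is a toy two-level sort key, and the names PRIMARY and sort_key are
made up for this illustration. It assumes only the letters a-z plus the
four tailored letters, and it ignores everything else a real collator
handles (secondary weights, contractions, normalization).

```python
import string

# Primary weights for the untailored base letters a..z, spaced out so that
# tailored letters can be slotted in between neighbours.
PRIMARY = {ch: (i + 1) * 10 for i, ch in enumerate(string.ascii_lowercase)}

# Toy reading of the rule "& C < č <<< Č < ć <<< Ć":
#   '<'   places a letter after the previous one at the primary level
#   '<<<' makes it differ from the previous one only at the tertiary
#         (case) level
# So č and ć become separate base letters between c and d, and Č / Ć are
# their upper-case variants.
PRIMARY["č"] = PRIMARY["c"] + 1
PRIMARY["ć"] = PRIMARY["c"] + 2


def sort_key(word):
    """Return (primary weights, tertiary weights) for a word.

    Raises KeyError for letters outside the toy alphabet.
    """
    primary = tuple(PRIMARY[ch.lower()] for ch in word)
    tertiary = tuple(1 if ch.isupper() else 0 for ch in word)
    return (primary, tertiary)


words = ["dan", "ćup", "cup", "Čas", "čas", "car"]
print(sorted(words, key=sort_key))
# ['car', 'cup', 'čas', 'Čas', 'ćup', 'dan']
```

Sorting the same list by raw code point would instead put "dan" before
all of the č and ć words, which is the kind of result the rule is
meant to avoid.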

I'm dealing with languages that require tailoring. Most people
probably wouldn't need tailoring just to be able to use a specific
language for a specific view; specifying a locale would be just fine.
On the other hand, tailoring is very powerful and could be used to
customize collation for reasons other than matching the alphabet of a
rare language.

Another aspect of what I need is different collators for different
views. In one case I'll want to sort by English, in another I'll want
to sort by Potawatomi or Menominee or something else.
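
As a sketch of the per-view idea, the snippet below keeps one sort-key
function per view name and sorts each view's keys with its own one.
Everything here is hypothetical: CouchDB has no such registry, the view
names are invented, and the "tailored" alphabet is just the č/ć example
from the rule above rather than a real Potawatomi or Menominee
ordering.

```python
def alphabet_key(alphabet):
    """Build a sort-key function from an ordered list of letters."""
    rank = {letter: i for i, letter in enumerate(alphabet)}

    def key(word):
        # Case is ignored in this toy; unknown letters raise KeyError.
        return tuple(rank[ch] for ch in word.lower())

    return key


ENGLISH = list("abcdefghijklmnopqrstuvwxyz")
# Same letters with č and ć slotted in after c.
TAILORED = list("abc") + ["č", "ć"] + list("defghijklmnopqrstuvwxyz")

# Hypothetical registry: one collation per view.
VIEW_COLLATION = {
    "by_english_gloss": alphabet_key(ENGLISH),
    "by_headword": alphabet_key(TAILORED),
}


def sorted_view(view_name, keys):
    """Sort a view's keys with the collation registered for that view."""
    return sorted(keys, key=VIEW_COLLATION[view_name])


print(sorted_view("by_english_gloss", ["dog", "cat", "Apple"]))
print(sorted_view("by_headword", ["dan", "ćup", "čas", "cup"]))
```

The point is only that the choice of collation hangs off the view, not
off the server, so two views over the same documents can order their
keys differently.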


Noah Diewald
