incubator-couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Locale and rule based view collation
Date Mon, 27 Sep 2010 07:08:03 GMT
On Mon, Sep 27, 2010 at 2:59 AM, Noah Diewald <noah.diewald@gmail.com> wrote:
> On Sun, Sep 26, 2010 at 7:43 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>> On Sun, Sep 26, 2010 at 8:37 PM, Noah Diewald <noah.diewald@gmail.com> wrote:
>>> On Sat, Sep 25, 2010 at 6:38 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>>>> On Sat, Sep 25, 2010 at 7:21 PM, Chris Anderson <jchris@apache.org> wrote:
>>>>> On Sat, Sep 18, 2010 at 4:47 PM, Noah Diewald <noah.diewald@gmail.com> wrote:
>>>>>> I was wondering if there were any plans to make use of more of the ICU
>>>>>> collation API in CouchDB.
>>>>>>
>>>>>> I'm using CouchDB to make natural language documentation software and
>>>>>> it seems like a shame that I might have to use ICU for creating sort
>>>>>> keys to get sort orders right for view keys in certain languages when
>>>>>> ICU is already used internally by CouchDB. It kind of looks like
>>>>>> something could be added in at about the same place as the option for
>>>>>> case or no case collations in couch_icu_driver.c but I feel under
>>>>>> qualified to play around with it. I think that having an option in the
>>>>>> view to specify collation customization would be really great and it
>>>>>> must be something that even people working with less obscure languages
>>>>>> than I am could benefit from.
>>>>>>
>>>>>
>>>>> we definitely plan to make this configurable, just a matter of writing
>>>>> code. for now there might be a way to set it on a per-server-instance
>>>>> basis with environment variables. I am no expert on the topic, but I
>>>>> vaguely recall someone mentioning this possibility.
>>>>>
>>>>> Chris
>>>>>
>>>>>> --
>>>>>> Noah Diewald
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chris Anderson
>>>>> http://jchrisa.net
>>>>> http://couch.io
>>>>>
>>>>
>>>> I'm pretty sure that Chris is right that there's a server wide
>>>> environment setting that affects ICU collation, but I can't say with
>>>> any certainty.
>>>>
>>>> It's always been on the to-do list to provide the ability to have
>>>> language based sorts that are defined at the view or database level,
>>>> but as Chris points out, no one's gotten around to doing that.
>>>> Currently the major issues would revolve around recoding the
>>>> icu_driver to have smarts in how it's created, as well as refactoring
>>>> how we access the driver.
>>>>
>>>> If we bumped our minimum Erlang VM version to R13, writing this as a
>>>> NIF would probably be orders of magnitude easier because of resource
>>>> types and what not.
>>>>
>>>> Once those hard parts are figured out, exposing it to the outside
>>>> world should be as easy as going through the bike shedding motions on
>>>> what the _design/doc syntax would look like.
>>>>
>>>> HTH,
>>>> Paul Davis
>>>>
>>>
>>> It is great to know that this type of thing is on the todo list. If
>>> custom rules were supported and not just predefined locales, some of
>>> the questionable NIFs I'm writing to make sort keys in my application
>>> layer could be removed some day and life would be simpler.
>>>
>>> I don't think that the environment variables help me personally with
>>> supporting multiple languages with different sort orders, especially
>>> since the collation customizations for two of the languages that I'm
>>> focusing on require custom rules. It would be really awesome if
>>> CouchDB supported ICU custom collation rules in views right out of the
>>> box. It might go a long way to making CouchDB a favorite with
>>> linguists. (CouchDB should be a favorite with linguists anyway because
>>> it is such a pleasure to use but this could make it extra favorite.)
>>>
>>> Thank you both for the replies.
>>>
>>> --
>>> Noah Diewald
>>>
>>
>> I'm not sure what you mean by custom rules. I'm not extremely familiar
>> with the collation API, but as I recall it had a thing that allowed a
>> user to pass a string based config to it that it would use to affect
>> the collation algorithm. Are you needing something beyond that?
>>
>> Paul Davis
>>
>
> I don't think I'm needing anything more if we're talking about the
> same thing but maybe we're not.
>
> Sorry about the "customization rule" stuff. Now that I look back, the
> ICU documentation consistently calls them tailoring rules, sorry to be
> unclear. I'm just learning this stuff.
>
> Here is my understanding of instantiating ICU collators just to see if
> we are on the same page.
>
> There are two ways of instantiating collators. The predefined
> collators are instantiated with locale strings like "en_US". Custom
> collators are instantiated using tailoring rules.[1]
>
> The ICU User Guide says that a tailoring rule "overrides the default
> order of code points and the values of the ICU Collation Service
> attributes",[2] which seems like a strange definition because
> tailoring allows one to specify complex base letters that consist of
> more than one code point. UTS 10 says "Tailoring is any well-defined
> syntax that takes the Default Unicode Collation Element Table and
> produces another well-formed Unicode Collation Element Table."[3] In
> ICU a tailoring rule is a string that looks like this:
>
> "& C < č <<< Č < ć <<< Ć"
>
> So a string is used for configuration in both cases of collator
> instantiation but a different api function is used to instantiate the
> collator depending on whether one is using a predefined collator or a
> tailoring rule. Any way of instantiating an ICU collator other than
> passing in an empty string or "root" as the locale may or may not
> result in a custom UCET derived from the DUCET so it was not a good
> idea to just talk about customization since that is vague.
>
> I'm dealing with languages that require tailoring and it is likely
> that most people wouldn't need tailoring just to be able to use a
> specific language for a specific view and that specifying a locale
> would be just fine. On the other hand, tailoring is very powerful and
> could be used to customize collation for reasons other than matching
> the alphabet of a rare language.
>
> Another aspect of what I need is that I specifically need different
> collation algorithms for different views. In one case I'll want to
> sort by English, in another I'll want to sort by Potawatomi or
> Menominee or something else.
>
> 1. http://userguide.icu-project.org/collation/api
> 2. http://userguide.icu-project.org/collation/customization
> 3. http://www.unicode.org/reports/tr10/
>
> --
> Noah Diewald
>

Cool. I'm not concerned about a small difference in which API function
gets selected. I was just concerned for a bit that you were doing things
like passing function pointers to an API, which would increase the
overhead by a couple of orders of magnitude. My earlier characterization
of the level of difficulty still seems about right.

Paul Davis
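
For readers following the thread, here is a rough sketch of what
application-level sort keys for a tailored alphabet can look like. It is
plain Python and does not use ICU or any CouchDB code; it only imitates
the primary and case-level ordering that a rule such as
"& C < č <<< Č < ć <<< Ć" expresses. The names ALPHABET and make_sort_key,
the digraph "dž", and the sample words are all assumptions made up for
illustration.

```python
# Toy illustration only: this is NOT ICU and not CouchDB code.
# ALPHABET and make_sort_key are hypothetical names invented for this sketch.

# Primary order for a made-up alphabet in which "č" and "ć" are separate
# letters after "c", and the digraph "dž" is a single letter after "d".
ALPHABET = ["a", "b", "c", "č", "ć", "d", "dž", "e"]

# Map each letter to its primary weight (its position in the alphabet).
PRIMARY = {letter: weight for weight, letter in enumerate(ALPHABET)}

# Try longer letters first so multi-code-point letters match greedily.
LETTERS_BY_LENGTH = sorted(ALPHABET, key=len, reverse=True)


def make_sort_key(word):
    """Return a tuple that orders words by the custom alphabet.

    The first element compares primary weights (which letter it is).
    The second element breaks ties by case, lowercase first, roughly
    like a tertiary difference in a collation rule.
    """
    primary = []
    tertiary = []
    i = 0
    while i < len(word):
        for letter in LETTERS_BY_LENGTH:
            chunk = word[i:i + len(letter)]
            if chunk.lower() == letter:
                primary.append(PRIMARY[letter])
                tertiary.append(0 if chunk == letter else 1)
                i += len(letter)
                break
        else:
            # Characters outside the alphabet sort after every known letter.
            primary.append(len(ALPHABET) + ord(word[i]))
            tertiary.append(0)
            i += 1
    return (tuple(primary), tuple(tertiary))


words = ["da", "Ča", "cb", "ća", "dža", "ča", "de", "ca"]
sorted_words = sorted(words, key=make_sort_key)
print(sorted_words)
# ['ca', 'cb', 'ča', 'Ča', 'ća', 'da', 'de', 'dža']
```

Keys like these could be emitted as view keys in place of the raw
strings, which is the workaround described earlier in the thread. A real
ICU collator handles far more (normalization, secondary weights,
contractions and expansions in general), so this is only meant to show
the shape of the idea.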
