couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: View keys case-insensitive?
Date Thu, 09 Apr 2009 19:28:12 GMT
I've spent entirely too long on this now and I still can't for the
life of me figure out why A < aa.

So far I've found out that unicode collation is rather crazy
complicated. Which might be obvious, but really, 3 hours of reading
and I'm still no closer to figuring out why a < A < aa.

Beyond that, we also need to record the ICU library version alongside
anything that uses collation. There are warnings in the collation
documentation that these things can change, both because language experts
add knowledge to the algorithms and because governments make changes.
Yeah, start petitioning your local representative to maintain the
status quo for collations. `$SOVIET_RUSSIA_JOKE` lolz.
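
As a rough sketch of what recording the version could look like: the
snippet below keeps an ICU version string in a hypothetical index header
and flags the index for a rebuild when the running library reports a
different one. The header shape, the needsRebuild name and the version
numbers are made up for illustration and are not CouchDB code;
process.versions.icu is what Node.js exposes, and other runtimes would
need another source for the version.

```javascript
// Hypothetical: an index header that remembers which ICU version sorted it.
function needsRebuild(indexHeader, currentIcuVersion) {
  // A missing or different version means the stored order can't be trusted.
  return !indexHeader || indexHeader.icuVersion !== currentIcuVersion;
}

// Node.js reports its bundled ICU version here; fall back if unavailable.
const currentIcuVersion =
  (typeof process !== "undefined" && process.versions && process.versions.icu) ||
  "unknown";

// Illustrative header written by an older build (version string is invented).
const storedHeader = { icuVersion: "4.0" };

if (needsRebuild(storedHeader, currentIcuVersion)) {
  console.log("collation library changed, rebuild the index");
}
```

Comparing the full version string is the blunt option: it rebuilds more
often than strictly necessary, but never trusts an index that was sorted
under different rules.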

So anyway, I can make A < a < aa but not a < aa < A and I have no idea
why. So if anyone wants to sift through the ICU collation literature
and figure out which options need to be tweaked to get that ordering,
that'd be cool. Also, for reference, the configurable collation
options should be pretty easy to accomplish by just passing a string
that the ICU library uses to set up the different rule systems. Anyway,
I'm done reading about things like how v and w are the same letter in
Swedish.
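
For anyone who wants to poke at the orderings without rebuilding the
driver, here is a minimal sketch. It assumes a JavaScript runtime whose
Intl.Collator is backed by ICU (Node.js, for example); Intl.Collator and
its caseFirst option are not part of the CouchDB driver and only stand in
for the UCOL_CASE_FIRST attribute from the patch below. The "en" locale is
also an assumption.

```javascript
// Same keys as in the test case quoted below, in scrambled order.
const shuffled = ["bb", "B", "aa", "A", "ba", "b", "a"];

// Roughly UCOL_CASE_FIRST = UCOL_LOWER_FIRST and UCOL_UPPER_FIRST.
const lowerFirst = new Intl.Collator("en", { caseFirst: "lower" });
const upperFirst = new Intl.Collator("en", { caseFirst: "upper" });

const sortedLower = [...shuffled].sort(lowerFirst.compare);
const sortedUpper = [...shuffled].sort(upperFirst.compare);

// Default Array sort compares UTF-16 code units, i.e. plain ASCII order here.
const sortedCodeUnit = [...shuffled].sort();

console.log(sortedLower.join(" < "));    // a < A < aa < b < B < ba < bb
console.log(sortedUpper.join(" < "));    // A < a < aa < B < b < ba < bb
console.log(sortedCodeUnit.join(" < ")); // A < B < a < aa < b < ba < bb
```

Neither setting moves "aa" ahead of "A". That is consistent with the
collator comparing base letters across the whole string first and only
using case as a tiebreaker between strings that are otherwise equal.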

HTH,
Paul Davis

On Thu, Apr 9, 2009 at 1:00 PM, Damien Katz <damien@apache.org> wrote:
> User collation settings (case, accent, sensitive, locale, etc) should be an
> option for views if anyone wants to take that on.
>
> On Apr 9, 2009, at 12:49 PM, Paul Davis wrote:
>
>> Oddly enough, this is expected behavior:
>>
>>   values.push("a");
>>   values.push("A");
>>   values.push("aa");
>>   values.push("b");
>>   values.push("B");
>>   values.push("ba");
>>   values.push("bb");
>>
>> Even fiddling with the ICU collation options I couldn't get it to sort
>> any differently.
>
> Did you recreate the indexes from scratch? Otherwise they'll still be sorted
> with the old collation.
>
> -Damien
>
>
>> I'm not sure if there's an explanation that I'm
>> missing or what but it sure seems like "aa" should come before "A" for
>> case-sensitive sorting. Unless of course it's doing something dumb like
>> sorting right to left in which case "a" > null.
>>
>> No idea.
>>
>> Paul Davis
>>
>> Index: src/couchdb/couch_erl_driver.c
>> ===================================================================
>> --- src/couchdb/couch_erl_driver.c      (revision 762581)
>> +++ src/couchdb/couch_erl_driver.c      (working copy)
>> @@ -22,6 +22,8 @@
>> #define U_DISABLE_RENAMING 1
>> #endif
>>
>> +#include <stdio.h>
>> +
>> #include "erl_driver.h"
>> #include "unicode/ucol.h"
>> #include "unicode/ucasemap.h"
>> @@ -63,13 +65,25 @@
>>        return ERL_DRV_ERROR_GENERAL;
>>    }
>>
>> +    ucol_setAttribute(pData->coll, UCOL_CASE_FIRST, UCOL_LOWER_FIRST, &status);
>> +    if(U_FAILURE(status)) {
>> +        couch_drv_stop((ErlDrvData)pData);
>> +        return ERL_DRV_ERROR_GENERAL;
>> +    }
>> +
>> +    ucol_setAttribute(pData->coll, UCOL_CASE_LEVEL, UCOL_ON, &status);
>> +    if(U_FAILURE(status)) {
>> +        couch_drv_stop((ErlDrvData)pData);
>> +        return ERL_DRV_ERROR_GENERAL;
>> +    }
>> +
>>    pData->collNoCase = ucol_open("", &status);
>>    if (U_FAILURE(status)) {
>>        couch_drv_stop((ErlDrvData)pData);
>>        return ERL_DRV_ERROR_GENERAL;
>>    }
>>
>> On Thu, Apr 9, 2009 at 6:53 AM, Brian Candler <B.Candler@pobox.com> wrote:
>>>
>>> I was very surprised to find that view keys seem to be case-insensitive
>>> when using startkey and endkey:
>>>
>>> $ curl -X POST -d '{"map":"function(doc) { emit(doc.foo, null); }"}'
>>> 'http://127.0.0.1:5984/test_suite_db/_temp_view?startkey="a"&endkey="az"'
>>> {"total_rows":26,"offset":7,"rows":[
>>> {"id":"7","key":"a","value":null},
>>> {"id":"8","key":"A","value":null},    <<<< huh?!
>>> {"id":"9","key":"aa","value":null}
>>> ]}
>>>
>>> But not when fetching them individually:
>>>
>>> $ curl -X POST -d '{"map":"function(doc) { emit(doc.foo, null); }"}'
>>> 'http://127.0.0.1:5984/test_suite_db/_temp_view?key="a"'
>>> {"total_rows":26,"offset":7,"rows":[
>>> {"id":"7","key":"a","value":null}
>>> ]}
>>> $ curl -X POST -d '{"map":"function(doc) { emit(doc.foo, null); }"}'
>>> 'http://127.0.0.1:5984/test_suite_db/_temp_view?key="A"'
>>> {"total_rows":26,"offset":8,"rows":[
>>> {"id":"8","key":"A","value":null}
>>> ]}
>>>
>>> (Ditto for startkey="a"&endkey="a", or startkey="A"&endkey="A")
>>>
>>> At http://wiki.apache.org/couchdb/View_collation it says that view keys
>>> are case-sensitive, which normally means that "A" does not appear in the
>>> range "a" to "aa". And with normal ASCII ordering I would expect "A" to
>>> sort before "a", as is the case with Javascript:
>>>
>>> js> "a" < "A"
>>> false
>>>
>>> Could someone please explain to me what's going on? This may also explain
>>> my recent report COUCHDB-324 where tilde does not collate where I'd expect.
>>>
>>> I am running a recent SVN build:
>>> {"couchdb":"Welcome","version":"0.9.0a762247"}
>>>
>>> Thanks,
>>>
>>> Brian.
>>>
>
>
