couchdb-dev mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: View keys case-insensitive?
Date Thu, 09 Apr 2009 23:43:08 GMT
On Thu, Apr 9, 2009 at 4:31 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> Perhaps I'm not explaining my expectation well enough.
>
> The way I read the algorithm, the basic idea is that you take the
> weights for each character and concatenate them. Then you run through
> these representations and do a basic element-wise comparison.
>
> I know that a and A have the same primary and secondary weights. But
> they have different tertiary weights.
>
> Thus as I read the algorithm (most likely missing the important clause
> affecting this) then when I compare a and A, A > a. I don't see why
> any other character is considered. A is fucking bigger than a. If
> there is a section that says, "Oh, btw, if one string is an exact
> prefix of the other as defined solely by primary weights, then the
> prefix sorts first" i would be happy as a pig in shit.
>
>
>
> On Thu, Apr 9, 2009 at 7:18 PM, Patrick Antivackis
> <patrick.antivackis@gmail.com> wrote:
>> Paul,
>>
>> 2009/4/10 Paul Davis <paul.joseph.davis@gmail.com>
>>
>>> I've tried various combinations of UC_CASE_LEVEL, UC_CASE_FIRST, and
>>> UC_WEIGHT.
>>>
>>
>> This is really not enough. Doing this, you only tell the collation
>> that a <<< A or A <<< a (third element),
>> but it's still an a, whether upper, lower, with a tilde, with an accent, or whatever.
>> All are just variations of A, but still A.
>>
>> If you look at :
>> http://www.unicode.org/Public/UCA/latest/allkeys.txt
>>
>> and search for :
>>
>> 0061  ; [.1141.0020.0002.0061] # LATIN SMALL LETTER A
>>
>>
>> you will see a lot of A definitions, but all have the same first element,
>> 1141: they are all the same letter, and the others are just variations. So compared
>> to each other they have an order, but compared with another letter they
>> all behave the same, like an A.
>>
>> So, now if you want to change order of primary element, you need to use
>> custom tailoring :
>> http://userguide.icu-project.org/collation/customization
>>
>> And you need to say things like:
>> a < A (primary order)
>> So to simulate ASCII behaviour you should try something like:
>> a < b < c < d < ... < A < B < C ..., so almost retype the ASCII table.
>>
>> To be honest, I have not tried it, but that should work.
>>
>
> Let me reiterate. I do *not* want A > b. I want A fucking greater than
> a. Or not greater. But this relation:
>
> a < A < aa
>
> to me means:
>
> a < A < a
>
> Am I the only one that finds that just a bit ridiculous?
>

I read it as case-insensitive (aka a == A) except with a deterministic
winner (lowercase first) when the only difference between two strings
is due to case.
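
A minimal sketch of why that reading holds, assuming the sort-key
construction from UTS #10: all primary weights first, then a level
separator, then all secondary weights, then all tertiary weights. The
weights for "a" are the ones Patrick quoted from allkeys.txt; the tertiary
weight for "A" and the names WEIGHTS and sort_key are assumptions made up
for illustration, not ICU's API:

```ruby
# Collation elements as [primary, secondary, tertiary].
# "a" is the allkeys.txt entry quoted in this thread; the tertiary
# weight for "A" (0x0008) is an assumed value for illustration.
WEIGHTS = {
  "a" => [0x1141, 0x0020, 0x0002],
  "A" => [0x1141, 0x0020, 0x0008],
}.freeze

# Build a sort key: every primary weight, a 0 separator, every
# secondary weight, a 0 separator, then every tertiary weight.
def sort_key(str)
  elements = str.each_char.map { |ch| WEIGHTS.fetch(ch) }
  (0..2).map { |level| elements.map { |ce| ce[level] } }
        .reduce { |key, weights| key + [0] + weights }
end

# Arrays compare element by element, so sorting by key is enough.
sorted = %w[aa A a].sort_by { |s| sort_key(s) }
puts sorted.inspect
```

Because the level separator (0) sorts below any primary weight, "A" runs
out of primaries before "aa" does and so sorts first. The tertiary (case)
weights are only reached when all primaries and secondaries are equal, as
in "a" vs "A".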

> HTH,
> Paul Davis
>
> p.s. the cursing isn't directed at anyone here. I'm just fairly
> frustrated by that unicode algorithm.
>
>>
>>
>>
>>
>>
>>> Also, I still don't see anything in this damned collation algorithm
>>> that explains how A < aa.  And this doesn't fall into the big/biggest
>>> comparison. The similar case would be Big < biggest. But I don't see
>>> anything in the damn collation algorithm document that talks about
>>> ignoring anything after primary weight in the case that one string is
>>> a prefix. In the various examples that I see I can't find anything
>>> that would contradict that expectation.
>>>
>>> Paul Davis
>>>
>>> For reference, the algorithm reference I'm using is this one:
>>> http://unicode.org/reports/tr10/
>>>
>>> I feel like printing the entire thing just so I can have a book burning.
>>>
>>> On Thu, Apr 9, 2009 at 6:39 PM, Patrick Antivackis
>>> <patrick.antivackis@gmail.com> wrote:
>>> > By the way, what customization did you try to send to ICU ?
>>> >
>>> > 2009/4/10 Paul Davis <paul.joseph.davis@gmail.com>
>>> >
>>> >> Patrick,
>>> >>
>>> >> I'm not asking for this relationship:
>>> >>
>>> >> a < b < A < B
>>> >>
>>> >> Merely:
>>> >>
>>> >> a < aa < A
>>> >>
>>> >> The thing is that even when I try and specify explicitly that 'A'
>>> >> should come after 'a' I still can't get the expected "a < aa < A"
>>> >> behavior. In a nutshell, "Why the hell does the second 'a' alter the
>>> >> comparison?"
>>> >>
>>> >> HTH,
>>> >> Paul Davis
>>> >>
>>> >> On Thu, Apr 9, 2009 at 5:45 PM, Patrick Antivackis
>>> >> <patrick.antivackis@gmail.com> wrote:
>>> >> > It's quite normal as far as ICU is concerned.
>>> >> > ICU is about language not about ASCII code.
>>> >> > In ICU, case is the third element considered in a comparison (the same
>>> >> > level as circled letters in Nordic languages, for example), so it is not
>>> >> > very important. So when you sort words together, a or A is still an a,
>>> >> > so they are sorted nearby. In ICU you can specify whether you prefer a
>>> >> > before A or A before a, but not simply a before b before c ... before A
>>> >> > before B before C.
>>> >> >
>>> >> > To get such behavior (like ASCII) you need to customize ICU by specifying
>>> >> > the collation you want almost letter by letter.
>>> >> > That is great for you, but what about Japanese or Arabic users?
>>> >> >
>>> >> > So this is definitely the right behaviour of ICU sorting (collation).
>>> >> >
>>> >> >
>>> >> > 2009/4/9 Brian Candler <B.Candler@pobox.com>
>>> >> >
>>> >> >> > I've spent entirely too long on this now and I still can't for the
>>> >> >> > life of me figure out why A < aa.
>>> >> >>
>>> >> >> Time for an experimental, black-box approach:
>>> >> >>
>>> >> >> ----
>>> >> >> require 'rubygems'
>>> >> >> require 'restclient'
>>> >> >> require 'json'
>>> >> >>
>>> >> >> DB="http://127.0.0.1:5984/collator"
>>> >> >>
>>> >> >> RestClient.delete DB rescue nil
>>> >> >> RestClient.put "#{DB}",""
>>> >> >>
>>> >> >> (32..126).each do |c|
>>> >> >>  RestClient.put "#{DB}/#{c.to_s(16)}", {"x"=>c.chr}.to_json
>>> >> >> end
>>> >> >>
>>> >> >> RestClient.put "#{DB}/_design/test", <<EOS
>>> >> >> {
>>> >> >>  "views":{
>>> >> >>    "one":{
>>> >> >>      "map":"function (doc) { emit(doc.x,null); }"
>>> >> >>    }
>>> >> >>  }
>>> >> >> }
>>> >> >> EOS
>>> >> >>
>>> >> >> puts RestClient.get("#{DB}/_design/test/_view/one")
>>> >> >> ----
>>> >> >>
>>> >> >> This shows the collation sequence to be as follows.
>>> >> >>
>>> >> >> {"total_rows":95,"offset":0,"rows":[
>>> >> >> {"id":"20","key":" ","value":null},
>>> >> >> {"id":"60","key":"`","value":null},
>>> >> >> {"id":"5e","key":"^","value":null},
>>> >> >> {"id":"5f","key":"_","value":null},
>>> >> >> {"id":"2d","key":"-","value":null},
>>> >> >> {"id":"2c","key":",","value":null},
>>> >> >> {"id":"3b","key":";","value":null},
>>> >> >> {"id":"3a","key":":","value":null},
>>> >> >> {"id":"21","key":"!","value":null},
>>> >> >> {"id":"3f","key":"?","value":null},
>>> >> >> {"id":"2e","key":".","value":null},
>>> >> >> {"id":"27","key":"'","value":null},
>>> >> >> {"id":"22","key":"\"","value":null},
>>> >> >> {"id":"28","key":"(","value":null},
>>> >> >> {"id":"29","key":")","value":null},
>>> >> >> {"id":"5b","key":"[","value":null},
>>> >> >> {"id":"5d","key":"]","value":null},
>>> >> >> {"id":"7b","key":"{","value":null},
>>> >> >> {"id":"7d","key":"}","value":null},
>>> >> >> {"id":"40","key":"@","value":null},
>>> >> >> {"id":"2a","key":"*","value":null},
>>> >> >> {"id":"2f","key":"/","value":null},
>>> >> >> {"id":"5c","key":"\\","value":null},
>>> >> >> {"id":"26","key":"&","value":null},
>>> >> >> {"id":"23","key":"#","value":null},
>>> >> >> {"id":"25","key":"%","value":null},
>>> >> >> {"id":"2b","key":"+","value":null},
>>> >> >> {"id":"3c","key":"<","value":null},
>>> >> >> {"id":"3d","key":"=","value":null},
>>> >> >> {"id":"3e","key":">","value":null},
>>> >> >> {"id":"7c","key":"|","value":null},
>>> >> >> {"id":"7e","key":"~","value":null},
>>> >> >> {"id":"24","key":"$","value":null},
>>> >> >> {"id":"30","key":"0","value":null},
>>> >> >> {"id":"31","key":"1","value":null},
>>> >> >> {"id":"32","key":"2","value":null},
>>> >> >> {"id":"33","key":"3","value":null},
>>> >> >> {"id":"34","key":"4","value":null},
>>> >> >> {"id":"35","key":"5","value":null},
>>> >> >> {"id":"36","key":"6","value":null},
>>> >> >> {"id":"37","key":"7","value":null},
>>> >> >> {"id":"38","key":"8","value":null},
>>> >> >> {"id":"39","key":"9","value":null},
>>> >> >> {"id":"61","key":"a","value":null},
>>> >> >> {"id":"41","key":"A","value":null},
>>> >> >> {"id":"62","key":"b","value":null},
>>> >> >> {"id":"42","key":"B","value":null},
>>> >> >> {"id":"63","key":"c","value":null},
>>> >> >> {"id":"43","key":"C","value":null},
>>> >> >> {"id":"64","key":"d","value":null},
>>> >> >> {"id":"44","key":"D","value":null},
>>> >> >> {"id":"65","key":"e","value":null},
>>> >> >> {"id":"45","key":"E","value":null},
>>> >> >> {"id":"66","key":"f","value":null},
>>> >> >> {"id":"46","key":"F","value":null},
>>> >> >> {"id":"67","key":"g","value":null},
>>> >> >> {"id":"47","key":"G","value":null},
>>> >> >> {"id":"68","key":"h","value":null},
>>> >> >> {"id":"48","key":"H","value":null},
>>> >> >> {"id":"69","key":"i","value":null},
>>> >> >> {"id":"49","key":"I","value":null},
>>> >> >> {"id":"6a","key":"j","value":null},
>>> >> >> {"id":"4a","key":"J","value":null},
>>> >> >> {"id":"6b","key":"k","value":null},
>>> >> >> {"id":"4b","key":"K","value":null},
>>> >> >> {"id":"6c","key":"l","value":null},
>>> >> >> {"id":"4c","key":"L","value":null},
>>> >> >> {"id":"6d","key":"m","value":null},
>>> >> >> {"id":"4d","key":"M","value":null},
>>> >> >> {"id":"6e","key":"n","value":null},
>>> >> >> {"id":"4e","key":"N","value":null},
>>> >> >> {"id":"6f","key":"o","value":null},
>>> >> >> {"id":"4f","key":"O","value":null},
>>> >> >> {"id":"70","key":"p","value":null},
>>> >> >> {"id":"50","key":"P","value":null},
>>> >> >> {"id":"71","key":"q","value":null},
>>> >> >> {"id":"51","key":"Q","value":null},
>>> >> >> {"id":"72","key":"r","value":null},
>>> >> >> {"id":"52","key":"R","value":null},
>>> >> >> {"id":"73","key":"s","value":null},
>>> >> >> {"id":"53","key":"S","value":null},
>>> >> >> {"id":"74","key":"t","value":null},
>>> >> >> {"id":"54","key":"T","value":null},
>>> >> >> {"id":"75","key":"u","value":null},
>>> >> >> {"id":"55","key":"U","value":null},
>>> >> >> {"id":"76","key":"v","value":null},
>>> >> >> {"id":"56","key":"V","value":null},
>>> >> >> {"id":"77","key":"w","value":null},
>>> >> >> {"id":"57","key":"W","value":null},
>>> >> >> {"id":"78","key":"x","value":null},
>>> >> >> {"id":"58","key":"X","value":null},
>>> >> >> {"id":"79","key":"y","value":null},
>>> >> >> {"id":"59","key":"Y","value":null},
>>> >> >> {"id":"7a","key":"z","value":null},
>>> >> >> {"id":"5a","key":"Z","value":null}
>>> >> >> ]}
>>> >> >>
>>> >> >> I've never seen this sequence before. It's not even EBCDIC :-)
>>> >> >>
>>> >> >> Adding aa into the pot gives:
>>> >> >>
>>> >> >> ...
>>> >> >> {"id":"61","key":"a","value":null},
>>> >> >> {"id":"41","key":"A","value":null},
>>> >> >> {"id":"X","key":"aa","value":null},
>>> >> >> ...
>>> >> >>
>>> >> >> As you say, that is most bizarre.
>>> >> >>
>>> >> >> Cheers,
>>> >> >>
>>> >> >> Brian.
>>> >> >>
>>> >> >
>>> >>
>>> >
>>>
>>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io
