couchdb-dev mailing list archives

From Antony Blakey <antony.bla...@gmail.com>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Tue, 23 Jun 2009 15:15:58 GMT

On 23/06/2009, at 11:43 PM, Paul Davis wrote:

> Interesting point. I take this as a pretty clear reason for discarding
> UCA for member ordering. Normalization isn't affected by locale right?
> I haven't seen anything to suggest as such so I assume not.

IIUC each normalization form is strictly functional: the output
depends only on the input string, not on the locale.

IMO, given that ICU provides normalization functions, CouchDB should  
use them in this case, exposing the canonicalisation transformation,  
and a shortcut producing the hash, as a client-accessible endpoint. I  
say this from a 'why not do it right?' perspective.
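
A minimal sketch of the canonicalise-then-hash idea, in Python. It uses
the standard library's unicodedata module as a stand-in for ICU, and
the choice of NFC and SHA-1 is an assumption made for illustration; the
thread does not settle on either. The function names are hypothetical.

```python
import hashlib
import unicodedata


def canonicalise(text: str, form: str = "NFC") -> str:
    """Return the normalized form of text (NFC is an assumed default)."""
    return unicodedata.normalize(form, text)


def canonical_hash(text: str) -> str:
    """Hash the UTF-8 bytes of the normalized text (SHA-1 is an assumption)."""
    return hashlib.sha1(canonicalise(text).encode("utf-8")).hexdigest()


# The same visible string, spelled two ways:
composed = "caf\u00e9"     # U+00E9, precomposed e-acute
decomposed = "cafe\u0301"  # 'e' followed by U+0301 combining acute

# The raw strings differ, but their canonical hashes agree.
same_hash = canonical_hash(composed) == canonical_hash(decomposed)
```

Without the normalization step the two spellings would encode to
different bytes and therefore hash differently.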

Member ordering could be binary, over either the code points (e.g. 32  
bits) or the bytes of the UTF-8 representation. Given the ease of  
creating a UTF-8 iterator, that is probably best. UTF-16 is the most  
common native encoding, but you don't want to do a byte-level  
collation over a UTF-16/32 encoding because the result depends on  
the byte order (endianness).
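
To illustrate, here is a small Python sketch comparing the candidate
binary orderings. The sample keys are made up for the example. UTF-8
byte order agrees with code point order, while byte-wise ordering over
UTF-16 changes with endianness and also disagrees with code point order
once a character outside the BMP is involved.

```python
# Hypothetical member names: ASCII, Latin-1 range, high BMP, non-BMP.
keys = ["\uff5e", "\U0001f600", "a", "\u00e9"]

# Python compares str values by code point.
by_code_point = sorted(keys)

# Byte-wise comparison over different encodings.
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))
by_utf16_be = sorted(keys, key=lambda s: s.encode("utf-16-be"))
by_utf16_le = sorted(keys, key=lambda s: s.encode("utf-16-le"))

utf8_matches_code_points = by_utf8 == by_code_point
utf16_depends_on_endianness = by_utf16_be != by_utf16_le
```

In the big-endian case the surrogate pair for U+1F600 starts with byte
0xD8, so it sorts before U+FF5E (first byte 0xFF) even though its code
point is larger.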

The problem with this is that the canonical form might look bizarre  
for a non-ASCII document, but a canonical collation is by definition  
always going to look wrong to someone. For the current use, as an  
intermediate form destined only for hashing, this doesn't matter anyway.

Having said that, IMO it would be a good i18n feature to be able to set  
the locale of a database, maybe even at the granularity of a view,  
defaulting to the database's locale. The key ordering should respect  
that locale. An option to normalize keys would also be a good idea.  
The reason for setting a locale at the view level is that it might be  
useful to create multiple views with different locales, to present  
different localized result orderings to end users. One immediate issue  
is that the locale would have to be injected into view servers to  
prevent possible weirdness.
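
A rough sketch of the per-view locale idea, in Python. Everything here
is hypothetical: CouchDB has no such setting, the function names are
invented, and the two collation keys are toy stand-ins for what an ICU
collator would provide. They only handle lowercase letters of the
sample keys.

```python
# Toy collation rules, standing in for real ICU collators.
# Swedish places a-ring, a-umlaut and o-umlaut after 'z'.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
# German dictionary order treats umlauts like their base vowels.
GERMAN_FOLD = str.maketrans("\u00e4\u00f6\u00fc", "aou")


def collation_key(locale_name, text):
    """Return a sort key for text under a (toy) locale rule."""
    text = text.lower()
    if locale_name == "sv":
        return [SWEDISH_ALPHABET.index(ch) for ch in text]
    if locale_name == "de":
        return text.translate(GERMAN_FOLD)
    raise ValueError(f"no toy collation for locale {locale_name!r}")


def effective_locale(database_locale, view_locale=None):
    """A view's locale defaults to the database's locale."""
    return view_locale if view_locale is not None else database_locale


def sort_keys(keys, database_locale, view_locale=None):
    """Order view keys under the locale that applies to the view."""
    loc = effective_locale(database_locale, view_locale)
    return sorted(keys, key=lambda k: collation_key(loc, k))


keys = ["zebra", "\u00f6l", "oval"]
german_order = sort_keys(keys, "de")                     # view inherits "de"
swedish_order = sort_keys(keys, "de", view_locale="sv")  # view overrides
```

The same keys come back in a different order for each view, which is
why the view server would need to be told which locale is in effect.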

I think it's easier and better to do these kinds of things on the  
server because you know you have the facilities to do it there (e.g.  
ICU), whereas making it a client issue impedes use of the data by  
different clients.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

On the other side, you have the customer and/or user, and they tend to  
do what we call "automating the pain." They say, "What is it we're  
doing now? How would that look if we automated it?" Whereas, what the  
design process should properly be is one of saying, "What are the  
goals we're trying to accomplish and how can we get rid of all this  
task crap?"
   -- Alan Cooper


