couchdb-dev mailing list archives

From Paul Davis
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Tue, 23 Jun 2009 16:27:59 GMT
On Tue, Jun 23, 2009 at 11:15 AM, Antony Blakey wrote:
> On 23/06/2009, at 11:43 PM, Paul Davis wrote:
>> Interesting point. I take this as a pretty clear reason for discarding
>> UCA for member ordering. Normalization isn't affected by locale, right?
>> I haven't seen anything to suggest that it is, so I assume not.
> IIUC each normalization form is strictly functional.
> IMO, given that ICU provides normalization functions, CouchDB should use
> them in this case, exposing the canonicalisation transformation, and a
> shortcut producing the hash, as a client-accessible endpoint. I say this
> from a 'why not do it right?' perspective.

Regardless of whether normalization is included in the deterministic
revision algorithm, I think that normalization endpoints, or perhaps
query string parameters for performing normalization on incoming data,
would be a good thing.

I reject the assertion that this is a question of right or wrong. I
see valid arguments for and against including normalization as part of
the deterministic revision algorithm. Given that a query parameter could
normalize incoming data conditionally, I'm leaning towards not including
it, but I'm always open to being convinced otherwise.
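
To make the trade-off concrete, here is a minimal sketch of what
normalizing before hashing changes. It uses Python's standard
unicodedata module rather than ICU, and the function name rev_digest
and the choice of SHA-256 are assumptions for illustration only, not
anything CouchDB does.

```python
import hashlib
import unicodedata


def rev_digest(text, form=None):
    """Hash text, optionally normalizing it first (form is e.g. "NFC")."""
    if form is not None:
        text = unicodedata.normalize(form, text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


composed = "caf\u00e9"     # "é" as one code point, U+00E9
decomposed = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the two spellings hash differently.
same_raw = rev_digest(composed) == rev_digest(decomposed)
# With NFC applied first they hash identically.
same_nfc = rev_digest(composed, "NFC") == rev_digest(decomposed, "NFC")
print(same_raw, same_nfc)
```

Leaving normalization out of the revision algorithm means the two
spellings are treated as different content unless the client (or an
optional query parameter) normalizes first.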

> Member ordering could be binary, over either the code points (e.g. 32 bits)
> or the bytes of the UTF-8 representation. Given the ease of creating a UTF-8
> iterator, that is probably best. UTF-16 is the most common native encoding,
> but you don't want to do a byte-level collation over a UTF-16/32 encoding
> because the result is dependent on byte ordering.

Are there byte order semantics for UTF-8? Or other cases where sorting
by UTF-8 binary representation is going to cause issues? Remember that
the end goal is to create deterministic serializations for hashing.
Sorting by code point doesn't seem like it'd get us anything other
than added complexity.
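
For what it's worth, a quick sketch of the ordering question. UTF-8 is
a byte stream with no endianness, and sorting by its bytes gives the
same order as sorting by code point, whereas byte-level sorting of
UTF-16 depends on byte order. The sample keys and the canonical_digest
helper (flat objects only, SHA-256) are assumptions for illustration.

```python
import hashlib
import json

keys = ["z", "\uff5e", "a", "\U0001f600", "\u00e9"]

# Python compares str values code point by code point.
by_code_point = sorted(keys)
by_utf8 = sorted(keys, key=lambda k: k.encode("utf-8"))
by_utf16_be = sorted(keys, key=lambda k: k.encode("utf-16-be"))
by_utf16_le = sorted(keys, key=lambda k: k.encode("utf-16-le"))

print(by_utf8 == by_code_point)    # UTF-8 byte order matches code point order
print(by_utf16_be == by_utf16_le)  # UTF-16 byte order depends on endianness


def canonical_digest(obj):
    """Hash a flat dict with members ordered by the UTF-8 bytes of their names."""
    ordered = dict(sorted(obj.items(), key=lambda kv: kv[0].encode("utf-8")))
    payload = json.dumps(ordered, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this, canonical_digest({"b": 1, "a": 2}) and
canonical_digest({"a": 2, "b": 1}) produce the same digest, which is
all the hashing use case needs.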

> The problem with this is that the canonical form might look bizarre for a
> non-ASCII document, but a canonical collation is by definition always going
> to look wrong to someone. For the current use, as an intermediate form
> destined only for hashing, this doesn't matter anyway.


> Having said that, IMO it would be a good i18n feature to be able to set the
> locale of a database, maybe even at the granularity of a view, defaulting to
> the database's locale. The key ordering should respect that locale. An
> option to normalize keys would also be a good idea. The reason for setting a
> locale at the view level is that it might be useful to create multiple views
> with different locales, to present different localized result orderings to
> end users. One immediate issue is that the locale would have to be injected
> into view servers to prevent possible weirdness.
> I think it's easier and better to do these kinds of things on the server
> because you know you have the facilities to do it there (e.g. ICU), whereas
> making it a client issue impedes use of the data by different clients.

Patches welcome.

Paul Davis

> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
> On the other side, you have the customer and/or user, and they tend to do
> what we call "automating the pain." They say, "What is it we're doing now?
> How would that look if we automated it?" Whereas, what the design process
> should properly be is one of saying, "What are the goals we're trying to
> accomplish and how can we get rid of all this task crap?"
>  -- Alan Cooper
