couchdb-dev mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Mon, 22 Jun 2009 16:21:44 GMT
On Mon, Jun 22, 2009 at 7:36 AM, Noah Slater<nslater@apache.org> wrote:
> On Sun, Jun 21, 2009 at 11:21:00PM -0700, Chris Anderson wrote:
>> My gut reaction is that normalizing strings using NFC [1] is not appropriate
>> for a database. Here's why we should treat strings as binary and not worry
>> about unicode normalization at all:
> [...]
>> First of all, I'm certain we can't require that all input already be NFC
>> normalized.
> [...]
>> Secondly, we're a database, so I find highly suspicious the notion that we
>> should auto-normalize user input on-the-quiet.
> [...]
>> So we can't require normalized input and we can't auto-normalize.
>
> CouchDB would create a canonicalised copy of the document while creating the
> document hash. There is no reason why CouchDB, or the clients, should worry
> about canonicalising the actual documents.
>
>> Where does this leave us?
>
> Canonicalisation is a temporary step, so there are no problems.
>

Works for me, then. Hadn't really considered that we won't be saving
the canonicalized versions, just using them as input to the hash
function.
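
A minimal Python sketch of that idea, under stated assumptions: the
helper names `nfc` and `doc_hash`, the choice of SHA-256, and the
sorted-key JSON serialization are all illustrative, not what CouchDB
actually does. The point is only that the normalized copy is thrown
away after hashing and the stored document keeps its original bytes.

```python
import hashlib
import json
import unicodedata


def nfc(value):
    """Return a copy of a JSON value with every string NFC-normalized."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [nfc(item) for item in value]
    if isinstance(value, dict):
        return {nfc(key): nfc(val) for key, val in value.items()}
    return value  # numbers, booleans, None are left alone


def doc_hash(doc):
    """Hash a temporary canonical copy; the caller's doc is not modified.

    SHA-256 and sorted-key JSON are assumptions made for this sketch.
    """
    canonical = json.dumps(
        nfc(doc), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same text, two encodings: precomposed U+00E9 vs. "e" + combining U+0301.
composed = {"name": "caf\u00e9"}
decomposed = {"name": "cafe\u0301"}

same_hash = doc_hash(composed) == doc_hash(decomposed)
```

Both documents hash identically, while `decomposed` still holds the
decomposed string it was given.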

My larger point is that normalization is basically an optimization.
Occasionally getting the hash wrong (for whatever reason) is just
going to result in spurious conflicts, which aren't critical errors,
just an annoyance, as the application will sometimes have to repair
them. Presumably you'll already be repairing conflicts, so fixing the
spurious ones (especially as they are so rare) is not a big cost. It
should also be easy: the application can just pick any version of the
doc, with no ill effects.
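
The repair step described here could look roughly like the following
Python sketch. It assumes the application already has the conflicting
revisions in hand as parsed JSON; `nfc`, `is_spurious` and `resolve`
are hypothetical names, not part of any CouchDB client API.

```python
import unicodedata


def nfc(value):
    """Return a copy of a JSON value with every string NFC-normalized."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [nfc(item) for item in value]
    if isinstance(value, dict):
        return {nfc(key): nfc(val) for key, val in value.items()}
    return value


def is_spurious(rev_a, rev_b):
    """True if two revisions differ only in Unicode normalization."""
    return nfc(rev_a) == nfc(rev_b)


def resolve(revisions):
    """Pick a winner when every revision is equivalent after NFC.

    Returns None for a real conflict, which the application must
    still repair with its own logic.
    """
    first = revisions[0]
    if all(is_spurious(first, other) for other in revisions[1:]):
        return first  # any revision would do; take the first
    return None


composed = {"name": "caf\u00e9"}
decomposed = {"name": "cafe\u0301"}
winner = resolve([composed, decomposed])
real_conflict = resolve([{"name": "a"}, {"name": "b"}])
```

Taking the first revision is arbitrary; since the versions are
equivalent, any of them is an acceptable winner.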

I think this makes the NFC stuff a nice-to-have, not a necessity.

Chris

>> > Unicode normalisation is an issue for clients because it requires they have
>> > access to a Unicode NFC function.
>
> Why would clients need to worry about this? CouchDB is creating the hashes.
>
> Best,
>
> --
> Noah Slater, http://tumbolia.org/nslater
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io
