couchdb-dev mailing list archives

From Antony Blakey <antony.bla...@gmail.com>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Mon, 22 Jun 2009 15:19:32 GMT

On 23/06/2009, at 12:06 AM, Noah Slater wrote:

> On Sun, Jun 21, 2009 at 11:21:00PM -0700, Chris Anderson wrote:
>> My gut reaction is that normalizing strings using NFC [1] is not appropriate
>> for a database. Here's why we should treat strings as binary and not worry
>> about unicode normalization at all:
> [...]
>> First of all, I'm certain we can't require that all input already be NFC
>> normalized.
> [...]
>> Secondly, we're a database, so I find highly suspicious the notion that we
>> should auto-normalize user input on-the-quiet.
> [...]
>> So we can't require normalized input and we can't auto-normalize.
>
> CouchDB would create a canonicalised copy of the document while creating the
> document hash. There is no reason why CouchDB, or the clients, should worry
> about canonicalising the actual documents.
>
>> Where does this leave us?
>
> Canonicalisation is a temporary step, so there are no problems.

+1 to those two points.
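
To make the "canonicalise only while hashing" idea concrete, here is a
minimal sketch in Python. It is not CouchDB's implementation: the function
names, the choice of SHA-256, and the sorted-key compact JSON encoding are
all assumptions made for illustration. The point it shows is that the stored
document is never modified; only a temporary copy is NFC-normalised before
it is hashed.

```python
import hashlib
import json
import unicodedata


def nfc(value):
    """Return a copy of a decoded JSON value with every string NFC-normalised."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [nfc(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings too, so they are normalised as well.
        return {nfc(key): nfc(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


def canonical_hash(doc):
    """Hash a temporary canonical copy of doc; doc itself is left untouched.

    Assumption: sorted keys, compact separators and SHA-256 stand in for
    whatever canonical encoding and hash function would really be chosen.
    """
    canonical = json.dumps(
        nfc(doc), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same visible text in two different Unicode forms.
composed = {"name": "caf\u00e9"}      # e-acute as a single code point
decomposed = {"name": "cafe\u0301"}   # "e" followed by a combining acute
same_hash = canonical_hash(composed) == canonical_hash(decomposed)
```

The two documents differ byte for byte, yet they hash identically, and
neither is rewritten in the process. One caveat of this sketch: two keys
that normalise to the same string would collapse into one in the temporary
copy.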

>>> Unicode normalisation is an issue for clients because it requires they have
>>> access to a Unicode NFC function.
>
> Why would clients need to worry about this? CouchDB is creating the hashes.

At the moment, sure, but I was anticipating cases where the canonical
form, or a hash thereof, would creep into other contexts, i.e. once you
have the facility, who knows what you might want to do with it. OTOH,
this could be dealt with via a canonicalisation service, e.g. POST JSON
payload(s) and get back hashes of the canonical form(s) (or the forms
themselves), which means that systems without access to Unicode
normalisation can still work with future facilities.
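
A rough sketch of what such a canonicalisation service could look like,
using only the Python standard library. Everything here is hypothetical: no
such endpoint exists in CouchDB, and the request shape (a JSON array of
payloads), the response fields ("canonical", "hash"), SHA-256, and the
sorted-key encoding are assumptions made for illustration.

```python
import hashlib
import json
import unicodedata
from http.server import BaseHTTPRequestHandler, HTTPServer


def nfc(value):
    """Return a copy of a decoded JSON value with every string NFC-normalised."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [nfc(item) for item in value]
    if isinstance(value, dict):
        return {nfc(key): nfc(item) for key, item in value.items()}
    return value


def canonical_form(doc):
    """Assumed canonical encoding: NFC strings, sorted keys, compact JSON."""
    return json.dumps(
        nfc(doc), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def canonicalise_payloads(body):
    """Take a JSON array of payloads (bytes); return their forms and hashes."""
    results = []
    for doc in json.loads(body.decode("utf-8")):
        form = canonical_form(doc)
        digest = hashlib.sha256(form.encode("utf-8")).hexdigest()
        results.append({"canonical": form, "hash": digest})
    return json.dumps(results, ensure_ascii=False).encode("utf-8")


class CanonicaliseHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint: POST a JSON array, receive a JSON array back."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        reply = canonicalise_payloads(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def serve(port=8000):
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), CanonicaliseHandler).serve_forever()
```

A client with no NFC function of its own would POST its payloads and compare
the returned hashes, so all of the Unicode work stays on the server side.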

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

The greatest challenge to any thinker is stating the problem in a way  
that will allow a solution
   -- Bertrand Russell

