couchdb-dev mailing list archives

From Noah Slater <nsla...@apache.org>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Tue, 23 Jun 2009 01:52:47 GMT
On Mon, Jun 22, 2009 at 04:22:10PM -0400, Paul Davis wrote:
> On Mon, Jun 22, 2009 at 3:32 PM, Noah Slater <nslater@apache.org> wrote:
> >
> >  * calculate the document hash from a canonical binary serialisation
> >
> >  * calculate the document hash from the binary serialisation
[...]
>
>  * calculate the document hash from the deterministic binary serialization

Okay, so we have three options:

  * hash the document, binary identical to what the client sent

  * hash the document, with sorted keys, but binary identical key/value data

  * hash the document, with sorted keys, and Unicode normalised key/value data
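
For concreteness, here is a rough sketch of what the three options might look
like in Python. JSON text input, SHA-1, and NFC as the normalisation form are
all illustrative assumptions on my part, not anything decided in this thread.

    import hashlib
    import json
    import unicodedata

    def hash_option_one(raw_bytes):
        # Option one: hash exactly the bytes the client sent.
        return hashlib.sha1(raw_bytes).hexdigest()

    def _canonical_json(doc):
        # Serialise with sorted keys and no incidental whitespace.
        return json.dumps(doc, sort_keys=True, ensure_ascii=False,
                          separators=(",", ":")).encode("utf-8")

    def hash_option_two(raw_bytes):
        # Option two: sorted keys, key/value data left as the client sent it.
        # (Round-tripping through json.dumps is only an approximation of
        # "binary identical" key/value data.)
        return hashlib.sha1(_canonical_json(json.loads(raw_bytes))).hexdigest()

    def hash_option_three(raw_bytes):
        # Option three: sorted keys plus Unicode normalisation (NFC here)
        # applied to every string, keys and values alike.
        def normalise(value):
            if isinstance(value, str):
                return unicodedata.normalize("NFC", value)
            if isinstance(value, dict):
                return {normalise(k): normalise(v) for k, v in value.items()}
            if isinstance(value, list):
                return [normalise(v) for v in value]
            return value
        doc = normalise(json.loads(raw_bytes))
        return hashlib.sha1(_canonical_json(doc)).hexdigest()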

These options raise an obvious question:

  Could Unicode normalisation affect the sorting of keys with option two?

Anyone know much about collation algorithms?
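
For what it's worth, here is a small probe of that, under the (possibly wrong)
assumption that keys are compared by raw code point rather than by a proper
collation algorithm:

    import unicodedata

    decomposed = "e\u0301"                                  # "e" + combining acute accent
    precomposed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

    # Compared by code point, the decomposed form sorts before "f" while the
    # precomposed form sorts after it, so normalising keys can reorder them.
    assert sorted([decomposed, "f"]) == [decomposed, "f"]
    assert sorted([precomposed, "f"]) == ["f", precomposed]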

> Obviously, regardless of what we choose to implement, safe-writes are kosher
> because the server is the only one ever doing the actual calculation. The
> unicode normalization issue crops up if two different clients write the same
> edit to multiple hosts. If the clients don't use the same normalization scheme
> then we still introduce the spurious conflicts (that would currently be
> introduced no matter what) on replication.

Nope, this is exactly what Unicode normalisation is designed to solve. If two
different clients write ESSENTIALLY the same edit to two different nodes using
two different methods of combining characters, then Unicode normalisation will
see these as the same edit. Without normalisation, the two writes will produce
different document hashes, and will conflict.
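
To make the combining-character case concrete, a toy example (Python and SHA-1
chosen purely for illustration):

    import hashlib
    import unicodedata

    client_a = "caf\u00e9"     # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    client_b = "cafe\u0301"    # decomposed: "e" followed by U+0301 COMBINING ACUTE ACCENT

    # Without normalisation the byte sequences differ, so the hashes differ and
    # replication sees two conflicting edits.
    assert client_a.encode("utf-8") != client_b.encode("utf-8")
    assert (hashlib.sha1(client_a.encode("utf-8")).hexdigest()
            != hashlib.sha1(client_b.encode("utf-8")).hexdigest())

    # After normalising both to the same form (NFC here), the hashes agree.
    norm_a = unicodedata.normalize("NFC", client_a)
    norm_b = unicodedata.normalize("NFC", client_b)
    assert norm_a == norm_b
    assert (hashlib.sha1(norm_a.encode("utf-8")).hexdigest()
            == hashlib.sha1(norm_b.encode("utf-8")).hexdigest())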

Best,

-- 
Noah Slater, http://tumbolia.org/nslater
