couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Tue, 23 Jun 2009 04:46:36 GMT
On Mon, Jun 22, 2009 at 9:52 PM, Noah Slater <nslater@apache.org> wrote:
> On Mon, Jun 22, 2009 at 04:22:10PM -0400, Paul Davis wrote:
>> On Mon, Jun 22, 2009 at 3:32 PM, Noah Slater <nslater@apache.org> wrote:
>> >
>> >  * calculate the document hash from a canonical binary serialisation
>> >
>> >  * calculate the document hash from the binary serialisation
> [...]
>>
>>  * calculate the document hash from the deterministic binary serialization
>
> Okay, so we have three options:
>
>  * hash the document, binary identical to what the client sent
>
>  * hash the document, with sorted keys, but binary identical key/value data
>
>  * hash the document, with sorted keys, and Unicode normalised key/value data
>
> This raises the obvious question:
>
>  Could Unicode normalisation affect the sorting of keys with option two?
>

Noper.

UCA Step 1: Convert input strings to Normalized Form D

http://unicode.org/reports/tr10/#Main_Algorithm
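
To see why, here's a quick sketch in Python (stdlib only; CouchDB is
Erlang, so this is just the principle, not an implementation). NFD
stands in for the full collation key, since UCA derives its keys from
the NFD form:

    import unicodedata

    precomposed = "\u00c7a"   # "Ça" using U+00C7
    decomposed  = "C\u0327a"  # "Ça" using U+0043 U+0327

    def uca_step_1(s):
        # Step 1 of the UCA main algorithm: normalize to Form D.
        return unicodedata.normalize("NFD", s)

    # Different code point sequences...
    assert precomposed != decomposed
    # ...but identical after UCA step 1, so a UCA-based sort orders
    # them identically no matter how the input was normalized.
    assert uca_step_1(precomposed) == uca_step_1(decomposed)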

>> Obviously, regardless of what we choose to implement, safe-writes are kosher
>> because the server is the only one ever doing the actual calculation. The
>> Unicode normalization issue crops up if two different clients write the same
>> edit to multiple hosts. If the clients don't use the same normalization scheme
>> then we still introduce spurious conflicts (which would currently be
>> introduced no matter what) on replication.
>
> Nope, this is exactly what Unicode normalisation is designed to solve. If two
> different clients write ESSENTIALLY the same edit to two different nodes using
> two different methods of combining characters, then Unicode normalisation will
> see these as the same edit. Without normalisation, these two will have a
> different document hash, and will conflict.
>

Slow down a bit before you dismiss this out of hand. That is precisely
what I'm saying: without normalization, we might not prevent conflicts
in every case that we could. The fun part of this whole exercise,
though, is that we can each have our own awesomely independent
interpretation of what "essentially the same edit" means. The explicit
question we are considering is this:

Is the code point U+00C7 equal to the code point sequence U+0043 U+0327?

The two answers I see are:

Yes - Unicode defines U+00C7 as a canonical equivalent of U+0043 U+0327.
No - The UTF-8 byte representations of the two sequences are different.
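
To make the two answers concrete, here's the question in runnable form
(Python, stdlib only, purely illustrative):

    import unicodedata

    composed  = "\u00c7"        # U+00C7
    combining = "\u0043\u0327"  # U+0043 U+0327

    # The "No" answer: the raw UTF-8 byte sequences differ.
    composed.encode("utf-8")    # b'\xc3\x87'
    combining.encode("utf-8")   # b'C\xcc\xa7'

    # The "Yes" answer: after canonical normalization they are equal.
    assert (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", combining))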

Reasons for implementing according to Yes:

  * We avoid the maximum number of spurious conflicts in the
safe-writes example without requiring all clients to implement Unicode
normalization.

Reasons for implementing according to No:

  * Anyone implementing the deterministic revision algorithm MUST do
Unicode normalization.
  * Clients are still free to send normalized Unicode, thus receiving
the previously described benefits.
  * There may be a use case in which it is desirable for the different
code point combinations to result in a conflict.
  * Performance (i.e., the same reason _all_docs no longer uses UCA).
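
To ground the "No" position, here's a hypothetical sketch of a
deterministic revision hash over a sorted-key serialization that takes
the document's bytes as-is, with no normalization. The format (compact
JSON, sorted keys, UTF-8) and the use of MD5 are assumptions for
illustration, not CouchDB's actual revision algorithm:

    import hashlib
    import json

    def deterministic_rev(doc):
        # Sorted keys + compact separators make the serialization
        # deterministic; no Unicode normalization is applied.
        canonical = json.dumps(doc, sort_keys=True,
                               separators=(",", ":"),
                               ensure_ascii=False).encode("utf-8")
        return hashlib.md5(canonical).hexdigest()

    # Two "essentially the same" documents from clients that
    # normalize differently:
    doc_a = {"city": "\u00c7orum"}   # precomposed form
    doc_b = {"city": "C\u0327orum"}  # combining form

    # Under "No", these hash differently and conflict on replication.
    assert deterministic_rev(doc_a) != deterministic_rev(doc_b)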

> Best,
>
> --
> Noah Slater, http://tumbolia.org/nslater
>

As an aside, as I read the Unicode FAQ at [1], we are not required to
recognize the canonical equivalence of code point sequences.

[1] http://unicode.org/faq/char_combmark.html#8

HTH,
Paul Davis
