couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Mon, 22 Jun 2009 19:15:24 GMT
On Mon, Jun 22, 2009 at 2:46 PM, Noah Slater <nslater@apache.org> wrote:
> On Mon, Jun 22, 2009 at 09:21:44AM -0700, Chris Anderson wrote:
>> My larger point is that normalization is basically an optimization.
>
> Optimisation of what? Unicode normalisation should be considered absolutely
> critical to any canonical form. If we want to use some proprietary algorithm for
> determining a document hash, then fine - but if we are advertising that it is
> calculated from some canonical serialisation, then Unicode normalisation really
> is a base requirement for that.
>

I think he means optimization insofar as the deterministic revision
algorithm still works regardless. If clients want to avoid spurious
conflicts, they can send normalized unicode themselves. In other
words, if we just write the algorithm to not care about normalization,
it solves lots of cases for free, and the cases that aren't solved can
be handled by the client if it so desires.
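
Something like this is what I have in mind (untested Python, names
made up), just to show the shape of the naive route:

    import hashlib
    import unicodedata

    def rev_hash(body_bytes):
        # Naive deterministic revision: hash exactly the bytes the client sent.
        return hashlib.md5(body_bytes).hexdigest()

    def nfc_bytes(s):
        # Client-side normalization, done before the document is sent.
        return unicodedata.normalize("NFC", s).encode("utf-8")

    composed = "caf\u00e9"      # "cafe" with precomposed U+00E9
    decomposed = "cafe\u0301"   # "cafe" with combining acute U+0301

    # Without client-side normalization the two renderings hash differently,
    # so the "same" edit shows up as a spurious conflict.
    assert rev_hash(composed.encode("utf-8")) != rev_hash(decomposed.encode("utf-8"))

    # A client that normalizes to NFC before sending sidesteps the issue.
    assert rev_hash(nfc_bytes(composed)) == rev_hash(nfc_bytes(decomposed))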

>> Occasionally getting the hash wrong (for whatever reason) is just
>> going to result in spurious conflicts, which aren't critical errors,
>> just an annoyance as the application will sometimes have to repair the
>> conflicts. Presumably you'll already be repairing conflicts, so fixing
>> the spurious ones (especially as they are so rare) is not a big cost,
>> and should be easy as the application can
>> just pick a random version of the doc, with no ill effects.
>
> As native English speakers, it's fairly easy for us to assume that most
> documents are composed of some simple Latin character subset. As soon as you
> start working with languages that make heavy use of combining characters -
> accents, diacritical marks, etc. - then this character normalisation becomes a
> major issue in a multi-user environment.
>
> Consider the following code point sequences:
>
>  U+006B U+014D U+0061 U+006E
>
>  U+006B U+006F U+0304 U+0061 U+006E
>
> Both of these render as "kōan", yet the underlying code points (and hence the
> bytes) depend on my input method.
>

The question is what we are trying to accomplish. We could sit around
and argue about whether we're creating revisions of sequences of code
points or sequences of characters. I'm more than happy to say
sequences of bytes (and thus code points), because that applies to
most clients and does not prevent the algorithm from working on
normalized unicode.
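
To make the "kōan" example concrete (untested Python, just
illustrating the code point difference):

    import unicodedata

    a = "\u006b\u014d\u0061\u006e"        # k, precomposed o-macron (U+014D), a, n
    b = "\u006b\u006f\u0304\u0061\u006e"  # k, o, combining macron (U+0304), a, n

    print(a, b)               # both render as "kōan"
    print(a == b)             # False: different code point sequences
    print(a.encode("utf-8"))  # b'k\xc5\x8dan'
    print(b.encode("utf-8"))  # b'ko\xcc\x84an'

    # NFC composes o + U+0304 into U+014D, so the two forms become identical.
    print(unicodedata.normalize("NFC", b) == a)   # True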

>> I think this makes the NFC stuff a nice-to-have, not a necessity.
>
> I disagree strongly.
>
> Best,
>
> --
> Noah Slater, http://tumbolia.org/nslater
>

The thing that worries me most about normalization is that we could
end up causing more problems by being complete than if we just took
the naive route. Requiring that a client have an implementation of
normalization byte-identical to the one CouchDB uses, rather than just
being internally consistent, seems like it could trip up a lot of
clients. Granted, there are plenty of other corners to get snagged on,
so perhaps for the time being we should just say screw it, implement
something, and see which clients have problems. As for whether
normalization becomes part of the step, I could settle that with a
coin toss at this point, assuming our ICU library dependency can do it.
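
If we did fold normalization into the revision step, I'd picture
something roughly like this (a sketch in Python with unicodedata
standing in for ICU; the helper names are invented, and every client
would still need a byte-identical normalizer to predict the revision):

    import hashlib
    import json
    import unicodedata

    def normalize_strings(value):
        # Recursively NFC-normalize every string in the decoded JSON document.
        if isinstance(value, str):
            return unicodedata.normalize("NFC", value)
        if isinstance(value, list):
            return [normalize_strings(v) for v in value]
        if isinstance(value, dict):
            return {normalize_strings(k): normalize_strings(v)
                    for k, v in value.items()}
        return value

    def rev_hash(doc):
        # Hash one canonical serialization of the normalized document.
        canonical = json.dumps(normalize_strings(doc), sort_keys=True,
                               separators=(",", ":"), ensure_ascii=False)
        return hashlib.md5(canonical.encode("utf-8")).hexdigest()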

Either way, enough hand waving for one day.
