couchdb-dev mailing list archives

From Noah Slater <nsla...@apache.org>
Subject Re: Unicode normalization (was Re: The 1.0 Thread)
Date Mon, 22 Jun 2009 18:46:40 GMT
On Mon, Jun 22, 2009 at 09:21:44AM -0700, Chris Anderson wrote:
> My larger point is that normalization is basically an optimization.

Optimisation of what? Unicode normalisation should be considered absolutely
critical to any canonical form. If we want to use some proprietary algorithm for
determining a document hash, then fine - but if we advertise that the hash is
calculated from a canonical serialisation, then Unicode normalisation really is
a base requirement for that.
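
To make the point concrete, here is a minimal sketch in Python (illustrative
only; the doc_hash helper and the choice of SHA-1 are my own inventions, not
anything CouchDB does) showing two equivalent strings producing two different
hashes unless they are normalised first:

  import hashlib
  import unicodedata

  def doc_hash(text, normalise=False):
      # Hash the UTF-8 bytes of a serialised value; optionally
      # normalise to NFC first so equivalent text hashes identically.
      if normalise:
          text = unicodedata.normalize("NFC", text)
      return hashlib.sha1(text.encode("utf-8")).hexdigest()

  cafe_nfc = "caf\u00E9"   # 'e' with acute as one precomposed code point
  cafe_nfd = "cafe\u0301"  # 'e' followed by a combining acute accent

  print(doc_hash(cafe_nfc) == doc_hash(cafe_nfd))  # False: one value, two hashes
  print(doc_hash(cafe_nfc, normalise=True)
        == doc_hash(cafe_nfd, normalise=True))     # True once normalised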

> Occasionally getting the hash wrong (for whatever reason) is just
> going to result in spurious conflicts, which aren't critical errors,
> just an annoyance as the application will sometimes have to repair the
> conflicts. Presumably you'll already be repairing conflicts, so fixing
> ones that result from spurious conflicts (especially as they are so
> rare) is not a big cost, and should be easy as the application can
> just pick a random version of the doc, with no ill effects.

As native English speakers, we find it fairly easy to assume that most
documents are composed of some simple Latin character subset. As soon as you
start working with languages that make heavy use of combining characters -
accents, diacritical marks, and so on - normalisation becomes a major issue in
a multi-user environment.

Consider the following code point sequences:

  U+006B U+014D U+0061 U+006E

  U+006B U+006F U+0304 U+0061 U+006E

Both of these render as "kōan", yet which sequence is produced depends on the
input method.
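
A quick Python demonstration (illustrative; Python's unicodedata module simply
stands in for whatever Unicode library we would actually use) shows that the
two sequences compare unequal until both are normalised to NFC:

  import unicodedata

  precomposed = "\u006B\u014D\u0061\u006E"        # k, ō, a, n
  decomposed  = "\u006B\u006F\u0304\u0061\u006E"  # k, o, combining macron, a, n

  print(precomposed, decomposed)    # both display as kōan
  print(precomposed == decomposed)  # False: different code points
  print(unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", decomposed))  # True after NFC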

> I think this makes the NFC stuff a nice-to-have, not a necessity.

I disagree strongly.

Best,

-- 
Noah Slater, http://tumbolia.org/nslater
