Date: Tue, 23 Jun 2009 12:27:59 -0400
Subject: Re: Unicode normalization (was Re: The 1.0 Thread)
From: Paul Davis
To: dev@couchdb.apache.org

On Tue, Jun 23, 2009 at 11:15 AM, Antony Blakey wrote:
>
> On 23/06/2009, at 11:43 PM, Paul Davis wrote:
>
>> Interesting point. I take this as a pretty clear reason for discarding
>> UCA for member ordering. Normalization isn't affected by locale right?
>> I haven't seen anything to suggest as such so I assume not.
>
> IIUC each normalization form is strictly functional.
>
> IMO, given that ICU provides normalization functions, CouchDB should use
> them in this case, exposing the canonicalisation transformation, and a
> shortcut producing the hash, as a client-accessible endpoint. I say this
> from a 'why not do it right?' perspective.
>

Regardless of whether normalization is included in the deterministic
revision algorithm I think that normalization end points, or perhaps
query string parameters for performing normalization on incoming data
would be a good thing.

I reject the assertion that this is a question of right or wrong. I
see valid arguments for and against including normalization as part of
the deterministic revision algorithm. Considering a query parameter to
process incoming data conditionally I'm leaning towards not including
it, but I'm always open to being convinced otherwise.

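As a rough illustration of the optional-normalization idea (a sketch
only, not CouchDB's revision algorithm): the hypothetical
`revision_hash` below serializes a document with its members in sorted
order and hashes the resulting UTF-8 bytes, applying normalization only
when `normalize=True`. The function names, the choice of NFC and
SHA-256, and the use of Python's `unicodedata` module in place of ICU
are all assumptions made for the example.

```python
import hashlib
import json
import unicodedata


def normalize_strings(value, form="NFC"):
    """Recursively apply a Unicode normalization form to every string
    (member names included) in a JSON-like value."""
    if isinstance(value, str):
        return unicodedata.normalize(form, value)
    if isinstance(value, list):
        return [normalize_strings(item, form) for item in value]
    if isinstance(value, dict):
        # Caveat of this sketch: two member names that normalize to the
        # same string would be silently merged here.
        return {normalize_strings(k, form): normalize_strings(v, form)
                for k, v in value.items()}
    return value


def revision_hash(doc, normalize=False):
    """Hash a deterministic serialization of doc (hypothetical helper).

    normalize stands in for a query-string parameter: when it is False
    the incoming data is hashed exactly as received.
    """
    if normalize:
        doc = normalize_strings(doc)
    # sort_keys orders members by code point at every nesting level;
    # fixed separators keep whitespace out of the serialization.
    serialized = json.dumps(doc, sort_keys=True, ensure_ascii=False,
                            separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


composed = {"name": "caf\u00e9"}      # e-acute as a single code point
decomposed = {"name": "cafe\u0301"}   # 'e' followed by a combining acute
same_without = revision_hash(composed) == revision_hash(decomposed)
same_with = (revision_hash(composed, normalize=True)
             == revision_hash(decomposed, normalize=True))
```

Without normalization the two canonically equivalent documents hash
differently; with it they hash the same. That difference is the
trade-off being weighed above.
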
> Member ordering could be binary, over either the code points (e.g. 32 bits)
> or the bytes of the UTF-8 representation. Given the ease of creating a UTF-8
> iterator that is probably best. UTF-16 is the most common native encoding,
> but you don't want to do a byte-level collation over a UTF-16/32 encoding
> because the result is dependent on byte ordering.
>

Are there byte order semantics for UTF-8? Or other cases where sorting
by UTF-8 binary representation is going to cause issues? Remember that
the end goal is to create deterministic serializations for hashing.
Sorting by code point doesn't seem like it'd get us anything other
than added complexity.

> The problem with this is that the canonical form might look bizarre for a
> non-ASCII document, but a canonical collation is by definition always going
> to look wrong to someone. For the current use, as an intermediate form
> destined only for hashing, this doesn't matter anyway.
>

Yep.

> Having said that, IMO it would be a good i18n feature to be able to set the
> locale of a database, maybe even at the granularity of a view, defaulting to
> the database's locale. The key ordering should respect that locale. An
> option to normalize keys would also be a good idea. The reason for setting a
> locale at the view level is that it might be useful to create multiple views
> with different locales, to present different localized result orderings to
> end users. One immediate issue is that the locale would have to be injected
> into view servers to prevent possible weirdness.
>
> I think it's easier and better to do these kind of things on the server
> because you know you have the facilities to do it there (e.g. ICU), whereas
> making it a client issue impedes use of the data by different clients.
>

Patches welcome.

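On the earlier question about UTF-8 byte order, a small Python sketch
may help. UTF-8 is defined as a byte sequence with no big- or
little-endian variants, and comparing UTF-8 strings byte by byte gives
the same order as comparing them by code point. UTF-16 has neither
property. The sample member names below are made up for illustration.

```python
# Hypothetical member names: ASCII, Latin-1 range, a BMP character
# above the surrogate block, and one supplementary-plane character.
keys = ["\U0001F600", "\uFF5E", "z", "\u00e9", "a"]

# Python 3 compares str values code point by code point.
by_code_point = sorted(keys)

# Byte-wise comparison of each encoded form.
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))
by_utf16_be = sorted(keys, key=lambda s: s.encode("utf-16-be"))
by_utf16_le = sorted(keys, key=lambda s: s.encode("utf-16-le"))

utf8_matches_code_points = by_utf8 == by_code_point
utf16_depends_on_byte_order = by_utf16_be != by_utf16_le
utf16_be_matches_code_points = by_utf16_be == by_code_point
```

For these keys the UTF-8 order and the code point order are identical,
the two UTF-16 byte orders disagree with each other, and even the
big-endian form departs from code point order because surrogate pairs
(0xD800-0xDFFF) sort below U+E000-U+FFFF. So for the purpose of a
deterministic serialization, sorting members by their UTF-8 bytes
should be enough.
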
Paul Davis

> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> On the other side, you have the customer and/or user, and they tend to do
> what we call "automating the pain." They say, "What is it we're doing now?
> How would that look if we automated it?" Whereas, what the design process
> should properly be is one of saying, "What are the goals we're trying to
> accomplish and how can we get rid of all this task crap?"
>  -- Alan Cooper