From: Antony Blakey
To: dev@couchdb.apache.org
Reply-To: dev@couchdb.apache.org
Subject: Re: Unicode normalization (was Re: The 1.0 Thread)
Date: Wed, 24 Jun 2009 00:45:58 +0930
References: <20090622143637.GA30864@tumbolia.org> <20090622184640.GB7936@tumbolia.org> <20090622193208.GD7936@tumbolia.org> <20090623015247.GF7936@tumbolia.org> <9A2AA6F1-779E-446A-B055-7F1231E8977B@gmail.com>

On 23/06/2009, at 11:43 PM, Paul Davis wrote:

> Interesting point. I take this as a pretty clear reason for discarding
> UCA for member ordering. Normalization isn't affected by locale right?
> I haven't seen anything to suggest as such so I assume not.

IIUC each normalization form is strictly functional. IMO, given that ICU provides normalization functions, CouchDB should use them in this case, exposing the canonicalisation transformation, and a shortcut producing the hash, as a client-accessible endpoint. I say this from a 'why not do it right?' perspective.

Member ordering could be binary, over either the code points (e.g. 32 bits) or the bytes of the UTF-8 representation.
Given the ease of creating a UTF-8 iterator, that is probably best. UTF-16 is the most common native encoding, but you don't want to do a byte-level collation over a UTF-16/32 encoding because the result is dependent on byte ordering.

The problem with this is that the canonical form might look bizarre for a non-ASCII document, but a canonical collation is by definition always going to look wrong to someone. For the current use, as an intermediate form destined only for hashing, this doesn't matter anyway.

Having said that, IMO it would be a good i18n feature to be able to set the locale of a database, maybe even at the granularity of a view, defaulting to the database's locale. The key ordering should respect that locale. An option to normalize keys would also be a good idea. The reason for setting a locale at the view level is that it might be useful to create multiple views with different locales, to present different localized result orderings to end users. One immediate issue is that the locale would have to be injected into view servers to prevent possible weirdness.

I think it's easier and better to do these kinds of things on the server because you know you have the facilities to do it there (e.g. ICU), whereas making it a client issue impedes use of the data by different clients.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

On the other side, you have the customer and/or user, and they tend to do what we call "automating the pain." They say, "What is it we're doing now? How would that look if we automated it?" Whereas, what the design process should properly be is one of saying, "What are the goals we're trying to accomplish and how can we get rid of all this task crap?"
   -- Alan Cooper
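
To make the canonicalise-then-hash idea in the message above concrete, here is a minimal sketch. It is not CouchDB code: the function names are hypothetical, Python's standard `unicodedata` module stands in for ICU, and the choice of NFC and SHA-256 is an assumption (the message names neither a normalization form nor a hash). Canonical number formatting and keys that collide after normalization are left out of scope.

```python
import hashlib
import json
import unicodedata


def normalize(value, form="NFC"):
    """Recursively apply one Unicode normalization form to every string.

    Assumption: NFC. Keys that become equal after normalization would
    silently collapse here; a real implementation would have to reject them.
    """
    if isinstance(value, str):
        return unicodedata.normalize(form, value)
    if isinstance(value, list):
        return [normalize(v, form) for v in value]
    if isinstance(value, dict):
        return {normalize(k, form): normalize(v, form) for k, v in value.items()}
    return value


def canonical_bytes(doc, form="NFC"):
    """Serialize with members ordered by the UTF-8 bytes of their names."""
    def emit(value):
        if isinstance(value, dict):
            items = sorted(value.items(), key=lambda kv: kv[0].encode("utf-8"))
            body = ",".join(
                json.dumps(k, ensure_ascii=False) + ":" + emit(v) for k, v in items
            )
            return "{" + body + "}"
        if isinstance(value, list):
            return "[" + ",".join(emit(v) for v in value) + "]"
        return json.dumps(value, ensure_ascii=False)

    return emit(normalize(doc, form)).encode("utf-8")


def canonical_hash(doc):
    """Hash of the canonical form (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(canonical_bytes(doc)).hexdigest()


# Same document, different member order and different composition of "é".
a = {"caf\u00e9": 1, "b": ["e\u0301"]}
b = {"b": ["\u00e9"], "cafe\u0301": 1}
same_hash = canonical_hash(a) == canonical_hash(b)

# Byte-level ordering over UTF-16 depends on byte order; UTF-8 does not,
# and it agrees with plain code point ordering.
keys = ["\u0100", "\u00ff"]
by_utf16_be = sorted(keys, key=lambda s: s.encode("utf-16-be"))
by_utf16_le = sorted(keys, key=lambda s: s.encode("utf-16-le"))
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))
by_code_point = sorted(keys)  # Python compares str by code point
```

With these inputs the two documents hash identically, the two UTF-16 orderings disagree with each other, and the UTF-8 byte ordering matches the code point ordering, which is the reason the message gives for preferring a UTF-8 byte comparison for member ordering.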