From: Ian Varley
To: "user@hbase.apache.org"
Date: Fri, 8 Jun 2012 10:40:59 -0700
Subject: Re: Collation order of items

Tom, another approach you could take would be to store an ASCII-encoded version of the string as the row key or column qualifier, and then the full UTF-8 string elsewhere (e.g. in the cell value, or even later in the row key). That wouldn't work out the fine sorting (whether "è" sorts before or after "e") but it would solve the gross sorting ("è" would always come before "f"). If you need true UTF-8 collation in the results, you could then implement it as a layer on top of that (in your app, or maybe a co-processor, I'm not sure about the latter). But at least with this approach, you'd be able to take advantage of rowkey ranges in your scans, which would probably make up for any time spent doing a secondary sort.

Ian

On Jun 8, 2012, at 12:34 PM, Tom Brown wrote:

> Storing the bytes as native UTF-16 or UTF-32 will not help. Even
> strings in UTF-8 format can be sorted by their code points when stored
> as bytes. Unfortunately, that's not really useful for collation as
> characters like "è" (U+00E8) should appear between "e" (U+0065) and
> "f" (U+0066), but the code points do not allow this.
>
> Thanks anyway!
>
> --Tom
>
> On Fri, Jun 8, 2012 at 11:14 AM, Stack wrote:
>> On Fri, Jun 8, 2012 at 9:35 AM, Tom Brown wrote:
>>> Is there any way to introduce a different ordering scheme from
>>> the base comparable bytes? My use case is that I am using UTF-8 data
>>> for my keys, and I would like to have scans use UTF-8 collation.
>>>
>>> Could this be done by providing an alternate implementation of
>>> WritableComparable?
>>>
>>> Thanks in advance!
>>>
>>
>> Unfortunately no, Tom. The database is all sorted the same way.
>> Different sorts per table would complicate system interactions (the
>> catalog tables would have to change sort by table).
>> It might be doable but it would take some work.
>>
>> Can you store your data UTF-16 or UTF-32? It's a while since I dealt
>> w/ this stuff but IIRC, their sort order is byte order? (WARNING! I
>> could be way off here).
>>
>> St.Ack
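
For illustration, the approach Ian describes at the top of this message (an ASCII-folded prefix for the gross sort, the full UTF-8 string kept alongside it, and a client-side secondary sort) might be sketched roughly as follows. This is only a sketch under assumptions: the class `AsciiFoldedKey` and its methods are hypothetical and not part of HBase, the 0x00 separator assumes keys never contain U+0000, and the fold only handles characters that decompose into a base letter plus combining marks (letters such as "ø" or "ß" would need an explicit mapping).

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the HBase API. */
final class AsciiFoldedKey {

    /** Decomposes accented letters (NFD) and drops the combining marks, so U+00E8 becomes "e". */
    static String foldToAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Row key bytes: folded form, a 0x00 separator, then the original string as UTF-8. */
    static byte[] rowKey(String original) {
        byte[] folded = foldToAscii(original).getBytes(StandardCharsets.UTF_8);
        byte[] full = original.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[folded.length + 1 + full.length];
        System.arraycopy(folded, 0, key, 0, folded.length);
        key[folded.length] = 0; // separator; assumes keys never contain U+0000
        System.arraycopy(full, 0, key, folded.length + 1, full.length);
        return key;
    }

    /** Client-side secondary sort of values read back from a scan, using a locale-aware collator. */
    static List<String> collate(List<String> values, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(collator);
        return sorted;
    }
}
```

With keys built this way, a row-key range scan over the folded prefix keeps "è" in the same neighbourhood as "e" and ahead of "f"; the fine ordering among rows that share a folded prefix is then left to `collate` in the application.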