Return-Path: Delivered-To: apmail-incubator-cassandra-dev-archive@minotaur.apache.org Received: (qmail 17707 invoked from network); 2 Apr 2009 03:44:29 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 2 Apr 2009 03:44:29 -0000 Received: (qmail 60626 invoked by uid 500); 2 Apr 2009 03:44:29 -0000 Delivered-To: apmail-incubator-cassandra-dev-archive@incubator.apache.org Received: (qmail 60582 invoked by uid 500); 2 Apr 2009 03:44:29 -0000 Mailing-List: contact cassandra-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: cassandra-dev@incubator.apache.org Delivered-To: mailing list cassandra-dev@incubator.apache.org Received: (qmail 60572 invoked by uid 99); 2 Apr 2009 03:44:29 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 02 Apr 2009 03:44:29 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of avinash.lakshman@gmail.com designates 209.85.146.180 as permitted sender) Received: from [209.85.146.180] (HELO wa-out-1112.google.com) (209.85.146.180) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 02 Apr 2009 03:44:19 +0000 Received: by wa-out-1112.google.com with SMTP id n7so192774wag.12 for ; Wed, 01 Apr 2009 20:43:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=Vok1yAhUTEf63gobfBYGDgRvb5XZNZMaPyYw9V5zDII=; b=OWGrRZMy9OPgeKgThuwWNklxKiCmbAtB/xUd5Dhte2NfUMB0T+Vuc1t5iUmS2xBqKJ g7Yo4YQUtHNgQk4khxpg1pHKcIEWnbHwn49yZG0otho6Vb68oCJgcy7JXs3Zg/A3W0Qc s8WTW2/47CergT151hXS9KCGng/wXHnWZ7ZTk= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=DCAHlHJ93zlmBRYIrpDmSqH3jR62UawltJQvOhuy7jSMfnE7XQ3WW3QztT6ER5GIml Gd0BeP+tKZfVimBm0atmCRRIZc3aMIpCKGUTvMw1rdHUz2u3b89Q8binybBf6YU5wKRD MtKNkykbowRaFol86c4W2Oc4f2Njszb92a5T4= MIME-Version: 1.0 Received: by 10.114.197.10 with SMTP id u10mr5638894waf.96.1238643839136; Wed, 01 Apr 2009 20:43:59 -0700 (PDT) In-Reply-To: References: Date: Wed, 1 Apr 2009 20:43:59 -0700 Message-ID: Subject: Re: OPHF From: Avinash Lakshman To: cassandra-dev@incubator.apache.org Content-Type: multipart/alternative; boundary=001636457ae008f91e04668a3e79 X-Virus-Checked: Checked by ClamAV on apache.org --001636457ae008f91e04668a3e79 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit So why do you want to sort with collation? Because if you sort that way for distribution then everything from sorting of keys on disk will have to change. This is not something that I can see any apparent reason for. But I will look at this. Avinash On Wed, Apr 1, 2009 at 3:34 PM, Jonathan Ellis wrote: > So with a String ring you just sort the strings (conceptually) and you > wrap around from strings[last] to strings[0]. The range is entirely > defined by what tokens you have. > > It's nice to be able to test against a real application. Mine won't > run without range queries... (So it is still running against my > github code where I first wrote range queries, but that uses the OPHF > approach you have and so that is where I found the problems.) > > -Jonathan > > On Wed, Apr 1, 2009 at 4:27 PM, Avinash Lakshman > wrote: > > I will look into this. Another point about P2P rings is something about > the > > ability to reason about the system. In any hash fn we hve a notion of > range > > and how the ring wraps around. For eg. in random hash such as MD5 we say > > 0-2^128 -1 and then the wrap around. Similarly for the hash fn used here > one > > could define the same. Today we perform usual sort of strings and the > hash > > should respect that sort order. I guess you are saying the usual sort of > > string does not respect collation. I will look into this and I insist > this > > should be very doable. Changes to the implementation of the core classes > > albeit how simple they are very very scary unless they are completely > tested > > too in a distributed setting. Over here we do not have detailed test > code, > > but we test by directing a % of the site traffic to a test cluster before > we > > sign off on anything. > > > > Avinash > > > > On Wed, Apr 1, 2009 at 3:00 PM, Jonathan Ellis > wrote: > > > >> Say for instance you have inserted keys ['a', 'b', '--a', '--b'] and > >> you are going to do range queries on them. The correct > >> collation-aware sort is ['a', '--a', 'b', '--b']. But ordering by > >> char value gives ['--a', '--b', 'a', 'b']. > >> > >> Attached is a test program illustrating this. The assert will fail > >> because the hash ordering is not the same as the collator's. > >> > >> -Jonathan > >> > >> On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman > >> wrote: > >> > In fact what would you want is hash to respect oorder based on how the > >> > strings are sorted. For the example you have it does. So am I missing > >> > something? > >> > > >> > Avinash > >> > > >> > On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis > >> wrote: > >> > > >> >> For that aspect no difference between a String ring based on > compareTo > >> >> and a BigInteger one. The only difference (and it is an important > one > >> >> for the reasons I gave!) is how the compare works. But for the p2p > >> >> aspect it does not matter. > >> >> > >> >> On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman > >> >> wrote: > >> >> > Doing what you are suggesting scares the hell out of me for a > couple > >> of > >> >> > reasons - All work in P2P be it random/OPHF does the token > handling > >> the > >> >> way > >> >> > it is setup. I cannot try something that has not been well explored > in > >> >> > academia. I insist this must be doable. I am going to think about > this > >> >> more. > >> >> > > >> >> > Avinash > >> >> > > >> >> > On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis > >> >> wrote: > >> >> > > >> >> >> Avinash already commited his new order-preserving hash function > and I > >> >> >> missed it. It's in OrderPreservingHashPartitioner. It takes the > >> >> >> approach that Todd and I discussed back in January: turn the > string > >> >> >> into a base-Char.MAX_VALUE number. > >> >> >> ( > >> >> >> > >> >> > >> > http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed > >> >> >> ). > >> >> >> I chatted with Avinash a little on IM but we didn't finish, so > I'm > >> >> >> picking it up here. > >> >> >> > >> >> >> There are two problems with this approach. One is that the hashes > >> >> >> will only be order-preserving for a subset of unicode (all of > UCS-2, > >> >> >> but not all of UTF-16; see > http://en.wikipedia.org/wiki/UTF-16/UCS-2 > >> ). > >> >> >> The other is that this only gives you a naive ordering by code > point > >> >> >> value, which for unicode is not what you want and even for ascii > >> >> >> sometimes you want another collation. (see > >> >> >> http://www.unicode.org/reports/tr10/ and > >> >> >> http://java.sun.com/javase/6/docs/api/java/text/Collator.html). > >> >> >> > >> >> >> Say for instance you have inserted keys ['a', 'b', '--a', '--b'] > and > >> >> >> you are going to do range queries on them. The correct > >> >> >> collation-aware sort is ['a', '--a', 'b', '--b']. But ordering by > >> >> >> char value gives ['--a', '--b', 'a', 'b']. > >> >> >> > >> >> >> Switching to a more flexibile system like the one I wrote for > >> >> >> CASSANDRA-3 will let use use Token for random > >> distribution > >> >> >> or Token for order-preserving, with user-defined > collation. > >> I > >> >> >> don't see a way to get this kind of flexibility from an approach > that > >> >> >> insists on turning everything into BigInteger. > >> >> >> > >> >> >> -Jonathan > >> >> >> > >> >> >> On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis < > jbellis@gmail.com> > >> >> wrote: > >> >> >> > Avinash, > >> >> >> > > >> >> >> > You mentioned that you have a new order-preserving hash function > >> that > >> >> >> > you think will be more generally useful. Can you post it? > >> >> >> > > >> >> >> > thanks, > >> >> >> > > >> >> >> > -Jonathan > >> >> >> > > >> >> >> > >> >> > > >> >> > >> > > >> > > > --001636457ae008f91e04668a3e79--