Return-Path: Delivered-To: apmail-incubator-cassandra-dev-archive@minotaur.apache.org Received: (qmail 83411 invoked from network); 5 Apr 2009 03:32:57 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 5 Apr 2009 03:32:57 -0000 Received: (qmail 63746 invoked by uid 500); 5 Apr 2009 03:32:57 -0000 Delivered-To: apmail-incubator-cassandra-dev-archive@incubator.apache.org Received: (qmail 63698 invoked by uid 500); 5 Apr 2009 03:32:56 -0000 Mailing-List: contact cassandra-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: cassandra-dev@incubator.apache.org Delivered-To: mailing list cassandra-dev@incubator.apache.org Received: (qmail 63688 invoked by uid 99); 5 Apr 2009 03:32:56 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 05 Apr 2009 03:32:56 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of schumi.han@gmail.com designates 209.85.146.178 as permitted sender) Received: from [209.85.146.178] (HELO wa-out-1112.google.com) (209.85.146.178) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 05 Apr 2009 03:32:47 +0000 Received: by wa-out-1112.google.com with SMTP id n7so854176wag.12 for ; Sat, 04 Apr 2009 20:32:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=xQtSPaj5mqFrBqJy2+JehtC9dOYfWOWA/YCAfG4GycI=; b=OXfhnmCgB1lcD9sfm7KcANyp28wFxBrkYcT+5y1JVhCNmIgQ/12BkhEMB+SYDm3NcU hz071vSFiFCaYHWTMCI0njrfqbgKFgwKIFAiCGYJOSJ1lsNMgi4pO+vNHEB0gA0BXex9 Th1vvVNsYzeGO0SljVMKCA/R02vfeQOYQWxY0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=CAq31pR023H4J/lpBP9X3XjH7Y3e7zXix5rSyfWanbw++0ChVyqIDBSMRRwHgkWNht NbNN4EbmFBe7a/ZVBdna2GQpxH4BAlz8Lnq7s7FYV3+X6cbRQfTDq3bpm8z8M1OgOjzs r8QvK5UfHWj6jP+cFPVod/yl63Cam2RfCa/zI= MIME-Version: 1.0 Received: by 10.114.210.3 with SMTP id i3mr357484wag.226.1238902346688; Sat, 04 Apr 2009 20:32:26 -0700 (PDT) In-Reply-To: References: <4e777ed10904040018r6928cc6td21eace06c0b99ef@mail.gmail.com> Date: Sun, 5 Apr 2009 11:32:26 +0800 Message-ID: <4e777ed10904042032w55fffc91vd94b450d0aae17c9@mail.gmail.com> Subject: Re: OPHF From: Zhu Han To: cassandra-dev@incubator.apache.org Content-Type: multipart/alternative; boundary=0016364c5adf492a680466c66ee5 X-Virus-Checked: Checked by ClamAV on apache.org --0016364c5adf492a680466c66ee5 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Avinash, (For cassandra, the "key" mentioned in following words is the name of the row. The "token" mentioned in following words, is the output of the hash function. ) For example, if you use 160bit md5 hash value as the space, it's finite theoretically, because you know the number of tokens which can be contained ahead of time, although the number is pretty large. For finite hash space, the mapping from key to token is many to one instead of one to one. For OHPF, a lot of keys could be mapped to the same token range because of the non-uniform distribution just as you pointed out. The load balancing can solve the problem. However, if too may keys mapped to *the same token* which is determined by the hash function, how can the load balance be carried out? When the hash space is finite, it's possible that a large number of keys are mapped to the same token. However, if the hash function is uniformly distributed, just as MD5 or SHA1, the probability of the worse case is pretty rare, at least this case cannot be produced by human purposely, so that's not a problem. But for OPHF, if the clients pick up the name of key purposely, e.g. for DOS attack, the worst case can occur practically. When the attacker knows the algorithm of the OPHF, he can produce arbitrary number of keys mapped to the same token. If the hash space is infinite, or it's as large as the possible space of keys, this would not be a problem, because the key to token mapping is one to one instead of many to one . best regards, hanzhu On Sat, Apr 4, 2009 at 11:59 PM, Avinash Lakshman < avinash.lakshman@gmail.com> wrote: > Not sure what you mean by infinite. OPHF or not any hash function's range > is > practically infinite for MD5 is infinite hash range. With OPHF you will > have > a lot of skew. There are only two ways to deal with it: > (1) You know the space of of all your keys ahead of time and have your > system exploit that. > (2) Dynamically load balance until your system reaches steady state. > > We are always sorting lexicographically and trying to do both of the above. > > Avinash > > On Sat, Apr 4, 2009 at 12:18 AM, Zhu Han wrote: > > > Avinash, > > > > I support the proposal of Jonathan. > > > > Theoretically, if the system wants to support range query over > consistent > > hash(DHT), the hash space should be infinite. Otherwise, a large number > > of > > keys could be mapped to the same token and it's impossible to do load > > balance by moving these keys around to other physical servers. > Especially, > > the churn of OPHF could be very high because the user would like to > create > > a > > lot of keys with the same long prefix for some applications. If the > system > > takes the string as the token and sorts them in *lexicographic order, the > > hash space would be infinite, so that the churn can be solved by load > > balacing under any circumstances. > > > > *best regards, > > hanzhu > > > > > > On Thu, Apr 2, 2009 at 11:04 PM, Avinash Lakshman < > > avinash.lakshman@gmail.com> wrote: > > > > > So how are you coming up with the tokens here? What do you mean by > > > string[0] > > > and string[last]? Are they the keys that belong to the system? If so > then > > > that is not a mechanism to reason about the range of your hash > function. > > In > > > this scheme does it mean you will need to sort the tokens too using > > > collation scheme used by the external partitioner while identifying > which > > > key goes to which node. Also could you provide me with the patch > number. > > I > > > need to go over that this weekend and make sure if it does not affect > the > > > (bootstrap/load balance) logic. My fear is that these changes have long > > > reaching effects and I need to make sure that all these pieces will > > > continue > > > to reside in harmony. Also your range query test did it run in a > > > distributed > > > environment or in a single box environment? > > > Avinash > > > > > > On Wed, Apr 1, 2009 at 3:34 PM, Jonathan Ellis > > wrote: > > > > > > > So with a String ring you just sort the strings (conceptually) and > you > > > > wrap around from strings[last] to strings[0]. The range is entirely > > > > defined by what tokens you have. > > > > > > > > It's nice to be able to test against a real application. Mine won't > > > > run without range queries... (So it is still running against my > > > > github code where I first wrote range queries, but that uses the OPHF > > > > approach you have and so that is where I found the problems.) > > > > > > > > -Jonathan > > > > > > > > On Wed, Apr 1, 2009 at 4:27 PM, Avinash Lakshman > > > > wrote: > > > > > I will look into this. Another point about P2P rings is something > > about > > > > the > > > > > ability to reason about the system. In any hash fn we hve a notion > of > > > > range > > > > > and how the ring wraps around. For eg. in random hash such as MD5 > we > > > say > > > > > 0-2^128 -1 and then the wrap around. Similarly for the hash fn used > > > here > > > > one > > > > > could define the same. Today we perform usual sort of strings and > the > > > > hash > > > > > should respect that sort order. I guess you are saying the usual > sort > > > of > > > > > string does not respect collation. I will look into this and I > insist > > > > this > > > > > should be very doable. Changes to the implementation of the core > > > classes > > > > > albeit how simple they are very very scary unless they are > completely > > > > tested > > > > > too in a distributed setting. Over here we do not have detailed > test > > > > code, > > > > > but we test by directing a % of the site traffic to a test cluster > > > before > > > > we > > > > > sign off on anything. > > > > > > > > > > Avinash > > > > > > > > > > On Wed, Apr 1, 2009 at 3:00 PM, Jonathan Ellis > > > > wrote: > > > > > > > > > >> Say for instance you have inserted keys ['a', 'b', '--a', '--b'] > and > > > > >> you are going to do range queries on them. The correct > > > > >> collation-aware sort is ['a', '--a', 'b', '--b']. But ordering by > > > > >> char value gives ['--a', '--b', 'a', 'b']. > > > > >> > > > > >> Attached is a test program illustrating this. The assert will > fail > > > > >> because the hash ordering is not the same as the collator's. > > > > >> > > > > >> -Jonathan > > > > >> > > > > >> On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman > > > > >> wrote: > > > > >> > In fact what would you want is hash to respect oorder based on > how > > > the > > > > >> > strings are sorted. For the example you have it does. So am I > > > missing > > > > >> > something? > > > > >> > > > > > >> > Avinash > > > > >> > > > > > >> > On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis < > jbellis@gmail.com > > > > > > > >> wrote: > > > > >> > > > > > >> >> For that aspect no difference between a String ring based on > > > > compareTo > > > > >> >> and a BigInteger one. The only difference (and it is an > > important > > > > one > > > > >> >> for the reasons I gave!) is how the compare works. But for the > > p2p > > > > >> >> aspect it does not matter. > > > > >> >> > > > > >> >> On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman > > > > >> >> wrote: > > > > >> >> > Doing what you are suggesting scares the hell out of me for a > > > > couple > > > > >> of > > > > >> >> > reasons - All work in P2P be it random/OPHF does the token > > > > handling > > > > >> the > > > > >> >> way > > > > >> >> > it is setup. I cannot try something that has not been well > > > explored > > > > in > > > > >> >> > academia. I insist this must be doable. I am going to think > > about > > > > this > > > > >> >> more. > > > > >> >> > > > > > >> >> > Avinash > > > > >> >> > > > > > >> >> > On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis < > > > jbellis@gmail.com> > > > > >> >> wrote: > > > > >> >> > > > > > >> >> >> Avinash already commited his new order-preserving hash > > function > > > > and I > > > > >> >> >> missed it. It's in OrderPreservingHashPartitioner. It > takes > > > the > > > > >> >> >> approach that Todd and I discussed back in January: turn the > > > > string > > > > >> >> >> into a base-Char.MAX_VALUE number. > > > > >> >> >> ( > > > > >> >> >> > > > > >> >> > > > > >> > > > > > > > > > > http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed > > > > >> >> >> ). > > > > >> >> >> I chatted with Avinash a little on IM but we didn't finish, > > so > > > > I'm > > > > >> >> >> picking it up here. > > > > >> >> >> > > > > >> >> >> There are two problems with this approach. One is that the > > > hashes > > > > >> >> >> will only be order-preserving for a subset of unicode (all > of > > > > UCS-2, > > > > >> >> >> but not all of UTF-16; see > > > > http://en.wikipedia.org/wiki/UTF-16/UCS-2 > > > > >> ). > > > > >> >> >> The other is that this only gives you a naive ordering by > > code > > > > point > > > > >> >> >> value, which for unicode is not what you want and even for > > ascii > > > > >> >> >> sometimes you want another collation. (see > > > > >> >> >> http://www.unicode.org/reports/tr10/ and > > > > >> >> >> > http://java.sun.com/javase/6/docs/api/java/text/Collator.html > > ). > > > > >> >> >> > > > > >> >> >> Say for instance you have inserted keys ['a', 'b', '--a', > > '--b'] > > > > and > > > > >> >> >> you are going to do range queries on them. The correct > > > > >> >> >> collation-aware sort is ['a', '--a', 'b', '--b']. But > > ordering > > > by > > > > >> >> >> char value gives ['--a', '--b', 'a', 'b']. > > > > >> >> >> > > > > >> >> >> Switching to a more flexibile system like the one I wrote > for > > > > >> >> >> CASSANDRA-3 will let use use Token for random > > > > >> distribution > > > > >> >> >> or Token for order-preserving, with user-defined > > > > collation. > > > > >> I > > > > >> >> >> don't see a way to get this kind of flexibility from an > > approach > > > > that > > > > >> >> >> insists on turning everything into BigInteger. > > > > >> >> >> > > > > >> >> >> -Jonathan > > > > >> >> >> > > > > >> >> >> On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis < > > > > jbellis@gmail.com> > > > > >> >> wrote: > > > > >> >> >> > Avinash, > > > > >> >> >> > > > > > >> >> >> > You mentioned that you have a new order-preserving hash > > > function > > > > >> that > > > > >> >> >> > you think will be more generally useful. Can you post it? > > > > >> >> >> > > > > > >> >> >> > thanks, > > > > >> >> >> > > > > > >> >> >> > -Jonathan > > > > >> >> >> > > > > > >> >> >> > > > > >> >> > > > > > >> >> > > > > >> > > > > > >> > > > > > > > > > > > > > > > --0016364c5adf492a680466c66ee5--