Return-Path: Delivered-To: apmail-incubator-cassandra-dev-archive@minotaur.apache.org Received: (qmail 8290 invoked from network); 5 Apr 2009 05:42:44 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 5 Apr 2009 05:42:44 -0000 Received: (qmail 94161 invoked by uid 500); 5 Apr 2009 05:42:43 -0000 Delivered-To: apmail-incubator-cassandra-dev-archive@incubator.apache.org Received: (qmail 94115 invoked by uid 500); 5 Apr 2009 05:42:43 -0000 Mailing-List: contact cassandra-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: cassandra-dev@incubator.apache.org Delivered-To: mailing list cassandra-dev@incubator.apache.org Received: (qmail 94105 invoked by uid 99); 5 Apr 2009 05:42:43 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 05 Apr 2009 05:42:43 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of schumi.han@gmail.com designates 209.85.146.178 as permitted sender) Received: from [209.85.146.178] (HELO wa-out-1112.google.com) (209.85.146.178) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 05 Apr 2009 05:42:33 +0000 Received: by wa-out-1112.google.com with SMTP id n7so864590wag.12 for ; Sat, 04 Apr 2009 22:42:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=+cbRmt31b2hfVesNCWE9y+CZpXblridHuERu1sRiNkI=; b=GktdhvG/ta3pLbO8tfIC5+A3rdQrWQ5N70lkPIjH4IUfBkpTRcHQ2M/04sKxNFno1s RbPksm8ZDke4G42H9KGPumTQNk7nXysjQj7of45lXhYzjCaCW6dpNIWTGTlpV05m2bz2 m6D4PS5jRb1voluvkGRGXJCqd8r1X0oDbDWSE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=rRpZGLDCzHJK56RmjJLoAwWIMhWR+1od7RkwFPhwrmQ55gkq9uLUR8OQLV1IMQc+Ye i4A2f+prKw2q29CZLEHCM7hQAdJadu6rlOIP9cr3RZsE/53Ob4U5Z+VlrQ+TGTAwuiGr SQW/Onq2JPNDTMnzwGwOsAlDDtNllLKzOQKoc= MIME-Version: 1.0 Received: by 10.114.184.11 with SMTP id h11mr1535039waf.100.1238910132797; Sat, 04 Apr 2009 22:42:12 -0700 (PDT) In-Reply-To: <1DB35D6C-485E-4A7B-B6B9-11409C90D124@gmail.com> References: <4e777ed10904040018r6928cc6td21eace06c0b99ef@mail.gmail.com> <4e777ed10904042032w55fffc91vd94b450d0aae17c9@mail.gmail.com> <1DB35D6C-485E-4A7B-B6B9-11409C90D124@gmail.com> Date: Sun, 5 Apr 2009 13:42:12 +0800 Message-ID: <4e777ed10904042242o38f38277xb27ba0332a7e524e@mail.gmail.com> Subject: Re: OPHF From: Zhu Han To: cassandra-dev@incubator.apache.org Content-Type: multipart/alternative; boundary=00163646d8705fc2b50466c83e41 X-Virus-Checked: Checked by ClamAV on apache.org --00163646d8705fc2b50466c83e41 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit I see. Thanks! best regards, hanzhu On Sun, Apr 5, 2009 at 12:24 PM, Jonathan ellis wrote: > Note that Avinash's ophf is 1-1 up to the max key length. But this has its > own problems. (Mainly, that keys cannot be arbitrarily long, and hash > computation gets more expensive proportional to max key length.) > > -Jonathan > > > On Apr 4, 2009, at 10:32 PM, Zhu Han wrote: > > Avinash, >> >> (For cassandra, the "key" mentioned in following words is the name of the >> row. The "token" mentioned in following words, is the output of the hash >> function. ) >> >> For example, if you use 160bit md5 hash value as the space, it's finite >> theoretically, because you know the number of tokens which can be >> contained >> ahead of time, although the number is pretty large. For finite hash space, >> the mapping from key to token is many to one instead of one to one. >> >> For OHPF, a lot of keys could be mapped to the same token range because >> of >> the non-uniform distribution just as you pointed out. The load balancing >> can >> solve the problem. However, if too may keys mapped to *the same token* >> which is determined by the hash function, how can the load balance be >> carried out? When the hash space is finite, it's possible that a large >> number of keys are mapped to the same token. However, if the hash function >> is uniformly distributed, just as MD5 or SHA1, the probability of the >> worse >> case is pretty rare, at least this case cannot be produced by human >> purposely, so that's not a problem. But for OPHF, if the clients pick up >> the name of key purposely, e.g. for DOS attack, the worst case can occur >> practically. When the attacker knows the algorithm of the OPHF, he can >> produce arbitrary number of keys mapped to the same token. >> >> If the hash space is infinite, or it's as large as the possible space of >> keys, this would not be a problem, because the key to token mapping is one >> to one instead of many to one . >> >> best regards, >> hanzhu >> >> >> On Sat, Apr 4, 2009 at 11:59 PM, Avinash Lakshman < >> avinash.lakshman@gmail.com> wrote: >> >> Not sure what you mean by infinite. OPHF or not any hash function's range >>> is >>> practically infinite for MD5 is infinite hash range. With OPHF you will >>> have >>> a lot of skew. There are only two ways to deal with it: >>> (1) You know the space of of all your keys ahead of time and have your >>> system exploit that. >>> (2) Dynamically load balance until your system reaches steady state. >>> >>> We are always sorting lexicographically and trying to do both of the >>> above. >>> >>> Avinash >>> >>> On Sat, Apr 4, 2009 at 12:18 AM, Zhu Han wrote: >>> >>> Avinash, >>>> >>>> I support the proposal of Jonathan. >>>> >>>> Theoretically, if the system wants to support range query over >>>> >>> consistent >>> >>>> hash(DHT), the hash space should be infinite. Otherwise, a large >>>> number >>>> of >>>> keys could be mapped to the same token and it's impossible to do load >>>> balance by moving these keys around to other physical servers. >>>> >>> Especially, >>> >>>> the churn of OPHF could be very high because the user would like to >>>> >>> create >>> >>>> a >>>> lot of keys with the same long prefix for some applications. If the >>>> >>> system >>> >>>> takes the string as the token and sorts them in *lexicographic order, >>>> the >>>> hash space would be infinite, so that the churn can be solved by load >>>> balacing under any circumstances. >>>> >>>> *best regards, >>>> hanzhu >>>> >>>> >>>> On Thu, Apr 2, 2009 at 11:04 PM, Avinash Lakshman < >>>> avinash.lakshman@gmail.com> wrote: >>>> >>>> So how are you coming up with the tokens here? What do you mean by >>>>> string[0] >>>>> and string[last]? Are they the keys that belong to the system? If so >>>>> >>>> then >>> >>>> that is not a mechanism to reason about the range of your hash >>>>> >>>> function. >>> >>>> In >>>> >>>>> this scheme does it mean you will need to sort the tokens too using >>>>> collation scheme used by the external partitioner while identifying >>>>> >>>> which >>> >>>> key goes to which node. Also could you provide me with the patch >>>>> >>>> number. >>> >>>> I >>>> >>>>> need to go over that this weekend and make sure if it does not affect >>>>> >>>> the >>> >>>> (bootstrap/load balance) logic. My fear is that these changes have long >>>>> reaching effects and I need to make sure that all these pieces will >>>>> continue >>>>> to reside in harmony. Also your range query test did it run in a >>>>> distributed >>>>> environment or in a single box environment? >>>>> Avinash >>>>> >>>>> On Wed, Apr 1, 2009 at 3:34 PM, Jonathan Ellis >>>>> >>>> wrote: >>>> >>>>> >>>>> So with a String ring you just sort the strings (conceptually) and >>>>>> >>>>> you >>> >>>> wrap around from strings[last] to strings[0]. The range is entirely >>>>>> defined by what tokens you have. >>>>>> >>>>>> It's nice to be able to test against a real application. Mine won't >>>>>> run without range queries... (So it is still running against my >>>>>> github code where I first wrote range queries, but that uses the OPHF >>>>>> approach you have and so that is where I found the problems.) >>>>>> >>>>>> -Jonathan >>>>>> >>>>>> On Wed, Apr 1, 2009 at 4:27 PM, Avinash Lakshman >>>>>> wrote: >>>>>> >>>>>>> I will look into this. Another point about P2P rings is something >>>>>>> >>>>>> about >>>> >>>>> the >>>>>> >>>>>>> ability to reason about the system. In any hash fn we hve a notion >>>>>>> >>>>>> of >>> >>>> range >>>>>> >>>>>>> and how the ring wraps around. For eg. in random hash such as MD5 >>>>>>> >>>>>> we >>> >>>> say >>>>> >>>>>> 0-2^128 -1 and then the wrap around. Similarly for the hash fn used >>>>>>> >>>>>> here >>>>> >>>>>> one >>>>>> >>>>>>> could define the same. Today we perform usual sort of strings and >>>>>>> >>>>>> the >>> >>>> hash >>>>>> >>>>>>> should respect that sort order. I guess you are saying the usual >>>>>>> >>>>>> sort >>> >>>> of >>>>> >>>>>> string does not respect collation. I will look into this and I >>>>>>> >>>>>> insist >>> >>>> this >>>>>> >>>>>>> should be very doable. Changes to the implementation of the core >>>>>>> >>>>>> classes >>>>> >>>>>> albeit how simple they are very very scary unless they are >>>>>>> >>>>>> completely >>> >>>> tested >>>>>> >>>>>>> too in a distributed setting. Over here we do not have detailed >>>>>>> >>>>>> test >>> >>>> code, >>>>>> >>>>>>> but we test by directing a % of the site traffic to a test cluster >>>>>>> >>>>>> before >>>>> >>>>>> we >>>>>> >>>>>>> sign off on anything. >>>>>>> >>>>>>> Avinash >>>>>>> >>>>>>> On Wed, Apr 1, 2009 at 3:00 PM, Jonathan Ellis >>>>>>> >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> Say for instance you have inserted keys ['a', 'b', '--a', '--b'] >>>>>>>> >>>>>>> and >>> >>>> you are going to do range queries on them. The correct >>>>>>>> collation-aware sort is ['a', '--a', 'b', '--b']. But ordering by >>>>>>>> char value gives ['--a', '--b', 'a', 'b']. >>>>>>>> >>>>>>>> Attached is a test program illustrating this. The assert will >>>>>>>> >>>>>>> fail >>> >>>> because the hash ordering is not the same as the collator's. >>>>>>>> >>>>>>>> -Jonathan >>>>>>>> >>>>>>>> On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman >>>>>>>> wrote: >>>>>>>> >>>>>>>>> In fact what would you want is hash to respect oorder based on >>>>>>>>> >>>>>>>> how >>> >>>> the >>>>> >>>>>> strings are sorted. For the example you have it does. So am I >>>>>>>>> >>>>>>>> missing >>>>> >>>>>> something? >>>>>>>>> >>>>>>>>> Avinash >>>>>>>>> >>>>>>>>> On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis < >>>>>>>>> >>>>>>>> jbellis@gmail.com >>> >>>> >>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> For that aspect no difference between a String ring based on >>>>>>>>>> >>>>>>>>> compareTo >>>>>> >>>>>>> and a BigInteger one. The only difference (and it is an >>>>>>>>>> >>>>>>>>> important >>>> >>>>> one >>>>>> >>>>>>> for the reasons I gave!) is how the compare works. But for the >>>>>>>>>> >>>>>>>>> p2p >>>> >>>>> aspect it does not matter. >>>>>>>>>> >>>>>>>>>> On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Doing what you are suggesting scares the hell out of me for a >>>>>>>>>>> >>>>>>>>>> couple >>>>>> >>>>>>> of >>>>>>>> >>>>>>>>> reasons - All work in P2P be it random/OPHF does the token >>>>>>>>>>> >>>>>>>>>> handling >>>>>> >>>>>>> the >>>>>>>> >>>>>>>>> way >>>>>>>>>> >>>>>>>>>>> it is setup. I cannot try something that has not been well >>>>>>>>>>> >>>>>>>>>> explored >>>>> >>>>>> in >>>>>> >>>>>>> academia. I insist this must be doable. I am going to think >>>>>>>>>>> >>>>>>>>>> about >>>> >>>>> this >>>>>> >>>>>>> more. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Avinash >>>>>>>>>>> >>>>>>>>>>> On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis < >>>>>>>>>>> >>>>>>>>>> jbellis@gmail.com> >>>>> >>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Avinash already commited his new order-preserving hash >>>>>>>>>>>> >>>>>>>>>>> function >>>> >>>>> and I >>>>>> >>>>>>> missed it. It's in OrderPreservingHashPartitioner. It >>>>>>>>>>>> >>>>>>>>>>> takes >>> >>>> the >>>>> >>>>>> approach that Todd and I discussed back in January: turn the >>>>>>>>>>>> >>>>>>>>>>> string >>>>>> >>>>>>> into a base-Char.MAX_VALUE number. >>>>>>>>>>>> ( >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed >>> >>>> ). >>>>>>>>>>>> I chatted with Avinash a little on IM but we didn't finish, >>>>>>>>>>>> >>>>>>>>>>> so >>>> >>>>> I'm >>>>>> >>>>>>> picking it up here. >>>>>>>>>>>> >>>>>>>>>>>> There are two problems with this approach. One is that the >>>>>>>>>>>> >>>>>>>>>>> hashes >>>>> >>>>>> will only be order-preserving for a subset of unicode (all >>>>>>>>>>>> >>>>>>>>>>> of >>> >>>> UCS-2, >>>>>> >>>>>>> but not all of UTF-16; see >>>>>>>>>>>> >>>>>>>>>>> http://en.wikipedia.org/wiki/UTF-16/UCS-2 >>>>>> >>>>>>> ). >>>>>>>> >>>>>>>>> The other is that this only gives you a naive ordering by >>>>>>>>>>>> >>>>>>>>>>> code >>>> >>>>> point >>>>>> >>>>>>> value, which for unicode is not what you want and even for >>>>>>>>>>>> >>>>>>>>>>> ascii >>>> >>>>> sometimes you want another collation. (see >>>>>>>>>>>> http://www.unicode.org/reports/tr10/ and >>>>>>>>>>>> >>>>>>>>>>>> http://java.sun.com/javase/6/docs/api/java/text/Collator.html >>> >>>> ). >>>> >>>>> >>>>>>>>>>>> Say for instance you have inserted keys ['a', 'b', '--a', >>>>>>>>>>>> >>>>>>>>>>> '--b'] >>>> >>>>> and >>>>>> >>>>>>> you are going to do range queries on them. The correct >>>>>>>>>>>> collation-aware sort is ['a', '--a', 'b', '--b']. But >>>>>>>>>>>> >>>>>>>>>>> ordering >>>> >>>>> by >>>>> >>>>>> char value gives ['--a', '--b', 'a', 'b']. >>>>>>>>>>>> >>>>>>>>>>>> Switching to a more flexibile system like the one I wrote >>>>>>>>>>>> >>>>>>>>>>> for >>> >>>> CASSANDRA-3 will let use use Token for random >>>>>>>>>>>> >>>>>>>>>>> distribution >>>>>>>> >>>>>>>>> or Token for order-preserving, with user-defined >>>>>>>>>>>> >>>>>>>>>>> collation. >>>>>> >>>>>>> I >>>>>>>> >>>>>>>>> don't see a way to get this kind of flexibility from an >>>>>>>>>>>> >>>>>>>>>>> approach >>>> >>>>> that >>>>>> >>>>>>> insists on turning everything into BigInteger. >>>>>>>>>>>> >>>>>>>>>>>> -Jonathan >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis < >>>>>>>>>>>> >>>>>>>>>>> jbellis@gmail.com> >>>>>> >>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Avinash, >>>>>>>>>>>>> >>>>>>>>>>>>> You mentioned that you have a new order-preserving hash >>>>>>>>>>>>> >>>>>>>>>>>> function >>>>> >>>>>> that >>>>>>>> >>>>>>>>> you think will be more generally useful. Can you post it? >>>>>>>>>>>>> >>>>>>>>>>>>> thanks, >>>>>>>>>>>>> >>>>>>>>>>>>> -Jonathan >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> --00163646d8705fc2b50466c83e41--