Mailing-List: contact cassandra-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: cassandra-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of schumi.han@gmail.com
 designates 209.85.146.178 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type;
        b=rRpZGLDCzHJK56RmjJLoAwWIMhWR+1od7RkwFPhwrmQ55gkq9uLUR8OQLV1IMQc+Ye
         i4A2f+prKw2q29CZLEHCM7hQAdJadu6rlOIP9cr3RZsE/53Ob4U5Z+VlrQ+TGTAwuiGr
         SQW/Onq2JPNDTMnzwGwOsAlDDtNllLKzOQKoc=
MIME-Version: 1.0
In-Reply-To: <1DB35D6C-485E-4A7B-B6B9-11409C90D124@gmail.com>
References: <e06563880903301316q2ceea26tdf70fec93b53dfa5@mail.gmail.com>
	 <a06de5520904011447q5055039dkeb68ce75439640f3@mail.gmail.com>
	 <e06563880904011500ic71c195ia68a81a13fc6b55c@mail.gmail.com>
	 <a06de5520904011527q54eca2d0k205ccb07d7163793@mail.gmail.com>
	 <e06563880904011534r1bcc3a34q5a31e93bdd6242d9@mail.gmail.com>
	 <a06de5520904020804k3715acdk950e45740855a22f@mail.gmail.com>
	 <4e777ed10904040018r6928cc6td21eace06c0b99ef@mail.gmail.com>
	 <a06de5520904040859w4273a83fq479144bb62ca9f26@mail.gmail.com>
	 <4e777ed10904042032w55fffc91vd94b450d0aae17c9@mail.gmail.com>
	 <1DB35D6C-485E-4A7B-B6B9-11409C90D124@gmail.com>
Date: Sun, 5 Apr 2009 13:42:12 +0800
Message-ID: <4e777ed10904042242o38f38277xb27ba0332a7e524e@mail.gmail.com>
Subject: Re: OPHF
From: Zhu Han <schumi.han@gmail.com>
To: cassandra-dev@incubator.apache.org
Content-Type: multipart/alternative; boundary=00163646d8705fc2b50466c83e41

--00163646d8705fc2b50466c83e41
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

I see. Thanks!

best regards,
hanzhu


On Sun, Apr 5, 2009 at 12:24 PM, Jonathan ellis <jbellis@gmail.com> wrote:

> Note that Avinash's ophf is 1-1 up to the max key length.  But this has its
> own problems.  (Mainly, that keys cannot be arbitrarily long, and hash
> computation gets more expensive proportional to max key length.)
>
> -Jonathan
>
>
> On Apr 4, 2009, at 10:32 PM, Zhu Han <schumi.han@gmail.com> wrote:
>
>  Avinash,
>>
>> (For cassandra, the "key" mentioned in following words is the name of the
>> row. The "token" mentioned in following words, is the output of the hash
>> function. )
>>
>> For example, if you use 160bit md5 hash value as the space, it's finite
>> theoretically, because you know the number of tokens which can be
>> contained
>> ahead of time, although the number is pretty large. For finite hash space,
>> the mapping from key to token is many to one instead of one to one.
>>
>> For OHPF, a lot of keys could be mapped to  the same token range because
>> of
>> the non-uniform distribution just as you pointed out. The load balancing
>> can
>> solve the problem. However, if too may keys mapped to *the same token*
>> which is determined by the hash function, how can the load balance be
>> carried out? When the hash space is finite, it's possible that a large
>> number of keys are mapped to the same token. However, if the hash function
>> is uniformly distributed, just as MD5 or SHA1, the probability of the
>> worse
>> case is pretty rare, at least this case  cannot be produced by human
>> purposely,  so that's not a problem. But for OPHF, if the clients pick up
>> the name of key purposely, e.g. for DOS attack, the worst case can occur
>> practically.  When the attacker knows the algorithm of the OPHF, he can
>> produce arbitrary number of keys mapped to the same token.
>>
>> If the hash space is infinite, or it's as large as the possible space of
>> keys, this would not be a problem, because the key to token mapping is one
>> to one instead of many to one .
>>
>> best regards,
>> hanzhu
>>
>>
>> On Sat, Apr 4, 2009 at 11:59 PM, Avinash Lakshman <
>> avinash.lakshman@gmail.com> wrote:
>>
>>  Not sure what you mean by infinite. OPHF or not any hash function's range
>>> is
>>> practically infinite for MD5 is infinite hash range. With OPHF you will
>>> have
>>> a lot of skew. There are only two ways to deal with it:
>>> (1) You know the space of of all your keys ahead of time and have your
>>> system exploit that.
>>> (2) Dynamically load balance until your system reaches steady state.
>>>
>>> We are always sorting lexicographically and trying to do both of the
>>> above.
>>>
>>> Avinash
>>>
>>> On Sat, Apr 4, 2009 at 12:18 AM, Zhu Han <schumi.han@gmail.com> wrote:
>>>
>>>  Avinash,
>>>>
>>>> I support the proposal of Jonathan.
>>>>
>>>> Theoretically,  if the system wants to support range query over
>>>>
>>> consistent
>>>
>>>> hash(DHT),  the hash space should be infinite. Otherwise,  a large
>>>> number
>>>> of
>>>> keys could be mapped to the same token and  it's impossible to do load
>>>> balance by moving these keys around to other physical servers.
>>>>
>>> Especially,
>>>
>>>> the churn of OPHF could be very high because the user would like to
>>>>
>>> create
>>>
>>>> a
>>>> lot of keys with the same long prefix for some applications. If the
>>>>
>>> system
>>>
>>>> takes the string as the token and sorts them in *lexicographic order,
>>>> the
>>>> hash space would be infinite, so that the churn can be solved by load
>>>> balacing under any circumstances.
>>>>
>>>> *best regards,
>>>> hanzhu
>>>>
>>>>
>>>> On Thu, Apr 2, 2009 at 11:04 PM, Avinash Lakshman <
>>>> avinash.lakshman@gmail.com> wrote:
>>>>
>>>>  So how are you coming up with the tokens here? What do you mean by
>>>>> string[0]
>>>>> and string[last]? Are they the keys that belong to the system? If so
>>>>>
>>>> then
>>>
>>>> that is not a mechanism to reason about the range of your hash
>>>>>
>>>> function.
>>>
>>>> In
>>>>
>>>>> this scheme does it mean you will need to sort the tokens too using
>>>>> collation scheme used by the external partitioner while identifying
>>>>>
>>>> which
>>>
>>>> key goes to which node. Also could you provide me with the patch
>>>>>
>>>> number.
>>>
>>>> I
>>>>
>>>>> need to go over that this weekend and make sure if it does not affect
>>>>>
>>>> the
>>>
>>>> (bootstrap/load balance) logic. My fear is that these changes have long
>>>>> reaching effects and I need to make sure that all these pieces will
>>>>> continue
>>>>> to reside in harmony. Also your range query test did it run in a
>>>>> distributed
>>>>> environment or in a single box environment?
>>>>> Avinash
>>>>>
>>>>> On Wed, Apr 1, 2009 at 3:34 PM, Jonathan Ellis <jbellis@gmail.com>
>>>>>
>>>> wrote:
>>>>
>>>>>
>>>>>  So with a String ring you just sort the strings (conceptually) and
>>>>>>
>>>>> you
>>>
>>>> wrap around from strings[last] to strings[0].  The range is entirely
>>>>>> defined by what tokens you have.
>>>>>>
>>>>>> It's nice to be able to test against a real application.  Mine won't
>>>>>> run without range queries...  (So it is still running against my
>>>>>> github code where I first wrote range queries, but that uses the OPHF
>>>>>> approach you have and so that is where I found the problems.)
>>>>>>
>>>>>> -Jonathan
>>>>>>
>>>>>> On Wed, Apr 1, 2009 at 4:27 PM, Avinash Lakshman
>>>>>> <avinash.lakshman@gmail.com> wrote:
>>>>>>
>>>>>>> I will look into this. Another point about P2P rings is something
>>>>>>>
>>>>>> about
>>>>
>>>>> the
>>>>>>
>>>>>>> ability to reason about the system. In any hash fn we hve a notion
>>>>>>>
>>>>>> of
>>>
>>>> range
>>>>>>
>>>>>>> and how the ring wraps around. For eg. in random hash such as MD5
>>>>>>>
>>>>>> we
>>>
>>>> say
>>>>>
>>>>>> 0-2^128 -1 and then the wrap around. Similarly for the hash fn used
>>>>>>>
>>>>>> here
>>>>>
>>>>>> one
>>>>>>
>>>>>>> could define the same. Today we perform usual sort of strings and
>>>>>>>
>>>>>> the
>>>
>>>> hash
>>>>>>
>>>>>>> should respect that sort order. I guess you are saying the usual
>>>>>>>
>>>>>> sort
>>>
>>>> of
>>>>>
>>>>>> string does not respect collation. I will look into this and I
>>>>>>>
>>>>>> insist
>>>
>>>> this
>>>>>>
>>>>>>> should be very doable. Changes to the implementation of the core
>>>>>>>
>>>>>> classes
>>>>>
>>>>>> albeit how simple they are very very scary unless they are
>>>>>>>
>>>>>> completely
>>>
>>>> tested
>>>>>>
>>>>>>> too in a distributed setting. Over here we do not have detailed
>>>>>>>
>>>>>> test
>>>
>>>> code,
>>>>>>
>>>>>>> but we test by directing a % of the site traffic to a test cluster
>>>>>>>
>>>>>> before
>>>>>
>>>>>> we
>>>>>>
>>>>>>> sign off on anything.
>>>>>>>
>>>>>>> Avinash
>>>>>>>
>>>>>>> On Wed, Apr 1, 2009 at 3:00 PM, Jonathan Ellis <jbellis@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>  Say for instance you have inserted keys ['a', 'b', '--a', '--b']
>>>>>>>>
>>>>>>> and
>>>
>>>> you are going to do range queries on them.  The correct
>>>>>>>> collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering by
>>>>>>>> char value gives ['--a', '--b', 'a', 'b'].
>>>>>>>>
>>>>>>>> Attached is a test program illustrating this.  The assert will
>>>>>>>>
>>>>>>> fail
>>>
>>>> because the hash ordering is not the same as the collator's.
>>>>>>>>
>>>>>>>> -Jonathan
>>>>>>>>
>>>>>>>> On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman
>>>>>>>> <avinash.lakshman@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> In fact what would you want is hash to respect oorder based on
>>>>>>>>>
>>>>>>>> how
>>>
>>>> the
>>>>>
>>>>>> strings are sorted. For the example you have it does. So am I
>>>>>>>>>
>>>>>>>> missing
>>>>>
>>>>>> something?
>>>>>>>>>
>>>>>>>>> Avinash
>>>>>>>>>
>>>>>>>>> On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis <
>>>>>>>>>
>>>>>>>> jbellis@gmail.com
>>>
>>>>
>>>>>  wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  For that aspect no difference between a String ring based on
>>>>>>>>>>
>>>>>>>>> compareTo
>>>>>>
>>>>>>> and a BigInteger one.  The only difference (and it is an
>>>>>>>>>>
>>>>>>>>> important
>>>>
>>>>> one
>>>>>>
>>>>>>> for the reasons I gave!) is how the compare works.  But for the
>>>>>>>>>>
>>>>>>>>> p2p
>>>>
>>>>> aspect it does not matter.
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman
>>>>>>>>>> <avinash.lakshman@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Doing what you are suggesting scares the hell out of me for a
>>>>>>>>>>>
>>>>>>>>>> couple
>>>>>>
>>>>>>> of
>>>>>>>>
>>>>>>>>> reasons -  All work in P2P be it random/OPHF does the token
>>>>>>>>>>>
>>>>>>>>>> handling
>>>>>>
>>>>>>> the
>>>>>>>>
>>>>>>>>> way
>>>>>>>>>>
>>>>>>>>>>> it is setup. I cannot try something that has not been well
>>>>>>>>>>>
>>>>>>>>>> explored
>>>>>
>>>>>> in
>>>>>>
>>>>>>> academia. I insist this must be doable. I am going to think
>>>>>>>>>>>
>>>>>>>>>> about
>>>>
>>>>> this
>>>>>>
>>>>>>> more.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Avinash
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis <
>>>>>>>>>>>
>>>>>>>>>> jbellis@gmail.com>
>>>>>
>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  Avinash already commited his new order-preserving hash
>>>>>>>>>>>>
>>>>>>>>>>> function
>>>>
>>>>> and I
>>>>>>
>>>>>>> missed it.  It's in OrderPreservingHashPartitioner.  It
>>>>>>>>>>>>
>>>>>>>>>>> takes
>>>
>>>> the
>>>>>
>>>>>> approach that Todd and I discussed back in January: turn the
>>>>>>>>>>>>
>>>>>>>>>>> string
>>>>>>
>>>>>>> into a base-Char.MAX_VALUE number.
>>>>>>>>>>>> (
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>> http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed
>>>
>>>> ).
>>>>>>>>>>>> I chatted with Avinash a little on IM but we didn't finish,
>>>>>>>>>>>>
>>>>>>>>>>> so
>>>>
>>>>> I'm
>>>>>>
>>>>>>> picking it up here.
>>>>>>>>>>>>
>>>>>>>>>>>> There are two problems with this approach.  One is that the
>>>>>>>>>>>>
>>>>>>>>>>> hashes
>>>>>
>>>>>> will only be order-preserving for a subset of unicode (all
>>>>>>>>>>>>
>>>>>>>>>>> of
>>>
>>>> UCS-2,
>>>>>>
>>>>>>> but not all of UTF-16; see
>>>>>>>>>>>>
>>>>>>>>>>> http://en.wikipedia.org/wiki/UTF-16/UCS-2
>>>>>>
>>>>>>> ).
>>>>>>>>
>>>>>>>>> The other is that this only gives you a naive ordering by
>>>>>>>>>>>>
>>>>>>>>>>> code
>>>>
>>>>> point
>>>>>>
>>>>>>> value, which for unicode is not what you want and even for
>>>>>>>>>>>>
>>>>>>>>>>> ascii
>>>>
>>>>> sometimes you want another collation.  (see
>>>>>>>>>>>> http://www.unicode.org/reports/tr10/ and
>>>>>>>>>>>>
>>>>>>>>>>>>  http://java.sun.com/javase/6/docs/api/java/text/Collator.html
>>>
>>>> ).
>>>>
>>>>>
>>>>>>>>>>>> Say for instance you have inserted keys ['a', 'b', '--a',
>>>>>>>>>>>>
>>>>>>>>>>> '--b']
>>>>
>>>>> and
>>>>>>
>>>>>>> you are going to do range queries on them.  The correct
>>>>>>>>>>>> collation-aware sort is ['a', '--a', 'b', '--b'].  But
>>>>>>>>>>>>
>>>>>>>>>>> ordering
>>>>
>>>>> by
>>>>>
>>>>>> char value gives ['--a', '--b', 'a', 'b'].
>>>>>>>>>>>>
>>>>>>>>>>>> Switching to a more flexibile system like the one I wrote
>>>>>>>>>>>>
>>>>>>>>>>> for
>>>
>>>> CASSANDRA-3 will let use use Token<BigInteger> for random
>>>>>>>>>>>>
>>>>>>>>>>> distribution
>>>>>>>>
>>>>>>>>> or Token<String> for order-preserving, with user-defined
>>>>>>>>>>>>
>>>>>>>>>>> collation.
>>>>>>
>>>>>>> I
>>>>>>>>
>>>>>>>>> don't see a way to get this kind of flexibility from an
>>>>>>>>>>>>
>>>>>>>>>>> approach
>>>>
>>>>> that
>>>>>>
>>>>>>> insists on turning everything into BigInteger.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jonathan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Mar 30, 2009 at 2:16 PM, Jonathan Ellis <
>>>>>>>>>>>>
>>>>>>>>>>> jbellis@gmail.com>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Avinash,
>>>>>>>>>>>>>
>>>>>>>>>>>>> You mentioned that you have a new order-preserving hash
>>>>>>>>>>>>>
>>>>>>>>>>>> function
>>>>>
>>>>>> that
>>>>>>>>
>>>>>>>>> you think will be more generally useful.  Can you post it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jonathan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

--00163646d8705fc2b50466c83e41--