incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zhu Han <schumi....@gmail.com>
Subject Re: OPHF
Date Sun, 05 Apr 2009 05:42:12 GMT
I see. Thanks!

best regards,
hanzhu


On Sun, Apr 5, 2009 at 12:24 PM, Jonathan ellis <jbellis@gmail.com> wrote:

> Note that Avinash's ophf is 1-1 up to the max key length.  But this has its
> own problems.  (Mainly, that keys cannot be arbitrarily long, and hash
> computation gets more expensive proportional to max key length.)
>
> -Jonathan
>
>
> On Apr 4, 2009, at 10:32 PM, Zhu Han <schumi.han@gmail.com> wrote:
>
>  Avinash,
>>
>> (For cassandra, the "key" mentioned in following words is the name of the
>> row. The "token" mentioned in following words, is the output of the hash
>> function. )
>>
>> For example, if you use 160bit md5 hash value as the space, it's finite
>> theoretically, because you know the number of tokens which can be
>> contained
>> ahead of time, although the number is pretty large. For finite hash space,
>> the mapping from key to token is many to one instead of one to one.
>>
>> For OHPF, a lot of keys could be mapped to  the same token range because
>> of
>> the non-uniform distribution just as you pointed out. The load balancing
>> can
>> solve the problem. However, if too may keys mapped to *the same token*
>> which is determined by the hash function, how can the load balance be
>> carried out? When the hash space is finite, it's possible that a large
>> number of keys are mapped to the same token. However, if the hash function
>> is uniformly distributed, just as MD5 or SHA1, the probability of the
>> worse
>> case is pretty rare, at least this case  cannot be produced by human
>> purposely,  so that's not a problem. But for OPHF, if the clients pick up
>> the name of key purposely, e.g. for DOS attack, the worst case can occur
>> practically.  When the attacker knows the algorithm of the OPHF, he can
>> produce arbitrary number of keys mapped to the same token.
>>
>> If the hash space is infinite, or it's as large as the possible space of
>> keys, this would not be a problem, because the key to token mapping is one
>> to one instead of many to one .
>>
>> best regards,
>> hanzhu
>>
>>
>> On Sat, Apr 4, 2009 at 11:59 PM, Avinash Lakshman <
>> avinash.lakshman@gmail.com> wrote:
>>
>>  Not sure what you mean by infinite. OPHF or not any hash function's range
>>> is
>>> practically infinite for MD5 is infinite hash range. With OPHF you will
>>> have
>>> a lot of skew. There are only two ways to deal with it:
>>> (1) You know the space of of all your keys ahead of time and have your
>>> system exploit that.
>>> (2) Dynamically load balance until your system reaches steady state.
>>>
>>> We are always sorting lexicographically and trying to do both of the
>>> above.
>>>
>>> Avinash
>>>
>>> On Sat, Apr 4, 2009 at 12:18 AM, Zhu Han <schumi.han@gmail.com> wrote:
>>>
>>>  Avinash,
>>>>
>>>> I support the proposal of Jonathan.
>>>>
>>>> Theoretically,  if the system wants to support range query over
>>>>
>>> consistent
>>>
>>>> hash(DHT),  the hash space should be infinite. Otherwise,  a large
>>>> number
>>>> of
>>>> keys could be mapped to the same token and  it's impossible to do load
>>>> balance by moving these keys around to other physical servers.
>>>>
>>> Especially,
>>>
>>>> the churn of OPHF could be very high because the user would like to
>>>>
>>> create
>>>
>>>> a
>>>> lot of keys with the same long prefix for some applications. If the
>>>>
>>> system
>>>
>>>> takes the string as the token and sorts them in *lexicographic order,
>>>> the
>>>> hash space would be infinite, so that the churn can be solved by load
>>>> balacing under any circumstances.
>>>>
>>>> *best regards,
>>>> hanzhu
>>>>
>>>>
>>>> On Thu, Apr 2, 2009 at 11:04 PM, Avinash Lakshman <
>>>> avinash.lakshman@gmail.com> wrote:
>>>>
>>>>  So how are you coming up with the tokens here? What do you mean by
>>>>> string[0]
>>>>> and string[last]? Are they the keys that belong to the system? If so
>>>>>
>>>> then
>>>
>>>> that is not a mechanism to reason about the range of your hash
>>>>>
>>>> function.
>>>
>>>> In
>>>>
>>>>> this scheme does it mean you will need to sort the tokens too using
>>>>> collation scheme used by the external partitioner while identifying
>>>>>
>>>> which
>>>
>>>> key goes to which node. Also could you provide me with the patch
>>>>>
>>>> number.
>>>
>>>> I
>>>>
>>>>> need to go over that this weekend and make sure if it does not affect
>>>>>
>>>> the
>>>
>>>> (bootstrap/load balance) logic. My fear is that these changes have long
>>>>> reaching effects and I need to make sure that all these pieces will
>>>>> continue
>>>>> to reside in harmony. Also your range query test did it run in a
>>>>> distributed
>>>>> environment or in a single box environment?
>>>>> Avinash
>>>>>
>>>>> On Wed, Apr 1, 2009 at 3:34 PM, Jonathan Ellis <jbellis@gmail.com>
>>>>>
>>>> wrote:
>>>>
>>>>>
>>>>>  So with a String ring you just sort the strings (conceptually) and
>>>>>>
>>>>> you
>>>
>>>> wrap around from strings[last] to strings[0].  The range is entirely
>>>>>> defined by what tokens you have.
>>>>>>
>>>>>> It's nice to be able to test against a real application.  Mine won't
>>>>>> run without range queries...  (So it is still running against my
>>>>>> github code where I first wrote range queries, but that uses the
OPHF
>>>>>> approach you have and so that is where I found the problems.)
>>>>>>
>>>>>> -Jonathan
>>>>>>
>>>>>> On Wed, Apr 1, 2009 at 4:27 PM, Avinash Lakshman
>>>>>> <avinash.lakshman@gmail.com> wrote:
>>>>>>
>>>>>>> I will look into this. Another point about P2P rings is something
>>>>>>>
>>>>>> about
>>>>
>>>>> the
>>>>>>
>>>>>>> ability to reason about the system. In any hash fn we hve a notion
>>>>>>>
>>>>>> of
>>>
>>>> range
>>>>>>
>>>>>>> and how the ring wraps around. For eg. in random hash such as
MD5
>>>>>>>
>>>>>> we
>>>
>>>> say
>>>>>
>>>>>> 0-2^128 -1 and then the wrap around. Similarly for the hash fn used
>>>>>>>
>>>>>> here
>>>>>
>>>>>> one
>>>>>>
>>>>>>> could define the same. Today we perform usual sort of strings
and
>>>>>>>
>>>>>> the
>>>
>>>> hash
>>>>>>
>>>>>>> should respect that sort order. I guess you are saying the usual
>>>>>>>
>>>>>> sort
>>>
>>>> of
>>>>>
>>>>>> string does not respect collation. I will look into this and I
>>>>>>>
>>>>>> insist
>>>
>>>> this
>>>>>>
>>>>>>> should be very doable. Changes to the implementation of the core
>>>>>>>
>>>>>> classes
>>>>>
>>>>>> albeit how simple they are very very scary unless they are
>>>>>>>
>>>>>> completely
>>>
>>>> tested
>>>>>>
>>>>>>> too in a distributed setting. Over here we do not have detailed
>>>>>>>
>>>>>> test
>>>
>>>> code,
>>>>>>
>>>>>>> but we test by directing a % of the site traffic to a test cluster
>>>>>>>
>>>>>> before
>>>>>
>>>>>> we
>>>>>>
>>>>>>> sign off on anything.
>>>>>>>
>>>>>>> Avinash
>>>>>>>
>>>>>>> On Wed, Apr 1, 2009 at 3:00 PM, Jonathan Ellis <jbellis@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>  Say for instance you have inserted keys ['a', 'b', '--a', '--b']
>>>>>>>>
>>>>>>> and
>>>
>>>> you are going to do range queries on them.  The correct
>>>>>>>> collation-aware sort is ['a', '--a', 'b', '--b'].  But ordering
by
>>>>>>>> char value gives ['--a', '--b', 'a', 'b'].
>>>>>>>>
>>>>>>>> Attached is a test program illustrating this.  The assert
will
>>>>>>>>
>>>>>>> fail
>>>
>>>> because the hash ordering is not the same as the collator's.
>>>>>>>>
>>>>>>>> -Jonathan
>>>>>>>>
>>>>>>>> On Wed, Apr 1, 2009 at 3:47 PM, Avinash Lakshman
>>>>>>>> <avinash.lakshman@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> In fact what would you want is hash to respect oorder
based on
>>>>>>>>>
>>>>>>>> how
>>>
>>>> the
>>>>>
>>>>>> strings are sorted. For the example you have it does. So am I
>>>>>>>>>
>>>>>>>> missing
>>>>>
>>>>>> something?
>>>>>>>>>
>>>>>>>>> Avinash
>>>>>>>>>
>>>>>>>>> On Wed, Apr 1, 2009 at 2:43 PM, Jonathan Ellis <
>>>>>>>>>
>>>>>>>> jbellis@gmail.com
>>>
>>>>
>>>>>  wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  For that aspect no difference between a String ring
based on
>>>>>>>>>>
>>>>>>>>> compareTo
>>>>>>
>>>>>>> and a BigInteger one.  The only difference (and it is an
>>>>>>>>>>
>>>>>>>>> important
>>>>
>>>>> one
>>>>>>
>>>>>>> for the reasons I gave!) is how the compare works.  But for the
>>>>>>>>>>
>>>>>>>>> p2p
>>>>
>>>>> aspect it does not matter.
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 1, 2009 at 3:40 PM, Avinash Lakshman
>>>>>>>>>> <avinash.lakshman@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Doing what you are suggesting scares the hell
out of me for a
>>>>>>>>>>>
>>>>>>>>>> couple
>>>>>>
>>>>>>> of
>>>>>>>>
>>>>>>>>> reasons -  All work in P2P be it random/OPHF does the
token
>>>>>>>>>>>
>>>>>>>>>> handling
>>>>>>
>>>>>>> the
>>>>>>>>
>>>>>>>>> way
>>>>>>>>>>
>>>>>>>>>>> it is setup. I cannot try something that has
not been well
>>>>>>>>>>>
>>>>>>>>>> explored
>>>>>
>>>>>> in
>>>>>>
>>>>>>> academia. I insist this must be doable. I am going to think
>>>>>>>>>>>
>>>>>>>>>> about
>>>>
>>>>> this
>>>>>>
>>>>>>> more.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Avinash
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 1, 2009 at 2:31 PM, Jonathan Ellis
<
>>>>>>>>>>>
>>>>>>>>>> jbellis@gmail.com>
>>>>>
>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  Avinash already commited his new order-preserving
hash
>>>>>>>>>>>>
>>>>>>>>>>> function
>>>>
>>>>> and I
>>>>>>
>>>>>>> missed it.  It's in OrderPreservingHashPartitioner.  It
>>>>>>>>>>>>
>>>>>>>>>>> takes
>>>
>>>> the
>>>>>
>>>>>> approach that Todd and I discussed back in January: turn the
>>>>>>>>>>>>
>>>>>>>>>>> string
>>>>>>
>>>>>>> into a base-Char.MAX_VALUE number.
>>>>>>>>>>>> (
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>> http://groups.google.com/group/cassandra-dev/browse_thread/thread/6bda8518466210e7/f53b79c19010a9ed
>>>
>>>> ).
>>>>>>>>>>>> I chatted with Avinash a little on IM but
we didn't finish,
>>>>>>>>>>>>
>>>>>>>>>>> so
>>>>
>>>>> I'm
>>>>>>
>>>>>>> picking it up here.
>>>>>>>>>>>>
>>>>>>>>>>>> There are two problems with this approach.
 One is that the
>>>>>>>>>>>>
>>>>>>>>>>> hashes
>>>>>
>>>>>> will only be order-preserving for a subset of unicode (all
>>>>>>>>>>>>
>>>>>>>>>>> of
>>>
>>>> UCS-2,
>>>>>>
>>>>>>> but not all of UTF-16; see
>>>>>>>>>>>>
>>>>>>>>>>> http://en.wikipedia.org/wiki/UTF-16/UCS-2
>>>>>>
>>>>>>> ).
>>>>>>>>
>>>>>>>>> The other is that this only gives you a naive ordering
by
>>>>>>>>>>>>
>>>>>>>>>>> code
>>>>
>>>>> point
>>>>>>
>>>>>>> value, which for unicode is not what you want and even for
>>>>>>>>>>>>
>>>>>>>>>>> ascii
>>>>
>>>>> sometimes you want another collation.  (see
>>>>>>>>>>>> http://www.unicode.org/reports/tr10/ and
>>>>>>>>>>>>
>>>>>>>>>>>>  http://java.sun.com/javase/6/docs/api/java/text/Collator.html
>>>
>>>> ).
>>>>
>>>>>
>>>>>>>>>>>> Say for instance you have inserted keys ['a',
'b', '--a',
>>>>>>>>>>>>
>>>>>>>>>>> '--b']
>>>>
>>>>> and
>>>>>>
>>>>>>> you are going to do range queries on them.  The correct
>>>>>>>>>>>> collation-aware sort is ['a', '--a', 'b',
'--b'].  But
>>>>>>>>>>>>
>>>>>>>>>>> ordering
>>>>
>>>>> by
>>>>>
>>>>>> char value gives ['--a', '--b', 'a', 'b'].
>>>>>>>>>>>>
>>>>>>>>>>>> Switching to a more flexibile system like
the one I wrote
>>>>>>>>>>>>
>>>>>>>>>>> for
>>>
>>>> CASSANDRA-3 will let use use Token<BigInteger> for random
>>>>>>>>>>>>
>>>>>>>>>>> distribution
>>>>>>>>
>>>>>>>>> or Token<String> for order-preserving, with user-defined
>>>>>>>>>>>>
>>>>>>>>>>> collation.
>>>>>>
>>>>>>> I
>>>>>>>>
>>>>>>>>> don't see a way to get this kind of flexibility from
an
>>>>>>>>>>>>
>>>>>>>>>>> approach
>>>>
>>>>> that
>>>>>>
>>>>>>> insists on turning everything into BigInteger.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jonathan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Mar 30, 2009 at 2:16 PM, Jonathan
Ellis <
>>>>>>>>>>>>
>>>>>>>>>>> jbellis@gmail.com>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Avinash,
>>>>>>>>>>>>>
>>>>>>>>>>>>> You mentioned that you have a new order-preserving
hash
>>>>>>>>>>>>>
>>>>>>>>>>>> function
>>>>>
>>>>>> that
>>>>>>>>
>>>>>>>>> you think will be more generally useful.  Can you post
it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jonathan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message