cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Ellis <>
Subject Re: OPHF vs. Random
Date Mon, 16 Mar 2009 21:38:14 GMT
I think that key -> endpoint might still be simpler long term but
short term there is far too much code that depends on being able to
compare both nodes and keys transformed to tokens.  Previously token
was hardcoded to be BigInteger but I introduced the abstraction
Token<T> defining compareTo(Token), so you can have Token<BigInteger>
as well as Token<String>.  The OrderPreservingPartitioner then uses
Token<String> to do lexicographic comparisons.


On Mon, Mar 16, 2009 at 3:30 PM, Sandeep Tata <> wrote:
> I like the idea of supporting more general/sophisticated strategies.
> Let me see if I understand the issues at play here:
> OPHFs are tricky to design and leaning on it for load-balancing and
> data locality will require incredibly good OPHFs that might not exist.
> (I learned this with a bunch of the experiments we ran on our
> relatively small test cluster)
> RANDOM of course is going to be great for load-balancing, but we're
> completely giving up locality, so range queries are shot.
> If we want to support clever placement strategies, we'll need to make
> some changes. Take for instance a key like "userid.messageid" . I
> might want:
> a) all the keys with the same userid on the same node, and
> b) all the messageids stored in order so I can do simple range queries
> like "get message 1 to 100"
> OPHF might break a) and RANDOM will break b)
> The claim is that the simplest (best :-) ) way to guarantee a) and b)
> is to map the key to an end-point instead of merely an integer.
> What if I changed the hash-function to do RANDOM on just the "userid"
> part. And each node still stores the keys in "<" order on the entire
> key ("userid.messageid") Would this solve the problem? What is this
> approach missing?
> Do we just need to decouple the hash used for routing from the key
> used in the end-point for storage? Is this essentially what the series
> of patches does?
> On Mon, Mar 16, 2009 at 1:36 PM, Jonathan Ellis <> wrote:
>> The order-preserving partitioner code (not hash-based anymore) is up
>> now at
>> -Jonathan
>> On Wed, Mar 11, 2009 at 6:48 PM, Jonathan Ellis <> wrote:
>>> Use Random for now.  The OPHF is the same as the old one, i.e., not
>>> actually OP. :)
>>> I'm pretty convinced at this point that it's impossible to have an
>>> order-preserving hash that doesn't either (a) impose a relatively
>>> short key length past which no partitioning is done (i.e., all keys w/
>>> the same prefix go to the same node) or is (b) very sensitive to key
>>> length such that the keys with a given length N will not be evenly
>>> distributed across all nodes. Or both.
>>> So I am working on migrating from pluggable hash functions key ->
>>> BigInteger, to pluggable partitioning algorithms key -> EndPoint.
>>> Without the requirement to transform to a numeric value first I think
>>> I can create an order-preserving distribution that performs well.  (I
>>> need this for range queries.)
>>> So far I have just laid the foundation, here:
>>> I hope to finish the rest tomorrow.
>>> -Jonathan
>>> On Wed, Mar 11, 2009 at 5:28 PM, Jiansheng Huang <>
>>>> Which one is better to use? The default is Random.
>>>> In Avinash's annoucement mail, we have
>>>> (1) Ability to switch between a random hash and a OPHF. We still have the
>>>> old (wrong) OPHF in there. I will update it to the corrected one tomorrow.
>>>> Is correct OPHF in? Thanks.

View raw message