incubator-cassandra-user mailing list archives

From Mark Robson <mar...@gmail.com>
Subject Re: using cassandra as a real time DW
Date Fri, 06 Nov 2009 22:14:36 GMT
2009/11/6 Joe Stump <joe@joestump.net>

> On Nov 6, 2009, at 2:35 PM, Mark Robson wrote:
>
> 2009/11/6 Joe Stump <joe@joestump.net>
>
>>
>> Can you explain what you mean by lack of load balancing?
>>
>
>
> Nothing in Cassandra attempts to ensure that your data are equally spread
> over the different nodes (yet; there are several bugs open to this effect).
>
>
> That's not true from my understanding. It won't put three copies on the
> same node. The key word, I suppose, is "equally".
>

The three copies will generally be on the three nodes sequentially in the
ring, starting at the one nearest to the key.

However, if you have a range of keys that goes from 0000 to 004f, say, and
your nodes have tokens 0, 2, 4, 6, 8, a, c and e, then you won't get an even
distribution. Instead, all the data will sit entirely on the first three
nodes, with the others completely empty.
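
A minimal sketch of this effect, not Cassandra's actual code. It assumes an
order-preserving comparison of keys and tokens as plain strings, and a
hypothetical placement rule in which the first node whose token is at or
after the key owns it, with the other two copies on the following nodes.
Which three nodes end up loaded depends on the exact ownership rule; the
point is only that three of the eight hold everything.

```python
from bisect import bisect_left

# Node tokens and replica count from the example above.
TOKENS = ["0", "2", "4", "6", "8", "a", "c", "e"]
REPLICAS = 3


def replica_nodes(key, tokens=TOKENS, replicas=REPLICAS):
    """Tokens of the nodes that would hold `key` under the assumed rule."""
    # Assumption: the first node whose token is >= key owns it, wrapping
    # around the ring; the remaining copies go to the nodes that follow.
    start = bisect_left(tokens, key) % len(tokens)
    return [tokens[(start + i) % len(tokens)] for i in range(replicas)]


def load_per_node(keys, tokens=TOKENS):
    """Count how many copies of the given keys land on each node."""
    load = {token: 0 for token in tokens}
    for key in keys:
        for token in replica_nodes(key, tokens):
            load[token] += 1
    return load


keys = [format(i, "04x") for i in range(0x50)]  # "0000" .. "004f"
for token, count in load_per_node(keys).items():
    print(token, count)
```

Every key in the range sorts between two adjacent tokens, so all of them
share the same three replica nodes and the other five receive nothing.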

It doesn't know to space the tokens evenly throughout the key space. It also
won't change the token of an existing node (Bootstrap can insert new nodes
into the ring and copy / prune the data as necessary, which is a Good
Thing).

You *can* manually assign the tokens, and that can be used as a work-around
if you know what the distribution of your keys is or is likely to be.
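
As a rough illustration of that work-around, the sketch below picks tokens
that split a known key range evenly. The helper names are hypothetical, keys
are assumed to be fixed-width lowercase hex strings, and the ownership rule
(first token at or after the key) is an assumption rather than a statement
of how Cassandra places data.

```python
from bisect import bisect_left


def evenly_spaced_tokens(lowest, highest, node_count):
    """Tokens that split the integer key range [lowest, highest] evenly."""
    span = highest - lowest + 1
    return [format(lowest + span * (i + 1) // node_count - 1, "04x")
            for i in range(node_count)]


def primary_node(key, tokens):
    """Token of the node assumed to own `key`."""
    # Assumption: the first node whose token is >= key owns it, wrapping.
    return tokens[bisect_left(tokens, key) % len(tokens)]


tokens = evenly_spaced_tokens(0x0000, 0x004F, 8)
keys = [format(i, "04x") for i in range(0x50)]  # "0000" .. "004f"

owned = {token: 0 for token in tokens}
for key in keys:
    owned[primary_node(key, tokens)] += 1
print(owned)
```

With tokens chosen this way, each of the eight nodes is the primary owner of
ten of the eighty keys. This only works while the real keys stay within the
range the tokens were computed for.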

You can also construct your keys carefully so that they are likely to be
spread evenly across the token ranges (e.g. by using a hash of something for
the first part of your key).
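
A sketch of the hashed-prefix idea, under the same assumed ownership rule as
before (first token at or after the key, compared as strings). The function
names, the choice of SHA-256, the four-character prefix and the "user%d"
keys are all illustrative assumptions, not anything Cassandra prescribes.

```python
import hashlib
from bisect import bisect_left

# Node tokens from the earlier example.
TOKENS = ["0", "2", "4", "6", "8", "a", "c", "e"]


def hashed_key(natural_key, prefix_len=4):
    """Prefix the key with the first hex digits of its hash."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + ":" + natural_key


def primary_node(key, tokens=TOKENS):
    """Token of the node assumed to own `key`."""
    # Assumption: the first node whose token is >= key owns it, wrapping.
    return tokens[bisect_left(tokens, key) % len(tokens)]


owned = {token: 0 for token in TOKENS}
for i in range(2000):
    owned[primary_node(hashed_key("user%d" % i))] += 1
print(owned)
```

Because the leading hex digits of the hash are close to uniformly
distributed, the keys spread over all eight nodes instead of clustering
between two tokens. The trade-off is that keys no longer sort in their
natural order.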

Other clustered databases (e.g. Hadoop-based things possibly?) split the
data into chunks which then get distributed among the nodes on some
load-balanced basis; Cassandra does not do this yet.

Mark
