2009/11/6 Joe Stump <joe@joestump.net>
On Nov 6, 2009, at 2:35 PM, Mark Robson wrote:

2009/11/6 Joe Stump <joe@joestump.net>

Can you explain what you mean by lack of load balancing?

Nothing in Cassandra attempts to ensure that your data are equally spread over the different nodes (yet; there are several bugs open to this effect).

That's not true from my understanding. It won't put three copies on the same node. The key word, I suppose, is "equally". 

The three copies will generally be on the three nodes sequentially in the ring, starting at the one nearest to the key.

However, if your keys range from, say, 0000 to 004f, and your nodes have tokens 0, 2, 4, 6, 8, a, c and e, then you won't get an even distribution. Instead, all the data will sit on the first three nodes, with the others completely empty.
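To make that concrete, here is a rough Python simulation of the placement. It is only a sketch under assumptions: it models an order-preserving partitioner where a key belongs to the first node whose token is greater than or equal to it, with the other two copies on the next nodes around the ring. The names (replica_nodes, load_per_node) are made up for illustration and are not Cassandra APIs, and exactly which three nodes end up loaded depends on that ownership convention.

```python
from bisect import bisect_left

# Hypothetical ring matching the example: eight nodes, one hex-digit token each.
TOKENS = ["0", "2", "4", "6", "8", "a", "c", "e"]
REPLICAS = 3


def replica_nodes(key, tokens=TOKENS, replicas=REPLICAS):
    """Return the tokens of the nodes that would hold `key`.

    Assumption: keys and tokens compare as plain strings, and the primary
    replica is the first node whose token is >= key (wrapping around).
    """
    start = bisect_left(tokens, key) % len(tokens)
    return [tokens[(start + i) % len(tokens)] for i in range(replicas)]


def load_per_node(keys, tokens=TOKENS, replicas=REPLICAS):
    """Count how many copies each node stores for the given keys."""
    load = {token: 0 for token in tokens}
    for key in keys:
        for token in replica_nodes(key, tokens, replicas):
            load[token] += 1
    return load


# Keys 0000 .. 004f, i.e. 80 keys packed into one corner of the key space.
keys = [format(i, "04x") for i in range(0x50)]
for token, count in load_per_node(keys).items():
    print(token, count)
```

Under these assumptions every key lands on the same three adjacent nodes, and the other five nodes store nothing.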

Cassandra doesn't know to space the tokens evenly throughout the key space. It also won't change the token of an existing node (Bootstrap can insert new nodes into the ring and copy / prune the data as necessary, which is a Good Thing).

You *can* manually assign the tokens and that can be used as a work-around, if you know what the distribution of your tokens is or is likely to be.
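As a sketch of that work-around: if you know your keys fall in a fixed range of fixed-width hex strings, you can compute evenly spaced tokens for that range and assign one to each node. The function name and the fixed-width hex assumption are mine, purely for illustration, and how you actually hand the token to each node depends on your Cassandra version and configuration.

```python
def even_tokens(low, high, node_count, width=4):
    """Split the key range (low, high] into node_count equal slices and
    return the upper bound of each slice as a fixed-width hex token.

    Assumption: keys are fixed-width lowercase hex strings compared as text.
    """
    span = high - low
    return [
        format(low + (i + 1) * span // node_count, "0{}x".format(width))
        for i in range(node_count)
    ]


# Eight nodes sharing the range 0000 .. 0050 from the example.
print(even_tokens(0x0000, 0x0050, 8))
# ['000a', '0014', '001e', '0028', '0032', '003c', '0046', '0050']
```

This only helps while the real keys keep following the distribution you assumed; if the key range drifts, the tokens have to be recomputed and reassigned by hand.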

You can also construct your keys carefully so that they are likely to be spread evenly between the tokens (e.g. by using a hash of something for the first part of your key).
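A minimal sketch of the hashed-prefix idea, assuming MD5 as the hash and a ":" separator (both arbitrary choices of mine, and spread_key is a made-up helper, not anything Cassandra provides):

```python
import hashlib


def spread_key(natural_key, prefix_len=4):
    """Prefix the key with the first hex digits of its MD5 digest so that
    keys sort roughly uniformly across the hex key space."""
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + ":" + natural_key


# Sequential natural keys end up scattered over all leading hex digits.
sample = [spread_key("user{}".format(i)) for i in range(1000)]
print(sorted({key[0] for key in sample}))
```

The catch is that the stored keys no longer sort in their natural order, so range scans over the original key order stop being useful.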

Other clustered databases (e.g. Hadoop-based things possibly?) split the data into chunks which then get distributed among the nodes on some load-balanced basis; Cassandra does not do this yet.