On Nov 6, 2009, at 2:35 PM, Mark Robson wrote:

2009/11/6 Joe Stump <joe@joestump.net>

Can you explain what you mean by lack of load balancing?


Nothing in Cassandra attempts to ensure that your data are equally spread over the different nodes (yet; there are several bugs open to this effect).

That's not true, from my understanding: it won't put all three replicas of a row on the same node. The key word, I suppose, is "equally".
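To make that concrete, here's a toy ring-walk placement routine (my own sketch, not Cassandra's actual code; the real logic lives in its replication strategies): the key's token picks a starting node on the sorted ring, and the remaining replicas go to the next distinct nodes clockwise, so copies never double up on one box.

    import java.util.*;

    public class ReplicaPlacement {
        // Pick up to n distinct replica nodes by walking the sorted token
        // ring clockwise from the key's position, wrapping at the end.
        static List<String> replicasFor(long keyToken, TreeMap<Long, String> ring, int n) {
            List<String> replicas = new ArrayList<String>();
            Long t = ring.ceilingKey(keyToken);
            if (t == null) t = ring.firstKey();          // wrap around the ring
            while (replicas.size() < n && replicas.size() < ring.size()) {
                replicas.add(ring.get(t));
                t = ring.higherKey(t);
                if (t == null) t = ring.firstKey();      // wrap again if needed
            }
            return replicas;
        }

        public static void main(String[] args) {
            TreeMap<Long, String> ring = new TreeMap<Long, String>();
            ring.put(100L, "A"); ring.put(200L, "B");
            ring.put(300L, "C"); ring.put(400L, "D");
            System.out.println(replicasFor(250L, ring, 3)); // prints [C, D, A]
        }
    }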

If you use the OrderPreservingPartitioner, in all likelihood your data will be very unevenly spread, to the point where most of your servers aren't used at all. This obviously doesn't scale.
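To see why, here's a toy illustration (my own code, nothing from Cassandra): four nodes split the key space alphabetically with evenly spaced string "tokens", but the application's keys all share a common prefix, so one node ends up with everything.

    import java.util.*;

    public class OrderSkewDemo {
        public static void main(String[] args) {
            // A key belongs to the first node whose token is >= the key.
            String[] tokens = { "g", "n", "t", "~" };
            int[] perNode = new int[tokens.length];
            Random rnd = new Random(42);
            for (int i = 0; i < 100000; i++) {
                String key = "user_" + rnd.nextInt(1000000); // realistic keys cluster
                int node = 0;
                while (node < tokens.length - 1 && key.compareTo(tokens[node]) > 0) node++;
                perNode[node]++;
            }
            // Prints [0, 0, 0, 100000]: every key starts with "u", so one
            // node takes all the data and the other three sit idle.
            System.out.println(Arrays.toString(perNode));
        }
    }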

The RandomPartitioner is better because hashing the keys spreads data around the ring, but the node tokens are still chosen randomly, so there's no way to guarantee that machines get equal, or even similar, amounts of data.
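A quick simulation shows the effect (again my own toy code, using a unit-interval ring rather than Cassandra's real MD5 token space): even with perfectly uniform hashing, randomly chosen node tokens leave some nodes owning several times more of the ring than others.

    import java.util.*;

    public class RandomTokenSkew {
        public static void main(String[] args) {
            // Place 8 nodes at random positions on a 0..1 ring; each node
            // owns the arc from the previous token up to its own.
            int nodes = 8;
            Random rnd = new Random(1);
            double[] tokens = new double[nodes];
            for (int i = 0; i < nodes; i++) tokens[i] = rnd.nextDouble();
            Arrays.sort(tokens);
            for (int i = 0; i < nodes; i++) {
                double prev = (i == 0) ? tokens[nodes - 1] - 1.0 : tokens[i - 1];
                // The ideal share is 12.5%; random placement routinely
                // produces shares that differ by several times.
                System.out.printf("node %d owns %.1f%% of the ring%n", i, (tokens[i] - prev) * 100);
            }
        }
    }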

We've answered this by creating our own partitioners, which Cassandra makes pluggable; it took one of our guys about two full days to have something up and running. Besides, for the most part there's no way to guarantee anything in distributed computing.
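For a flavor of what "pluggable" means, here's a simplified sketch (the interface below is my own stand-in; Cassandra's real IPartitioner interface has more methods and different types). A partitioner just maps a key to a ring token, so you can, for example, hash only the part of the key before a separator to keep one user's rows together while still spreading users across nodes.

    import java.math.BigInteger;
    import java.security.MessageDigest;

    // Stand-in for the real interface: a partitioner maps a key to a token.
    interface Partitioner {
        BigInteger tokenFor(String key);
    }

    // Hash only the portion of the key before ':' so all of one user's rows
    // land on the same replicas, while distinct users spread across the ring.
    class UserPrefixPartitioner implements Partitioner {
        public BigInteger tokenFor(String key) {
            try {
                int sep = key.indexOf(':');
                String prefix = (sep >= 0) ? key.substring(0, sep) : key;
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                return new BigInteger(1, md5.digest(prefix.getBytes("UTF-8")));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }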

I think you're misleading people, though, with the notion that (a) Cassandra doesn't have load balancing (it does, in many ways) and (b) it doesn't scale. Digg and Facebook both use it in production, and while it might not be battle-hardened and fully tested, it's definitely working well for them under high load.

--Joe