incubator-cassandra-user mailing list archives

From Peter Schuller
Subject Re: Cassandra performance
Date Wed, 15 Sep 2010 07:56:16 GMT
> But to be honest I'm pretty disappointed that Cassandra doesn't really
> scale linearly (or "semi-linearly" :)) when adding new machines. I

It really should scale linearly for this workload unless I have missed
something important (in which case I hope someone will chime in). But
note that you added more nodes and increased replication factor at the
same time so the discrepancy you're seeing is lower than it might
first appear. I.e., you got 200/sec with one machine @ rf = 1 and
450/sec with 8 machines @ rf = 2. Given an 8x increase in machine
count and a 2x rf increase, the expectation would be 4x the read rate.
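
To make the arithmetic explicit, here is a rough sketch of that
expectation in Python. The function name and the simple model (read
capacity grows with node count and is divided by RF, since each read
touches RF replicas when read-repair is on) are assumptions made for
illustration, not anything taken from Cassandra itself.

```python
def expected_read_rate(base_rate, base_nodes, base_rf, nodes, rf):
    """Naive linear-scaling estimate of reads/sec (hypothetical helper).

    Assumes the workload is disk bound, evenly distributed, and that
    every read costs disk work on RF replicas.
    """
    node_factor = nodes / base_nodes  # 8x more machines
    rf_factor = rf / base_rf          # 2x more replicas touched per read
    return base_rate * node_factor / rf_factor


# Numbers from this thread: 200/sec on 1 node @ rf = 1, then 8 nodes @ rf = 2.
expected = expected_read_rate(200, base_nodes=1, base_rf=1, nodes=8, rf=2)
observed = 450
print(expected)             # naive expectation in reads/sec
print(observed / expected)  # fraction of the naive expectation achieved
```

So the gap to explain is 450 versus roughly 800, not 450 versus 1600.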

Why you're seeing 450 rather than something like 800, I'm not sure
(with disk access and caching, though, beware of the difficulty
of normalizing the environment when benchmarking). But whatever is
going on I don't believe you can draw the conclusion that this is due
to cassandra scaling that poorly for simple randomly distributed small

For a read, the nodes involved in servicing your requests are going to
be limited to the node you're talking to for RPC + RF number of nodes
(assuming read-repair is turned on, and that the RPC node did not
happen to be one of the nodes having the data). This really should
imply linear scaling (with respect to disk I/O, in the absence of
other bottlenecks).
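
As a back-of-the-envelope model of why this implies linear scaling,
consider the sketch below. It is a simplification under stated
assumptions: each node can sustain a fixed number of disk-bound reads
per second, reads are spread evenly, one replica always reads, and each
of the other RF - 1 replicas also reads with probability
read_repair_chance. The function and parameter names are made up for
this example and are not Cassandra APIs.

```python
def max_cluster_read_rate(node_capacity, nodes, rf, read_repair_chance=1.0):
    """Estimated client reads/sec before disks saturate (hypothetical model).

    node_capacity: disk-bound reads/sec a single node can sustain.
    read_repair_chance: fraction of reads that also hit the other replicas
    (1.0 = read-repair fully on, 0.0 = off).
    """
    # Expected number of replicas doing disk work for one client read.
    replicas_per_read = 1 + read_repair_chance * (rf - 1)
    # Total disk capacity of the cluster divided by the cost of one read.
    return nodes * node_capacity / replicas_per_read


for n in (8, 16, 32):
    # Doubling the node count doubles the estimate at a fixed RF.
    print(n, max_cluster_read_rate(200, nodes=n, rf=2))
```

In this model the node count only appears as a multiplier, which is the
linear scaling described above; RF and read-repair only change the
constant factor.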

Also, you can turn read-repair off (in 0.6) or partially off (in 0.7,
by percentage) if you are concerned with scaling with higher RFs and
a small number of nodes.

> expected that 8-machines cluster will easily beat single MySQL when
> there is much more data than RAM.

The relative performance characteristics in this case will be
significantly dependent on the type of data; it is not just about the
total amount. In particular the average row size is likely to be very
relevant. Access pattern also matters; for example, "random access"
within rows to different columns or column ranges has the potential
to be very efficient, while random access between rows doing only
a single read for each row is probably the least flattering case for
Cassandra when disk bound.

Without knowing more details it's probably difficult to offer specific
explanations for this particular case.

/ Peter Schuller
