incubator-cassandra-user mailing list archives

From Rishi Bhardwaj <>
Subject Re: Cassandra Write Performance, CPU usage
Date Fri, 11 Jun 2010 04:05:45 GMT
Hi Jonathan

Thanks for such an informative reply. My application may end up doing continuous bulk
writes to Cassandra, which is why I am interested in this performance case. What are the
CPU overheads for each row/column written to Cassandra? You mentioned updating of bloom
filters; would that be the main CPU overhead, or is data also being copied? I want to
investigate all the factors in play here and see whether there is a possibility for
improvement. Is it possible to profile Cassandra and see what the bottleneck may be?
As for the auxiliary I/O you mentioned for the bloom filters, wouldn't that occur along
with the I/O for the SSTable, in which case the extra I/O for the bloom filter gets
piggybacked on the SSTable I/O? I guess I don't understand the Cassandra internals too well,
but I wanted to see how much Cassandra can achieve for continuous bulk writes.

Has anyone done any bulk write experiments with Cassandra? Is Cassandra performance always
expected to be bottlenecked by CPU when doing continuous bulk writes?

Thanks for all the help,

From: Jonathan Shook <>
Sent: Thu, June 10, 2010 7:39:24 PM
Subject: Re: Cassandra Write Performance, CPU usage

You are testing Cassandra in a way that it was not designed to be used.
Bandwidth to disk is not a meaningful example for nearly anything
except for filesystem benchmarking and things very nearly the same as
filesystem benchmarking.
Unless the usage patterns of your application match your test data,
there is not a good reason to expect a strong correlation between this
test and actual performance.

Cassandra is not simply shuffling data through IO when you write.
There are calculations that have to be done as writes filter their way
through various stages of processing. The point of this is to minimize
the overall effort Cassandra has to make in order to retrieve the data
again. One example would be bloom filters. Each column that is written
requires bloom filter processing and potentially auxiliary IO. Some of
these steps are allowed to happen in the background, but if you try,
you can cause them to stack up on top of the available CPU and memory.

In such a case (continuous bulk writes), you are causing all of these
costs to be taken in more of a synchronous (not delayed) fashion. You
are not allowing the background processing that helps reduce client
blocking (by deferring some processing) to do its magic.
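
To make the per-column cost mentioned above concrete, here is a minimal sketch of a
generic bloom filter. It is not Cassandra's implementation: the class name, the sizes
(1024 bits, 3 hashes), and the use of SHA-256 are all assumptions chosen for illustration.
The point is only that every key added costs several hash computations plus bit updates,
which is CPU work on top of the raw I/O.

```python
import hashlib


class BloomFilter:
    """Illustrative bloom filter; sizes and hash choice are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # One hash per seed; each one is CPU work paid on every write.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bf = BloomFilter()
for name in (b"col-1", b"col-2", b"col-3"):
    bf.add(name)
```

A reader can then skip any file whose filter answers False for a key, which is the
retrieval saving the write-time work pays for.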

On Thu, Jun 10, 2010 at 7:42 PM, Rishi Bhardwaj <> wrote:
> Hi
> I am investigating Cassandra write performance and see very heavy CPU usage
> from Cassandra. I have a single node Cassandra instance running on a dual
> core (2.66 GHz Intel) Ubuntu 9.10 server. The writes to Cassandra are being
> generated from the same server using BatchMutate(). The client makes exactly
> one RPC call at a time to Cassandra. Each BatchMutate() RPC contains 2 MB of
> data and once it is acknowledged by Cassandra, the next RPC is done.
> Cassandra has two separate disks, one for commitlog with a sequential b/w of
> 130MBps and the other a solid state disk for data with b/w of 90MBps. Tuning
> various parameters, I observe that I am able to attain a maximum write
> performance of about 45 to 50 MBps from Cassandra. I see that the Cassandra
> java process consistently uses 100% to 150% of CPU resources (as shown by
> top) during the entire write operation. Also, iostat clearly shows that the
> max disk bandwidth is never reached during the write operation. Every
> now and then the I/O activity on the "commitlog" disk and the data disk spikes,
> but Cassandra never sustains it close to the disks' peak bandwidth. I
> would imagine that the CPU is probably the bottleneck here. Does anyone have
> any idea why Cassandra beats the heck out of the CPU here? Any suggestions
> on how to go about finding the exact bottleneck here?
> Some more information about the writes: I have 2 column families, the data
> though is mostly written in one column family with column sizes of around
> 32k and each row having around 256 or 512 columns. I would really appreciate
> any help here.
> Thanks,
> Rishi
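
For scale, the figures in the quoted message can be turned into per-column rates. This
is a rough back-of-the-envelope sketch, assuming 1 MB = 1024 KB, columns of exactly 32 KB,
and no overhead for row keys, column names, or RPC framing; the variable and function
names are made up for the example.

```python
KB = 1024
MB = 1024 * KB

batch_bytes = 2 * MB      # one BatchMutate() RPC, per the quoted message
column_bytes = 32 * KB    # approximate column size, per the quoted message

# Columns carried by each RPC.
columns_per_batch = batch_bytes // column_bytes


def columns_per_second(throughput_mb_per_s):
    """Columns written per second at a given payload throughput."""
    return throughput_mb_per_s * MB / column_bytes


def batches_per_second(throughput_mb_per_s):
    """RPC round trips per second at a given payload throughput."""
    return throughput_mb_per_s * MB / batch_bytes


low_rate = columns_per_second(45)
high_rate = columns_per_second(50)

# Share of each disk's stated sequential bandwidth used at 50 MBps.
data_disk_share = 50 / 90
commitlog_disk_share = 50 / 130

print(columns_per_batch, low_rate, high_rate)
print(round(data_disk_share, 2), round(commitlog_disk_share, 2))
```

Under these assumptions the observed 45 to 50 MBps corresponds to roughly 1,440 to 1,600
columns per second, with both disks well below their stated bandwidth.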
