incubator-cassandra-user mailing list archives

From Oren Benjamin <o...@clearspring.com>
Subject Cassandra benchmarking on Rackspace Cloud
Date Fri, 16 Jul 2010 23:06:50 GMT
I've been doing quite a bit of benchmarking of Cassandra in the cloud using stress.py.  I'm
working on a comprehensive spreadsheet of results with a template that others can add to,
but for now I thought I'd post some of the basic results here to get some feedback from others.

The first goal was to reproduce the test described on spyced here: http://spyced.blogspot.com/2010/01/cassandra-05.html

Using Cassandra 0.6.3, a 4GB/160GB cloud server (http://www.rackspacecloud.com/cloud_hosting_products/servers/pricing)
with default storage-conf.xml and cassandra.in.sh, here's what I got:

Reads: 4,800/s
Writes: 9,000/s

Pretty close to the result posted on the blog, with slightly lower write performance (perhaps
because only a single disk is available for both commitlog and data).

That was with 1M keys (the blog used 700K).
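
For reference, the write and read passes were run roughly like this (flag names from
the 0.6-era contrib/py_stress -- check stress.py --help against your build):

   # insert 1M keys with 50 client threads, then read them back
   python stress.py -d <node-ip> -o insert -n 1000000 -t 50
   python stress.py -d <node-ip> -o read -n 1000000 -t 50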

As the number of keys scales up, read performance degrades, as would be expected with no caching:
1M keys     4,800 reads/s
10M keys    4,600 reads/s
25M keys      700 reads/s
100M keys     200 reads/s

Using row cache and an appropriate choice of --stdev to achieve a cache hit rate of >90%
restores the read performance to the 4,800 reads/s level in all cases.  Also as expected,
write performance is unaffected by writing more data.
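
The row cache here is just the per-ColumnFamily attribute in storage-conf.xml --
something like the following (attribute names as I remember them from the 0.6
sample config, so verify against yours):

   <!-- cache enough rows to cover the hot part of the key space -->
   <ColumnFamily Name="Standard1" CompareWith="BytesType"
                 RowsCached="1000000" KeysCached="0"/>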

Scaling:
The above was single node testing.  I'd expect to be able to add nodes and scale throughput.
Unfortunately, I seem to be running into a cap of 21,000 reads/s regardless of the number
of nodes in the cluster.  In order to better understand this, I eliminated the factors of
data size, caching, replication, etc. and ran read tests on empty clusters (every read a miss
- bouncing off the bloom filter and straight back).  1 node gives 24,000 reads/s while 2, 3, 4...
nodes give 21,000 (presumably the bump in single-node performance is due to the lack of the extra
hop).  With CPU, disk, and RAM all largely unused, I'm at a loss to explain the lack of additional
throughput.  I tried increasing the number of clients, but that just split the throughput down
the middle, with each stress.py achieving roughly 10,000 reads/s.  I'm running the clients
(stress.py) on separate cloud servers.
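
For concreteness, each client box runs its own copy of the same command against the full
node list (assuming -d takes a comma-separated list of hosts, as the 0.6 contrib stress.py
does; hostnames below are placeholders):

   # run on each client machine against the empty 4-node cluster
   python stress.py -d node1,node2,node3,node4 -o read -n 1000000 -t 50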

I checked the ulimit file count and I'm not limiting connections there.  It seems like there's
a problem with my test setup - a clear bottleneck somewhere, but I just don't see what it
is.  Any ideas?

Also: 
The disk performance of the cloud servers has been extremely spotty.  The results I posted
above were reproducible whenever the servers were in their "normal" state.  But for periods
of as much as several consecutive hours, single servers or groups of servers in the cloud
would suddenly have horrendous disk performance as measured by dstat and iostat.  The "%steal"
by the hypervisor on these nodes is also quite high (>30%).  The performance during these
"bad" periods drops from 4,800 reads/s in the single-node benchmark to just 200 reads/s.  The
node is effectively useless.  Is this normal for the cloud?  And if so, what's the solution
re Cassandra?  Obviously you can just keep adding more nodes until the likelihood that there
is at least one good server with every piece of the data is reasonable.  However, Cassandra
routes to the nearest node topologically and not to the best-performing one, so "bad" nodes
will always result in high-latency reads.  How are you guys who are running in the cloud
dealing with this?  Are you seeing this at all?
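
(For anyone who wants to check their own nodes: steal time shows up directly in the
avg-cpu line of iostat, and in the "st" column of vmstat, e.g.

   # extended device stats plus avg-cpu including %steal, every 5 seconds
   iostat -x 5

so it's easy to spot when a box has slipped into one of these "bad" periods.)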

Thanks in advance for your feedback and advice,

   -- Oren