cassandra-user mailing list archives

From Stephen Hamer <stephen.ha...@gmail.com>
Subject TimeOutExceptions and Cluster Performance
Date Sat, 13 Feb 2010 01:40:19 GMT
Hi,
I'm running a 5 node Cassandra cluster and am having a very tough time
getting reasonable performance from it. Many of the requests are failing
with TimeOutException. This is making it difficult to use Cassandra in a
production setting.

The cluster was running fine for a week or two (it was created 3 weeks ago)
but has started to degrade in the last week. The cluster was originally only
3 nodes but when performance started to degrade I added another two nodes.
This doesn't seem to have helped though.

Requests from my application are balanced across the cluster in a round-robin
fashion. Many of these requests are failing with TimeOutException. When this
occurs I can look at the DB servers and see several of them fully utilizing
one core. Even if I turn off my application while this is going on (which
stops all reads and writes to Cassandra), the cluster stays in this state for
several more hours before returning to a resting state.
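For reference, the client-side balancing is nothing fancier than cycling through the node list. A minimal sketch of the idea (the host list is the five nodes from the ring output below; `next_host` is my illustration, not the actual client code):

```shell
# Sketch of the round-robin host selection my client does.
# next_host is illustrative only, not the real client code.
HOSTS="10.254.55.191 10.214.119.127 10.215.122.208 10.215.30.47 10.208.246.160"
i=0
next_host() {
    n=$(echo "$HOSTS" | wc -w)          # number of nodes (5)
    idx=$(( (i % n) + 1 ))              # 1-based index into the list
    i=$((i + 1))
    echo "$HOSTS" | tr ' ' '\n' | sed -n "${idx}p"
}
next_host   # -> 10.254.55.191
next_host   # -> 10.214.119.127
```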

When the CPU is loaded I see lots of messages about en-queuing, sorting, and
writing memtables, so I have tried lowering the memtable size to 16MB and
raising MemtableFlushAfterMinutes to 1440. Neither change seems to have had
any effect.
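As a sanity check that the running nodes actually picked up these changes, I have been grepping the deployed config on each box (a sketch; the `~/cassandra/conf` path and passwordless ssh access are assumptions about my setup):

```shell
# Sketch: confirm each node's deployed storage-conf.xml has the values I set.
# Assumes the config lives at ~/cassandra/conf and ssh keys are in place.
for host in 10.254.55.191 10.214.119.127 10.215.122.208 10.215.30.47 10.208.246.160; do
    echo "== $host =="
    ssh -o BatchMode=yes -o ConnectTimeout=2 "$host" \
        'grep -E "MemtableSizeInMB|MemtableFlushAfterMinutes" ~/cassandra/conf/storage-conf.xml' \
        2>/dev/null || echo "(could not read config)"
done
```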

I was seeing errors about too many open file descriptors, so I added
"ulimit -n 32768" to cassandra.in.sh; this seems to have fixed that. I was
also seeing lots of out-of-memory exceptions, so I raised the heap size to
4GB. This has helped but has not eliminated the OOM issues.
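To confirm the new limit is actually in effect for the daemon (and not just for my login shell), I check the live process like this (a sketch; it assumes a single CassandraDaemon java process and a Linux /proc layout):

```shell
# Sketch: verify the fd limit took effect and see how many fds are in use.
# Assumes one CassandraDaemon process and a Linux /proc filesystem.
echo "soft fd limit in this shell: $(ulimit -Sn)"
pid=$(pgrep -f CassandraDaemon | head -n 1)
if [ -n "$pid" ]; then
    echo "open fds for pid $pid: $(ls /proc/"$pid"/fd | wc -l)"
else
    echo "(no CassandraDaemon process found)"
fi
```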

I'm not sure if it's related to any of the performance issues but I see lots
of log entries about DigestMismatchExceptions. I've included a sample of the
exceptions below.

My Cassandra cluster is almost unusable in its current state because of the
number of timeout exceptions I'm seeing. I suspect this is caused by a
configuration error or something I have set up improperly. It feels like the
database has entered a bad state that is causing it to churn this much, but I
have no way to verify this.
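The closest I've come to verifying it is polling every node while the churn is happening (a sketch; `info` is the only nodeprobe subcommand I'm relying on, and it's in this 0.5 build):

```shell
# Sketch: poll each node for basic state while the cluster is churning.
# Only relies on "bin/nodeprobe -host <ip> info" from the 0.5 tree.
for host in 10.254.55.191 10.214.119.127 10.215.122.208 10.215.30.47 10.208.246.160; do
    echo "== $host =="
    bin/nodeprobe -host "$host" info 2>/dev/null || echo "(node not reachable)"
done
```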

What steps can I take to address the performance issues I am seeing and the
consistent stream of TimeOutExceptions?

Thanks,
Stephen


Here are some specifics about the cluster configuration:

5 Large EC2 instances - 7.5 GB RAM, 2 cores (64-bit, 1-1.2 GHz), data and
commit logs stored on separate EBS volumes. Boxes are running Debian 5.

root@prod-cassandra4 ~/cassandra # bin/nodeprobe -host localhost ring
Address         Status  Load      Range                                       Ring
                                  101279862673517536112907910111793343978
10.254.55.191   Up      2.94 GB   27246729060092122727944947571993545        |<--|
10.214.119.127  Up      3.67 GB   34209800341332764076889844611182786881     |   ^
10.215.122.208  Up      11.86 GB  42649376116143870288751410571644302377     v   |
10.215.30.47    Up      6.37 GB   81374929113514034361049243620869663203     |   ^
10.208.246.160  Up      5.15 GB   101279862673517536112907910111793343978    |-->|


I am running the 0.5 release of Cassandra (at commit 44e8c2e...). Here are
some of my configuration options:

Memory, disk, performance section of storage-conf.xml (I've only included
options that I've changed from the defaults):
<Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
<ReplicationFactor>3</ReplicationFactor>

<SlicedBufferSizeInKB>512</SlicedBufferSizeInKB>
<FlushDataBufferSizeInMB>64</FlushDataBufferSizeInMB>
<FlushIndexBufferSizeInMB>16</FlushIndexBufferSizeInMB>
<ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
<MemtableSizeInMB>16</MemtableSizeInMB>
<MemtableObjectCountInMillions>0.1</MemtableObjectCountInMillions>
<MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>
<ConcurrentReads>8</ConcurrentReads>
<ConcurrentWrites>32</ConcurrentWrites>
<CommitLogSync>periodic</CommitLogSync>
<CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
<GCGraceSeconds>864000</GCGraceSeconds>
<BinaryMemtableSizeInMB>128</BinaryMemtableSizeInMB>


interesting bits of cassandra.in.sh:
ulimit -n 32768
JVM_OPTS=" \
        -ea \
        -Xdebug \
        -Xrunjdwp:transport=dt_socket,server=y,address=8888,suspend=n \
        -Xms512M \
        -Xmx4G \
        -XX:SurvivorRatio=8 \
        -XX:TargetSurvivorRatio=90 \
        -XX:+AggressiveOpts \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:SurvivorRatio=128 \
        -XX:MaxTenuringThreshold=0 \
        -Dcom.sun.management.jmxremote.port=8080 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"
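Since the heap is clearly under pressure, I'm also planning to add GC logging so I can see whether the churn lines up with CMS activity. These are stock HotSpot flags, nothing Cassandra-specific (the log path is my own choice), to be appended to JVM_OPTS:

```
        -verbose:gc \
        -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps \
        -Xloggc:/var/log/cassandra/gc.log \
```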


Sample of DigestMismatchExceptions:
INFO - DigestMismatchException: Mismatch for key 289
(7d820ce745ace086c82270ed05218d97 vs d41d8cd98f00b204e9800998ecf8427e)
INFO - DigestMismatchException: Mismatch for key 289
(f08c5ee5e159db5d482486e26fe8a549 vs d41d8cd98f00b204e9800998ecf8427e)
INFO - DigestMismatchException: Mismatch for key 289
(f4ab9d859f03a0416a78a1cf5a94a701 vs d41d8cd98f00b204e9800998ecf8427e)
INFO - DigestMismatchException: Mismatch for key 289
(a9d631de3f56f135918b008c186b75ac vs d41d8cd98f00b204e9800998ecf8427e)
INFO - DigestMismatchException: Mismatch for key 289
(33fdf46e0a897b19da204584e42b5d43 vs d41d8cd98f00b204e9800998ecf8427e)
