Just guessing here, but your memtable settings are very high compared to the defaults (64MB and 0.3 million). It may be that flushing 1GB of memtable data is taking a while.

You may also want to look at the FlushDataBufferSizeInMB and FlushIndexBufferSizeInMB settings in storage-conf.xml.
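For reference, those settings sit at the top level of storage-conf.xml; a rough sketch is below (the values are only what I believe the shipped 0.6 defaults to be, so double-check against your own file):

  <Storage>
    ...
    <!-- buffer size used when flushing memtable data to an SSTable -->
    <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
    <!-- buffer size used when writing the SSTable index during a flush -->
    <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
    ...
  </Storage>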

Is the machine swapping at the same time? Have you tried using standard IO?
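Switching to standard IO is done in the same file. Roughly (from memory, so treat this as a sketch rather than gospel):

  <!-- "auto" generally resolves to mmap on a 64-bit JVM; "standard" forces
       plain buffered reads instead of memory-mapping the data files -->
  <DiskAccessMode>standard</DiskAccessMode>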

I could not access the file you linked to. 

Aaron



On 04 Aug, 2010, at 02:46 PM, Mingfan Lu <mingfan.lu@gmail.com> wrote:

Hi,
I have a 4-node Cassandra cluster. I find that when the 4 nodes flush
their memtables and run GC at around the same moment, throughput drops
and latency increases rapidly, and the nodes are repeatedly marked dead
and then up again ...
You can download the IOPS variance of the data disk (sda here) and the
system logs of these nodes from
http://docs.google.com/leaf?id=0ByKuS81H5x1VYThjOWQxMTQtMzEzMC00NDJiLTlhYWEtNzBjYzFmYTI3ZTk2&sort=name&layout=list&num=50
(if you can't download it, just tell me).
What happened to the cluster?
How could I avoid such scenario?
* Storage configuration
All of the nodes act as seed nodes.
The random partitioner is used, so the data is evenly distributed across
the 4 nodes.
memtable thresholds (see the storage-conf.xml sketch below):
DiskAccessMode: auto (mmap in practice)
MemtableThroughputInMB: 1024
MemtableOperationsInMillions: 7
MemtableFlushAfterMinutes: 1440
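As set in storage-conf.xml (element names as in 0.6, values as listed above), roughly:

  <DiskAccessMode>auto</DiskAccessMode>
  <MemtableThroughputInMB>1024</MemtableThroughputInMB>
  <MemtableOperationsInMillions>7</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>
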
* The JVM options are:
JVM_OPTS="-ea \
-Xms8G \
-Xmx8G \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled \
-XX:SurvivorRatio=8 \
-XX:+UseLargePages \
-XX:LargePageSizeInBytes=2m \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
-XX:+PrintHeapAtGC -Xloggc:/tmp/cloudstress/jvm.gc.log \
-XX:MaxTenuringThreshold=1 \
-XX:+HeapDumpOnOutOfMemoryError \
-Dcom.sun.management.jmxremote.port=8080 \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false"