Hi,

We're running a small two-node Cassandra cluster (1.1.4) that serves data to our web and Java applications. Since upgrading Cassandra from 1.0.8 to 1.1.4, we have started to see some odd issues.

If we run the 'ring' command from the second node, it reports that it failed to connect to port 7199 on node 1:

$ /opt/apache-cassandra-1.1.4/bin/nodetool -h XX.XX.XX.01  ring
Failed to connect to 'XX.XX.XX.01:7199': Connection refused
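
To rule out a plain network problem, a check like the one below could confirm whether anything is accepting TCP connections on the JMX port. This is only a minimal Python sketch that we have not run against the cluster; the function name port_is_open and the 3-second timeout are our own choices, and the host stays a placeholder.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


# Example use (replace the placeholder with the real address of node 1):
#   port_is_open("XX.XX.XX.01", 7199)
```

If this returns False from node 2 but True when run on node 1 itself, the port is probably bound to a local interface only or blocked in between; if it is False on both, nothing is listening on 7199.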


We're using a Network Monitoring System (NMS) and Monit to monitor the servers. In the NMS, average CPU usage has increased to around 500% on our quad-core Xeon servers with 16 GB RAM, and occasionally Monit shows the 1-minute load average going above 7. Is this common? Does this happen to everyone else? And why is the load so spiky? We can't find anything in the Cassandra logs indicating that something is up (such as a slow GC or compaction), and there's no corresponding traffic spike in the application either. Should we just add more nodes if any single one gets CPU spikes?

Another explanation could be that we've misconfigured it. We're running a pretty much default configuration, and each node has 16 GB of RAM.

We have a single keyspace with 15 to 20 column families, RF=2, and 260 GB of actual data. The top and I/O stats are below for reference:

top - 14:21:51 up 29 days,  9:52,  1 user,  load average: 6.59, 3.16, 1.42
Tasks: 163 total,   2 running, 161 sleeping,   0 stopped,   0 zombie
Cpu0  : 29.0%us,  0.0%sy,  0.0%ni, 71.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 28.0%us,  0.0%sy,  0.0%ni, 72.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 13.3%us,  0.0%sy,  0.0%ni, 86.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 23.5%us,  0.7%sy,  0.0%ni, 75.5%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  : 89.4%us,  0.3%sy,  0.0%ni, 10.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu5  : 29.2%us,  0.0%sy,  0.0%ni, 70.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 25.1%us,  0.0%sy,  0.0%ni, 74.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 24.3%us,  0.0%sy,  0.0%ni, 72.0%id,  0.0%wa,  2.3%hi,  1.3%si,  0.0%st
Mem:  16427844k total, 16317416k used,   110428k free,   128824k buffers
Swap:        0k total,        0k used,        0k free, 11344696k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                   
 5284 root      18   0  265g 7.7g 3.6g S 266.6 49.0 474:24.38 java -ea -javaagent:/opt/apache-cassandra-1.1.4/bin/../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:Thr
    1 root      15   0 10368  660  548 S  0.0  0.0   0:01.64 init [3]                                                                                                  
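
As a sanity check on the numbers above, the per-core %us figures can be added up and compared with the %CPU that top reports for the java process (266.6). A small Python sketch of that arithmetic, with the values copied by hand from the output above; the variable names are ours:

```python
# Per-core %us values copied from the top output above (Cpu0..Cpu7).
user_pct = [29.0, 28.0, 13.3, 23.5, 89.4, 29.2, 25.1, 24.3]

total_user = round(sum(user_pct), 1)  # user CPU summed across all logical CPUs
ceiling = 100 * len(user_pct)         # 800% if every logical CPU were saturated

print(f"total user CPU: {total_user}% of a possible {ceiling}%")
```

The sum comes to about 262%, close to the java figure, and since top lists eight logical CPUs the ceiling on this scale is 800%.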

# iostat -xmn 2 10
-x and -n options are mutually exclusive

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.77    0.03    0.54    0.98    0.00   88.68

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.59     3.97  5.54  0.42     0.20     0.02    75.52     0.11   19.10   3.55   2.11
sda1              0.00     0.00  0.01  0.00     0.00     0.00    88.69     0.00    1.36   1.31   0.00
sda2              0.59     3.97  5.53  0.42     0.20     0.02    75.51     0.11   19.12   3.55   2.11
sdb               1.54     7.82 10.39  0.64     0.28     0.03    57.77     0.36   32.61   4.27   4.70
sdb1              1.54     7.82 10.39  0.64     0.28     0.03    57.77     0.36   32.61   4.27   4.70
dm-0              0.00     0.00  1.73  0.62     0.02     0.00    19.27     0.02    6.75   0.90   0.21
dm-1              0.00     0.00 16.32 12.23     0.46     0.05    36.47     0.50   17.67   2.07   5.92
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     8.00     0.00    7.10   3.41   0.00

Device:                   rMB_nor/s    wMB_nor/s    rMB_dir/s    wMB_dir/s    rMB_svr/s    wMB_svr/s     ops/s    rops/s    wops/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.46    0.00    0.00    0.19    0.00   87.35

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     2.50  0.00  1.00     0.00     0.01    28.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     2.50  0.00  1.00     0.00     0.01    28.00     0.00    0.00   0.00   0.00
sdb               0.00     4.50  0.50  1.50     0.00     0.02    28.00     0.01    6.00   6.00   1.20
sdb1              0.00     4.50  0.50  1.50     0.00     0.02    28.00     0.01    6.00   6.00   1.20
dm-0              0.00     0.00  0.50  4.50     0.00     0.02     8.80     0.04    8.00   2.40   1.20
dm-1              0.00     0.00  0.00  5.00     0.00     0.02     8.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:                   rMB_nor/s    wMB_nor/s    rMB_dir/s    wMB_dir/s    rMB_svr/s    wMB_svr/s     ops/s    rops/s    wops/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.52    0.00    0.00    0.00    0.00   87.48

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:                   rMB_nor/s    wMB_nor/s    rMB_dir/s    wMB_dir/s    rMB_svr/s    wMB_svr/s     ops/s    rops/s    wops/s


Please help us improve the performance of the Cassandra cluster and fix these issues.
--


Thanks & Regards

Adeel Akbar