Dear Aaron,

Thank you so much for your help in resolving this issue. I have found that read latency has increased significantly. Please find the attached OpsCenter graphs for further clarification;




Thanks & Regards

Adeel Akbar

On 10/10/2012 12:26 AM, aaron morton wrote:
RF=2
I would recommend moving to RF 3; the QUORUM for RF 2 is 2.
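The arithmetic behind this recommendation, as a minimal sketch (illustrative only, not Cassandra code):

```shell
# QUORUM requires a majority of replicas: floor(RF / 2) + 1.
# With RF=2, QUORUM is 2, so no replica may be down; with RF=3 it is
# still 2, so one node can be down and QUORUM reads/writes still succeed.
for rf in 2 3 5; do
    echo "RF=$rf -> QUORUM=$(( rf / 2 + 1 ))"
done
```

In other words, RF=2 gives you the storage cost of replication without the availability benefit at QUORUM.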

We can't find anything in the Cassandra logs indicating that something's up (such as a slow GC or compaction), and there's no corresponding traffic spike in the application either.
Does the CPU load correlate with compaction or repair times?

The node is not waiting on IO and is using all the available CPU, which is a good thing. Have you seen an increase in latency?
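One way to check that correlation is to watch compaction and thread-pool activity while the spike is happening. A sketch, guarded so it degrades gracefully on a box without nodetool:

```shell
# Check whether CPU spikes coincide with compaction or repair activity.
if command -v nodetool >/dev/null 2>&1; then
    nodetool -h localhost compactionstats   # active and pending compactions
    nodetool -h localhost tpstats           # thread-pool backlogs (repair shows up in AntiEntropyStage)
else
    echo "nodetool not on PATH"
fi
```

Comparing the timestamps of busy compactions or repair sessions against the NMS graphs should show whether they line up.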

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 8/10/2012, at 10:25 PM, Adeel Akbar <adeel.akbar@panasiangroup.com> wrote:

Hi,

We're running a small Cassandra cluster (1.1.4) with two nodes, serving data to our web and Java applications. After upgrading Cassandra from 1.0.8 to 1.1.4, we've started to see some weird issues.

If we run the 'ring' command from the second node, it shows that it failed to connect to port 7199 on node 1.

$ /opt/apache-cassandra-1.1.4/bin/nodetool -h XX.XX.XX.01  ring
Failed to connect to 'XX.XX.XX.01:7199': Connection refused
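"Connection refused" on 7199 usually means nothing is listening on the JMX port on that node, or a firewall is rejecting the connection. A quick check, run on node 1 itself (the availability of ss/netstat is an assumption about the box):

```shell
# Is anything listening on the JMX port on this node?
(ss -ltn 2>/dev/null || netstat -ltn 2>/dev/null) | grep ':7199' \
    || echo "nothing listening on 7199"
# For Cassandra 1.1, the JMX port is set via JMX_PORT in conf/cassandra-env.sh.
```

If the port is listening locally but refused remotely, look at firewall rules and the JMX bind address between the two nodes.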


We're using a Network Monitoring System (NMS) and Monit to monitor the servers. In the NMS, average CPU usage has increased to around 500% on our quad-core Xeon servers with 16 GB RAM, and occasionally through Monit we can see the 1-minute load average go above 7. Is this common? Does this happen to everyone else? And why the spikiness in load? We can't find anything in the Cassandra logs indicating that something's up (such as a slow GC or compaction), and there's no corresponding traffic spike in the application either. Should we just add more nodes if any single one gets CPU spikes?

Another explanation could be that we've misconfigured it. We're running pretty much the default config, and each node has 16 GB of RAM.

We have a single keyspace with 15 to 20 column families, RF=2, and 260 GB of actual data. Please find top and I/O stats below for further reference;

top - 14:21:51 up 29 days,  9:52,  1 user,  load average: 6.59, 3.16, 1.42
Tasks: 163 total,   2 running, 161 sleeping,   0 stopped,   0 zombie
Cpu0  : 29.0%us,  0.0%sy,  0.0%ni, 71.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 28.0%us,  0.0%sy,  0.0%ni, 72.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 13.3%us,  0.0%sy,  0.0%ni, 86.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 23.5%us,  0.7%sy,  0.0%ni, 75.5%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  : 89.4%us,  0.3%sy,  0.0%ni, 10.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu5  : 29.2%us,  0.0%sy,  0.0%ni, 70.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 25.1%us,  0.0%sy,  0.0%ni, 74.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 24.3%us,  0.0%sy,  0.0%ni, 72.0%id,  0.0%wa,  2.3%hi,  1.3%si,  0.0%st
Mem:  16427844k total, 16317416k used,   110428k free,   128824k buffers
Swap:        0k total,        0k used,        0k free, 11344696k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                   
 5284 root      18   0  265g 7.7g 3.6g S 266.6 49.0 474:24.38 java -ea -javaagent:/opt/apache-cassandra-1.1.4/bin/../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:Thr
    1 root      15   0 10368  660  548 S  0.0  0.0   0:01.64 init [3]                                                                                                  

# iostat -xmn 2 10
-x and -n options are mutually exclusive

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.77    0.03    0.54    0.98    0.00   88.68

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.59     3.97  5.54  0.42     0.20     0.02    75.52     0.11   19.10   3.55   2.11
sda1              0.00     0.00  0.01  0.00     0.00     0.00    88.69     0.00    1.36   1.31   0.00
sda2              0.59     3.97  5.53  0.42     0.20     0.02    75.51     0.11   19.12   3.55   2.11
sdb               1.54     7.82 10.39  0.64     0.28     0.03    57.77     0.36   32.61   4.27   4.70
sdb1              1.54     7.82 10.39  0.64     0.28     0.03    57.77     0.36   32.61   4.27   4.70
dm-0              0.00     0.00  1.73  0.62     0.02     0.00    19.27     0.02    6.75   0.90   0.21
dm-1              0.00     0.00 16.32 12.23     0.46     0.05    36.47     0.50   17.67   2.07   5.92
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     8.00     0.00    7.10   3.41   0.00

Device:                   rMB_nor/s    wMB_nor/s    rMB_dir/s    wMB_dir/s    rMB_svr/s    wMB_svr/s     ops/s    rops/s    wops/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.46    0.00    0.00    0.19    0.00   87.35

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     2.50  0.00  1.00     0.00     0.01    28.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     2.50  0.00  1.00     0.00     0.01    28.00     0.00    0.00   0.00   0.00
sdb               0.00     4.50  0.50  1.50     0.00     0.02    28.00     0.01    6.00   6.00   1.20
sdb1              0.00     4.50  0.50  1.50     0.00     0.02    28.00     0.01    6.00   6.00   1.20
dm-0              0.00     0.00  0.50  4.50     0.00     0.02     8.80     0.04    8.00   2.40   1.20
dm-1              0.00     0.00  0.00  5.00     0.00     0.02     8.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:                   rMB_nor/s    wMB_nor/s    rMB_dir/s    wMB_dir/s    rMB_svr/s    wMB_svr/s     ops/s    rops/s    wops/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12.52    0.00    0.00    0.00    0.00   87.48

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:                   rMB_nor/s    wMB_nor/s    rMB_dir/s    wMB_dir/s    rMB_svr/s    wMB_svr/s     ops/s    rops/s    wops/s
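As an aside, the error printed at the top of the iostat paste suggests this sysstat build rejects combining -x and -n, so the extended and NFS views are better run separately. A sketch, guarded in case sysstat is not installed (the shorter sample count here is just for illustration):

```shell
# Extended per-device stats in MB; 2-second interval, 3 samples
# (the original run used "2 10"):
if command -v iostat >/dev/null 2>&1; then
    iostat -xm 2 3
else
    echo "iostat (sysstat) not installed"
fi
```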


Please help us improve the performance of the Cassandra cluster and fix these issues.
--


Thanks & Regards

Adeel Akbar