incubator-cassandra-user mailing list archives

From Oleg Anastasjev <olega...@gmail.com>
Subject Cassandra cluster does not tolerate single node failure
Date Wed, 07 Apr 2010 14:18:58 GMT
Hello,

I am testing how a Cassandra cluster behaves under several failure
scenarios, and I am stuck on the very first test: what happens if one node of
the cluster becomes unavailable.
I have four 4 GB nodes under a write-mostly test load. Normally the cluster
sustains about 12000 ops/second. The replication factor is 2.
After a while, I shut down node 4, and the whole cluster's throughput drops to
60 ops/second (yes, two hundred times slower!). I checked this on both 0.5.0
and 0.5.1.

This is my ring after shutdown:
Address       Status     Load          Range                                      Ring
                                       127605887595351923798765477786913079293
62.85.54.46   Up         119.03 MB     0                                          |<--|
62.85.54.47   Up         118.76 MB     42535295865117307932921825928971026431     |   |
62.85.54.48   Up         103.95 MB     85070591730234615865843651857942052862     |   |
62.85.54.49   Down       0 bytes       127605887595351923798765477786913079293    |-->|


After a bit of investigation, I found that 62.85.54.46 and 62.85.54.47
started to starve in the row mutation stage:
46:
ROW-MUTATION-STAGE               32       313        1875089
47:
ROW-MUTATION-STAGE               32      3042        1872123
but 48 is not:
ROW-MUTATION-STAGE                0         0        1668532
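
For reference, the three numbers on each ROW-MUTATION-STAGE line can be compared mechanically. The Python sketch below assumes the columns are active, pending, and completed task counts, in that order (the usual thread pool stats layout; the message itself does not label them). The helper names and the pending threshold of 100 are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class StageStats(NamedTuple):
    pool: str
    active: int     # assumed: tasks currently executing
    pending: int    # assumed: tasks queued behind them
    completed: int  # assumed: tasks finished since startup


def parse_stage_line(line: str) -> StageStats:
    """Split one whitespace-separated stats line into named fields."""
    pool, active, pending, completed = line.split()
    return StageStats(pool, int(active), int(pending), int(completed))


def is_backed_up(stats: StageStats, pending_threshold: int = 100) -> bool:
    """Flag a stage whose queue has grown past an arbitrary threshold."""
    return stats.pending >= pending_threshold


# Lines copied from the message, keyed by node address.
samples = {
    "62.85.54.46": "ROW-MUTATION-STAGE               32       313        1875089",
    "62.85.54.47": "ROW-MUTATION-STAGE               32      3042        1872123",
    "62.85.54.48": "ROW-MUTATION-STAGE                0         0        1668532",
}

for node, line in samples.items():
    stats = parse_stage_line(line)
    state = "backed up" if is_backed_up(stats) else "ok"
    print(f"{node}: active={stats.active} pending={stats.pending} {state}")
```

Under that reading, 46 and 47 both have 32 active tasks with a growing queue behind them, while 48 is idle.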

All these mutations go to HintsColumnFamily;
cfstats shows activity in this CF only on nodes 46 and 47:
Keyspace: system
        Read Count: 0
        Read Latency: NaN ms.
        Write Count: 4953
        Write Latency: 386.766 ms.
        Pending Tasks: 0
                Column Family: LocationInfo
                Memtable Columns Count: 0
                Memtable Data Size: 0
                Memtable Switch Count: 1
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 0
                Write Latency: NaN ms.
                Pending Tasks: 0

                Column Family: HintsColumnFamily
                Memtable Columns Count: 173506
                Memtable Data Size: 1648344
                Memtable Switch Count: 1
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 4954
                Write Latency: 387.473 ms.
                Pending Tasks: 0
Please note the enormously slow write latency.
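
As a rough sanity check, the reported latency can be turned into a throughput ceiling with Little's law (throughput = concurrency / latency). This sketch assumes that the 32 active ROW-MUTATION-STAGE tasks correspond to 32 worker threads and that each hint write takes the average latency reported by cfstats; neither is confirmed by the message, and the function name is hypothetical.

```python
def max_throughput(concurrency: int, latency_ms: float) -> float:
    """Little's law: at most `concurrency` operations complete per latency period."""
    return concurrency / (latency_ms / 1000.0)


# Figures taken from the message.
mutation_threads = 32        # assumed from the "active" count on 46 and 47
hint_write_latency_ms = 387.473
normal_rate = 12000          # ops/second before the shutdown
degraded_rate = 60           # ops/second after the shutdown

ceiling = max_throughput(mutation_threads, hint_write_latency_ms)
slowdown = normal_rate / degraded_rate

print(f"hint write ceiling per node: {ceiling:.1f} ops/second")
print(f"observed slowdown: {slowdown:.0f}x")
```

Under these assumptions the ceiling comes out at a little over 80 hint writes per second per node, which is the same order of magnitude as the observed cluster rate, so a mutation stage saturated by slow hint writes would be consistent with the numbers.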

Interestingly, issuing the "nodeprobe flush system" command on nodes 46 and 47
speeds up processing for a short period of time, but then it quickly drops back
to 66 ops/second.

I suspect that these nodes are creating a very large number of subcolumns in a
supercolumn of HintsColumnFamily in the memtable.
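
One possible explanation for why hints pile up on 46 and 47 but not on 48 can be sketched with a simplified ring model. The Python below assumes (a) each range is replicated to its owner plus the next node clockwise, giving RF = 2, and (b) a hint for a down replica is stored on the next live node clockwise that is not already a replica for that range. Both rules are assumptions made for illustration and are not confirmed here as the exact 0.5.x behavior; all function names are hypothetical.

```python
# Nodes in ring order, as listed in the ring output.
RING = ["62.85.54.46", "62.85.54.47", "62.85.54.48", "62.85.54.49"]
DOWN = {"62.85.54.49"}
RF = 2


def replicas_for(primary_index: int, rf: int = RF) -> list:
    """Assumed placement: the range owner plus the next rf - 1 nodes clockwise."""
    return [RING[(primary_index + i) % len(RING)] for i in range(rf)]


def hint_target(dead_node: str, replicas: list, down: set):
    """Assumed rule: first live, non-replica node clockwise from the dead node."""
    start = RING.index(dead_node)
    for step in range(1, len(RING)):
        candidate = RING[(start + step) % len(RING)]
        if candidate not in down and candidate not in replicas:
            return candidate
    return None


def hint_holders(down: set = DOWN) -> dict:
    """Map each hint-holding node to the range owners it stores hints for."""
    holders = {}
    for primary_index, primary in enumerate(RING):
        replicas = replicas_for(primary_index)
        for node in replicas:
            if node in down:
                target = hint_target(node, replicas, down)
                holders.setdefault(target, []).append(primary)
    return holders


for holder, ranges in hint_holders().items():
    print(f"{holder} stores hints for ranges owned by {ranges}")
```

In this model, writes to the range owned by 48 leave their hint on 46, and writes to the range owned by 49 leave their hint on 47, so roughly half of all writes would produce a hint and 48 would never store one. That matches the cfstats observation above.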

What can I do to make a Cassandra cluster tolerate a single node failure better?





