incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: Cassandra cluster does not tolerate single node failure
Date Wed, 07 Apr 2010 15:23:48 GMT
This is a known problem with 0.5 that was addressed in 0.6.

On Wed, Apr 7, 2010 at 9:18 AM, Oleg Anastasjev <oleganas@gmail.com> wrote:
> Hello,
>
> I am doing some tests of Cassandra cluster behavior under several failure
> scenarios, and I am stuck on the very first test: what happens if one node of
> the cluster becomes unavailable.
> I have four 4 GB nodes loaded with a write-mostly test. Normally it runs at a rate
> of about 12000 ops/second. The replication factor is 2.
> After a while, I shut down node 4, and the whole cluster's performance drops to
> 60 (yes, two hundred times slower!) ops per second. I checked this on both the
> 0.5.0 and 0.5.1 versions.
>
> This is my ring after shutdown:
> Address       Status     Load          Range
>  Ring
>                                       127605887595351923798765477786913079293
> 62.85.54.46   Up         119.03 MB     0
>  |<--|
> 62.85.54.47   Up         118.76 MB     42535295865117307932921825928971026431
>  |   |
> 62.85.54.48   Up         103.95 MB     85070591730234615865843651857942052862
>  |   |
> 62.85.54.49   Down       0 bytes       127605887595351923798765477786913079293
>  |-->|
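[For reference, a minimal sketch of how RF=2 successor replication maps tokens on a ring like the one above. This is plain Python for illustration only, not Cassandra's actual replication code; it shows that every range replicated on the down node 62.85.54.49 loses one of its two replicas, so writes to those ranges must be hinted.]

```python
from bisect import bisect_left

# Tokens and addresses copied from the ring output above.
RING = [
    (0, "62.85.54.46"),
    (42535295865117307932921825928971026431, "62.85.54.47"),
    (85070591730234615865843651857942052862, "62.85.54.48"),
    (127605887595351923798765477786913079293, "62.85.54.49"),
]

def replicas(token, rf=2):
    """The primary replica is the first node whose token is >= the key's
    token (wrapping around the ring); further replicas are its successors.
    Simple successor placement, assumed here for illustration."""
    tokens = [t for t, _ in RING]
    i = bisect_left(tokens, token) % len(RING)
    return [RING[(i + k) % len(RING)][1] for k in range(rf)]

# A key hashing just below .49's token is replicated on .49 and, wrapping
# around, on .46 -- so with .49 down, each such write has only one live
# replica and generates a hint.
print(replicas(10**38))  # ['62.85.54.49', '62.85.54.46']
```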
>
>
> After doing a bit of investigation, I found that 62.85.54.46 and 62.85.54.47
> started to starve in the row mutation stage:
> 46:
> ROW-MUTATION-STAGE               32       313        1875089
> 47:
> ROW-MUTATION-STAGE               32      3042        1872123
> but 48 is not:
> ROW-MUTATION-STAGE                0         0        1668532
>
> All these mutations go to HintsColumnFamily --
> cfstats shows activity in this CF only on nodes 46 and 47:
> Keyspace: system
>        Read Count: 0
>        Read Latency: NaN ms.
>        Write Count: 4953
>        Write Latency: 386.766 ms.
>        Pending Tasks: 0
>                Column Family: LocationInfo
>                Memtable Columns Count: 0
>                Memtable Data Size: 0
>                Memtable Switch Count: 1
>                Read Count: 0
>                Read Latency: NaN ms.
>                Write Count: 0
>                Write Latency: NaN ms.
>                Pending Tasks: 0
>
>                Column Family: HintsColumnFamily
>                Memtable Columns Count: 173506
>                Memtable Data Size: 1648344
>                Memtable Switch Count: 1
>                Read Count: 0
>                Read Latency: NaN ms.
>                Write Count: 4954
>                Write Latency: 387.473 ms.
>                Pending Tasks: 0
> Please note the enormously slow write latency.
>
> Interestingly, issuing a "nodeprobe flush system" command on nodes 46 and 47
> speeds up processing for a short period of time, but it then quickly drops back
> to 66 ops/second.
>
> I suspect that these nodes are creating a very large number of subcolumns in the
> HintsColumnFamily supercolumn in the memtable.
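[That suspicion fits the cfstats above: ~173k subcolumns in one memtable row and ~387 ms per write. As a generic illustration of the cost of piling every hint into one sorted wide row -- plain Python, not Cassandra's memtable code -- each insert pays for the row's current width, so total work grows roughly quadratically with the number of subcolumns, which matches the reply that 0.6 addresses this.]

```python
from bisect import bisect_left

def wide_row_insert_work(n):
    """Insert n subcolumns (pseudo-random keys) into one sorted list,
    counting the elements shifted per insert.  Each insert costs O(width
    of the row), so total work grows roughly quadratically with n."""
    row, shifted = [], 0
    for i in range(n):
        key = (i * 2654435761) % (1 << 32)  # deterministic key scatter
        pos = bisect_left(row, key)
        shifted += len(row) - pos  # list.insert shifts this many elements
        row.insert(pos, key)
    return shifted

# Ten times the subcolumns costs far more than ten times the work.
small = wide_row_insert_work(200)
large = wide_row_insert_work(2000)
print(large > 20 * small)
```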
>
> What can I do to make a Cassandra cluster better tolerate a single node failure?
>
