cassandra-user mailing list archives

From Hiroyuki Yamada <mogwa...@gmail.com>
Subject A cluster (RF=3) not recovering after two nodes are stopped
Date Wed, 24 Apr 2019 06:27:58 GMT
Hello,

I faced a weird issue when recovering a cluster after two nodes are stopped.
It is easily reproducible and looks like a bug or an issue to fix,
so let me write down the steps to reproduce.

=== STEPS TO REPRODUCE ===
* Create a 3-node cluster with RF=3
   - node1(seed), node2, node3
* Start requests to the cluster with cassandra-stress (it keeps running
through all of the steps below)
   - what we did: cassandra-stress mixed cl=QUORUM duration=10m -errors ignore -node node1,node2,node3 -rate threads>=16 threads<=256
* Stop node3 normally (with systemctl stop)
   - the system is still available because a quorum of nodes is
still available
* Stop node2 normally (with systemctl stop)
   - the system is NOT available after it's stopped.
   - the client gets `UnavailableException: Not enough replicas
available for query at consistency QUORUM`
   - the client gets the errors right away (within a few ms)
   - so far it's all expected
* Wait for 1 minute
* Bring up node2
   - The issue happens here.
   - the client gets `ReadTimeoutException` or `WriteTimeoutException`,
depending on whether the request is a read or a write, even after node2
is up
   - the client gets these errors after about 5000 ms or 2000 ms, which
are the request timeouts for read and write requests, respectively
   - what node1 reports with `nodetool status` and what node2 reports
are not consistent. (node2 thinks node1 is down)
   - It takes a very long time to recover from this state
=== STEPS TO REPRODUCE ===
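
As a minimal sketch of why the two stop steps above behave as expected: QUORUM needs floor(RF/2) + 1 replicas to respond, so with RF=3 the cluster stays available with two live nodes and becomes unavailable with one. The function names below are illustrative only and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at consistency level QUORUM."""
    return replication_factor // 2 + 1


def is_available(replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas are alive to satisfy a QUORUM request."""
    return live_replicas >= quorum(replication_factor)


RF = 3
# 3 live: all nodes up; 2 live: node3 stopped; 1 live: node2 also stopped
for live in (3, 2, 1):
    print(f"live={live} quorum={quorum(RF)} available={is_available(RF, live)}")
```

This only covers the expected part of the scenario. It does not explain the timeouts seen after node2 comes back, when two replicas are up again.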

Is this supposed to happen?
If we don't start cassandra-stress, it's all fine.

Some workarounds we found to recover from this state are the following:
* Restarting node1; the cluster recovers right after node1 is restarted
* Setting a lower value for dynamic_snitch_reset_interval_in_ms (60000
or so)

I don't think either of them is a really good solution.
Can anyone explain what is going on, and what is the best way to
prevent it or to recover from it?
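
For anyone trying to reproduce this, a rough way to spot the inconsistent views is to capture the `nodetool status` output on node1 and node2 and compare the two. The sketch below assumes the usual layout in which each node row starts with a two-letter status/state code (such as UN or DN) followed by the address. The helper names are hypothetical.

```python
import re

# A node row starts with Up/Down plus Normal/Leaving/Joining/Moving, then the address.
_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def parse_status(text):
    """Map each node address to its two-letter code from `nodetool status` text."""
    states = {}
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def disagreements(view_a, view_b):
    """Return addresses whose state differs between two parsed views."""
    common = view_a.keys() & view_b.keys()
    return {
        addr: (view_a[addr], view_b[addr])
        for addr in common
        if view_a[addr] != view_b[addr]
    }
```

Passing the parsed output from node1 and node2 to `disagreements` gives an empty dict when the two nodes agree, and otherwise lists each address with the pair of states the two nodes report for it.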

Thanks,
Hiro
