cassandra-user mailing list archives

From Paul Mena <pm...@whoi.edu>
Subject RE: Cassandra is not showing a node up hours after restart
Date Fri, 06 Dec 2019 19:49:49 GMT
As we are still without a functional Cassandra cluster in our development environment, I thought
I’d try restarting the same node (one of 4 in the cluster) with the following command:

ip=$(cat /etc/hostname); nodetool disablethrift && nodetool disablebinary &&
sleep 5 && nodetool disablegossip && nodetool drain && sleep 10 &&
sudo service cassandra restart &&
until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh $ip > /dev/null 2>&1; do
  echo "Node $ip is still DOWN"; sleep 10
done && echo "Node $ip is now UP"

The above command returned “Node is now UP” after about 40 seconds, confirmed on “node001”
via “nodetool status”:

user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns    Host ID                               Rack
UN  192.168.187.121  539.43 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  633.92 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  576.31 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  628.5 GB   256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1

As was the case before, running “nodetool status” on any of the other nodes shows that
“node001” is still down:

user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns    Host ID                               Rack
DN  192.168.187.121  538.94 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  634.04 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  576.42 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  628.56 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1

Is it inadvisable to continue with the rolling restart?

Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539

From: Shalom Sagges <shalomsagges@gmail.com>
Sent: Tuesday, November 26, 2019 12:59 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra is not showing a node up hours after restart

Hi Paul,

From the gossipinfo output, it looks like the node's IP address and rpc_address are different.
/192.168.187.121 vs RPC_ADDRESS:192.168.185.121
It's also worth checking for a schema disagreement between nodes: compare the schema_id values
in the gossipinfo output (here fd2dcb4b-ca62-30df-b8f2-d3fd774f2801 on both node001 and node002,
so they appear to match). You can run nodetool describecluster to see this as well.
So I suggest changing the rpc_address to the node's IP address, or setting it to 0.0.0.0;
that should resolve the issue.
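A quick way to see what a node binds versus what it advertises is to pull the relevant lines straight out of cassandra.yaml. This is a sketch: the config path below is the common package default and may differ on your install.

```shell
# Show the address Cassandra binds for internode traffic (listen_address)
# vs. the one it advertises to clients (rpc_address). A mismatch such as
# 192.168.187.121 vs 192.168.185.121 is worth a close look.
# The config path is an assumption; adjust for your install.
CONF="${CONF:-/etc/cassandra/cassandra.yaml}"
if [ -f "$CONF" ]; then
    grep -E '^(listen_address|rpc_address|broadcast_rpc_address):' "$CONF"
else
    echo "config not found at $CONF"
fi
```

Comparing this output across all four nodes, alongside nodetool gossipinfo, should make any address mismatch obvious.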

Hope this helps!


On Tue, Nov 26, 2019 at 4:05 AM Inquistive allen <inquiallen@gmail.com>
wrote:
Hello,

Check and compare the following parameters:

1. The Java version should ideally match across all nodes in the cluster.
2. Check that port 7000 is open between the nodes, using telnet or nc.
3. Look in the system logs for clues as to why gossip is failing.

Please confirm the above.
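For item 2, the port probe can be done without extra tools using bash's built-in /dev/tcp. A minimal sketch; the hostnames below are placeholders taken from the thread:

```shell
# Probe Cassandra's internode port (7000, the default storage_port)
# from this node to each peer. Hostnames are placeholders.
check_port() {   # usage: check_port <host> <port>
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "$1:$2 OPEN"
    else
        echo "$1:$2 CLOSED"
    fi
}
for host in node002 node003 node004; do
    check_port "$host" 7000
done
```

Run it from each node in turn; every peer should report OPEN.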

Thanks


On Tue, 26 Nov, 2019, 2:50 AM Paul Mena, <pmena@whoi.edu>
wrote:
NTP was restarted on the Cassandra nodes, but unfortunately I’m still getting the same result:
the restarted node does not appear to be rejoining the cluster.

Here’s another data point: “nodetool gossipinfo”, when run from the restarted node (“node001”)
shows a status of “normal”:

user@node001=> nodetool -u gossipinfo
/192.168.187.121
  generation:1574364410
  heartbeat:209150
  NET_VERSION:8
  RACK:rack1
  STATUS:NORMAL,-104847506331695918
  RELEASE_VERSION:2.1.9
  SEVERITY:0.0
  LOAD:5.78684155614E11
  HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
  SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
  DC:datacenter1
  RPC_ADDRESS:192.168.185.121

When run from one of the other nodes, however, node001’s status is shown as “shutdown”:

user@node002=> nodetool gossipinfo
/192.168.187.121
  generation:1491825076
  heartbeat:2147483647
  STATUS:shutdown,true
  RACK:rack1
  NET_VERSION:8
  LOAD:5.78679987693E11
  RELEASE_VERSION:2.1.9
  DC:datacenter1
  SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
  HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
  RPC_ADDRESS:192.168.185.121
  SEVERITY:0.0


Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539

From: Paul Mena
Sent: Monday, November 25, 2019 9:29 AM
To: user@cassandra.apache.org
Subject: RE: Cassandra is not showing a node up hours after restart

I’ve just discovered that NTP is not running on any of these Cassandra nodes, and that the
timestamps are all over the map. Could this be causing my issue?

user@remote=> ansible pre-prod-cassandra -a date
node001.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 13:58:17 UTC 2019

node004.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 14:07:20 UTC 2019

node003.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 13:57:06 UTC 2019

node002.intra.myorg.org | CHANGED | rc=0 >>
Mon Nov 25 14:07:22 UTC 2019
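The spread above is easy to quantify. A minimal sketch, assuming GNU date, with the timestamps copied from the ansible output:

```shell
# Rough measure of the clock skew visible in the ansible output above:
# convert each reported timestamp to epoch seconds and take max - min.
min=9999999999; max=0
for ts in "Mon Nov 25 13:58:17 UTC 2019" \
          "Mon Nov 25 14:07:20 UTC 2019" \
          "Mon Nov 25 13:57:06 UTC 2019" \
          "Mon Nov 25 14:07:22 UTC 2019"; do
    s=$(date -u -d "$ts" +%s)
    if [ "$s" -lt "$min" ]; then min=$s; fi
    if [ "$s" -gt "$max" ]; then max=$s; fi
done
echo "max skew: $((max - min)) seconds"   # → max skew: 616 seconds
```

A skew of over ten minutes is far beyond what NTP-synced nodes should show, and is large enough to matter for timestamp-based conflict resolution.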

Paul Mena
Senior Application Administrator
WHOI - Information Services
508-289-3539

From: Inquistive allen <inquiallen@gmail.com>
Sent: Monday, November 25, 2019 2:46 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra is not showing a node up hours after restart

Hello team,

Just to add to the discussion: one may run nodetool disablebinary, followed by nodetool
disablethrift, followed by nodetool drain. nodetool drain also does the work of nodetool flush,
plus declaring to the cluster that the node is down and not accepting traffic.

Thanks


On Mon, 25 Nov, 2019, 12:55 AM Surbhi Gupta, <surbhi.gupta01@gmail.com>
wrote:
Before Cassandra is shut down, nodetool drain should be executed first. As soon as you run
nodetool drain, the other nodes will see this node as down and no new traffic will come to it.
I generally give a 10-second gap between nodetool drain and the Cassandra stop.
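Put together, the stop sequence described in this thread looks like the following. A sketch only: it needs a live node, and the service name may differ by install.

```shell
# Graceful single-node stop: quiesce clients, drain, wait, then stop.
nodetool disablebinary      # stop accepting native-protocol (CQL) clients
nodetool disablethrift      # stop accepting Thrift clients
nodetool drain              # flush memtables; peers mark this node down
sleep 10                    # the ~10-second gap suggested above
sudo service cassandra stop
```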

On Sun, Nov 24, 2019 at 9:52 AM Paul Mena <pmena@whoi.edu>
wrote:

Thank you for the replies. I had made no changes to the config before the rolling restart.

I can try another restart but was wondering if I should do it differently. I had simply done
"service cassandra stop" followed by "service cassandra start". Since then I've seen some
suggestions to precede the shutdown with "nodetool disablegossip" and/or "nodetool drain".
Are these commands advisable? Are any other commands recommended, either before the shutdown
or after the startup?

Thanks again!

Paul

________________________________
From: Naman Gupta <naman.gupta@girnarsoft.com>
Sent: Sunday, November 24, 2019 11:18:14 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra is not showing a node up hours after restart

Did you change the name of datacenter or any other config changes before the rolling restart?

On Sun, Nov 24, 2019 at 8:49 PM Paul Mena <pmena@whoi.edu>
wrote:
I am in the process of doing a rolling restart on a 4-node cluster running Cassandra 2.1.9.
I stopped and started Cassandra on node 1 via "service cassandra stop/start", and noted nothing
unusual in either system.log or cassandra.log. Doing a "nodetool status" from node 1 shows
all four nodes up:

user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns    Host ID                               Rack
UN  192.168.187.121  538.95 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.05 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1

But doing the same command from any of the other 3 nodes shows node 1 still down:


user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load       Tokens  Owns    Host ID                               Rack
DN  192.168.187.121  538.94 GB  256     ?       c99cf581-f4ae-4aa9-ab37-1a114ab2429b  rack1
UN  192.168.187.122  630.72 GB  256     ?       bfa07f47-7e37-42b4-9c0b-024b3c02e93f  rack1
UN  192.168.187.123  572.73 GB  256     ?       273df9f3-e496-4c65-a1f2-325ed288a992  rack1
UN  192.168.187.124  625.04 GB  256     ?       b8639cf1-5413-4ece-b882-2161bbb8a9c3  rack1

Is there something I can do to remedy the current situation, so that I can continue with
the rolling restart?
