From: Alain RODRIGUEZ <arodrime@gmail.com>
Date: Thu, 29 Aug 2019 17:01:33 +0100
Subject: Re: Assassinate fails
To: user@cassandra.apache.org

Hello Alex,

> long time - I had to wait for a quiet week to try this. I finally did, so I
> thought I'd give you some feedback.

Thanks for taking the time to share this; I guess it might be useful for some other people around here to know the end of the story ;-).

Glad this worked for you,

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Fri, Aug 16, 2019 at 08:16, Alex <ml@aca-o.com> wrote:

> Hello Alain,
>
> Long time - I had to wait for a quiet week to try this. I finally did, so I
> thought I'd give you some feedback.
>
> Short reminder: one of the nodes of my 3.9 cluster died and I replaced it.
> But it still appeared in nodetool status, on one node with a "null" host_id
> and on another with the same host_id as its replacement. nodetool
> assassinate failed and I could not decommission or remove any other node
> on the cluster.
>
> Basically, after taking a backup and preparing another cluster in case
> anything went wrong, I did:
>
> DELETE FROM system.peers WHERE peer = '192.168.1.18';
>
> and restarted Cassandra on the two nodes still seeing the zombie node.
>
> After the first restart, the Cassandra system.log was filled with:
>
> java.lang.NullPointerException: null
> WARN  [MutationStage-2] 2019-08-15 15:31:44,735
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[MutationStage-2,5,main]
>
> So... I restarted again. The error disappeared. I ran a full repair and
> everything seems to be back in order. I could decommission a node without
> problem.
>
> Thanks for your help!
>
> Alex
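For anyone who lands on this thread with the same symptom, here is a minimal, untested sketch of how you could first confirm which nodes still carry the ghost entry before running Alex's DELETE. 192.168.1.18 is the dead node from this thread; the list of live nodes and the default ports are assumptions to adapt to your own cluster:

  #!/usr/bin/env bash
  # Check each live node: what gossip still shows vs. what its local
  # system.peers table contains for the dead node's IP.
  GHOST_IP=192.168.1.18

  for NODE in 192.168.1.20 192.168.1.21 192.168.1.22; do   # your live nodes
    echo "== $NODE =="
    # Ring view as seen from that node:
    nodetool -h "$NODE" status | grep "$GHOST_IP" || echo "not in nodetool status"
    # Local peers table on that node:
    cqlsh "$NODE" -e "SELECT peer, host_id FROM system.peers WHERE peer = '$GHOST_IP';"
  done

Only the nodes where the query still returns a row (or where nodetool status still shows the old IP) need the DELETE and a restart, as described above.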
>
> On 05.04.2019 10:55, Alain RODRIGUEZ wrote:
>
> Alex,
>
>> Well, I tried: the rolling restart did not work its magic.
>
> Sorry to hear that, and sorry for misleading you. My faith in the rolling
> restart's magical power went down a bit, but I still think it was worth a
> try :D.
>
>> @ Alain: In system.peers I see both the dead node and its replacement
>> with the same ID:
>>
>>    peer         | host_id
>>   --------------+--------------------------------------
>>    192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>>    192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>>
>> Is it expected?
>>
>> If I cannot fix this, I think I will add new nodes and remove, one by
>> one, the nodes that show the dead node in nodetool status.
>
> Well, no. This is clearly not good or expected, I would say.
>
> *tl;dr - Suggested fix:*
> What I would try in order to fix this is removing this row. It *should* be
> safe, but that's only my opinion, and on the condition that you remove
> *only* the 'ghost/dead' nodes. Any mistake here would probably be costly.
> Again, be aware you are touching a sensitive part when messing with system
> tables. Think twice, check twice, take a copy of the SSTables/a snapshot.
> Then I would go for it and observe the changes on one node first. If no
> harm is done, continue to the next node.
>
> Considering the old node is '192.168.1.18', I would run this on all nodes
> (maybe after testing on one node) to make it simple, or just run it on the
> nodes that show the ghost node(s):
>
> "DELETE FROM system.peers WHERE peer = '192.168.1.18';"
>
> Maybe you will need to restart; I think you won't even need it. I have
> good hope that this should finally fix your issue with no harm.
>
> *More context - Idea of the problem:*
> The above is clearly an issue, I would say, and most probably the source
> of your troubles here. The problem is that I lack understanding. From
> where I stand, this kind of bug should not happen anymore in Cassandra (I
> have not seen anything similar for a while).
>
> I would blame:
> - A corner case scenario (unlikely, system tables have been rather solid
> for a while). Or maybe you are using an old C* version. It *might* be
> related to this (or similar): https://issues.apache.org/jira/browse/CASSANDRA-7122
> - A really weird operation (a succession of actions might have put you in
> this state, but it is hard for me to say what)
> - KairosDB? I don't know it or what it does. Might it be less reliable
> than Cassandra is, and have led to this issue? Maybe, I have no clue once
> again.
>
> *Risk of this operation and current situation:*
> Also, I *think* the current situation is relatively 'stable' (maybe just
> some hints being stored for nothing, and possibly not being able to add
> more nodes or change the schema?). This is the kind of situation where
> 'rushing' a solution without understanding the impacts and risks can make
> things go terribly wrong. Take the time to analyse my suggested fix, maybe
> read the ticket above, etc. When you're ready, back up the data, prepare
> the DELETE command carefully and observe how one node reacts to the fix
> first.
>
> As you can see, I think it's the 'good' fix, but I'm not comfortable with
> this operation. And you should not be either :).
> To share my feeling about this operation, somewhat arbitrarily: I would
> say there is a 95% chance this does not hurt and a 90% chance it fixes the
> issue, but if something goes wrong, if we are in the 5% where it does not
> go well, there is a non-negligible probability that you will destroy your
> cluster in a very bad way. I guess what I am trying to say is: be careful,
> watch your step, make sure you remove the right line, and ensure it works
> on one node with no harm.
> I shared my feeling and I would try this fix. But it's ultimately your
> responsibility, and I won't be behind the machine when you fix it. None of
> us will.
>
> Good luck! :)
>
> C*heers,
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
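To make the "snapshot first, one node at a time" advice above concrete, here is a rough sketch of what that could look like on a single node (untested; the snapshot tag and the restart command are assumptions, adapt them to your install):

  # 1. Keep a copy of the system keyspace SSTables before touching anything.
  nodetool snapshot -t before-peer-fix system

  # 2. Remove only the ghost entry from this node's local peers table.
  cqlsh -e "DELETE FROM system.peers WHERE peer = '192.168.1.18';"

  # 3. Restart Cassandra on this node (command depends on your packaging).
  sudo systemctl restart cassandra

  # 4. Check that the ghost is gone and the ring still looks healthy
  #    before repeating on the next node.
  nodetool status
  nodetool describecluster

Only after the first node looks good would you repeat steps 2-4 on the remaining nodes that still show the ghost entry.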
>
> On Thu, Apr 4, 2019 at 19:29, Kenneth Brotman wrote:
>
>> Alex,
>>
>> According to this TLP article
>> http://thelastpickle.com/blog/2018/09/18/assassinate.html :
>>
>> Note that the LEFT status should stick around for 72 hours to ensure all
>> nodes come to the consensus that the node has been removed. So please
>> don't rush things if that's the case. Again, it's only cosmetic.
>>
>> If a gossip state will not forget a node that was removed from the
>> cluster more than a week ago:
>>
>>     Login to each node within the Cassandra cluster.
>>     Download jmxterm on each node, if nodetool assassinate is not an
>>     option.
>>     Run nodetool assassinate, or the unsafeAssassinateEndpoint command,
>>     multiple times in quick succession.
>>         I typically recommend running the command 3-5 times within 2
>>         seconds.
>>         I understand that sometimes the command takes time to return, so
>>         the "2 seconds" suggestion is less of a requirement than it is a
>>         mindset.
>>         Also, sometimes 3-5 times isn't enough. In such cases, shoot for
>>         the moon and try 20 assassination attempts in quick succession.
>>
>> What we are trying to do is to create a flood of messages requesting all
>> nodes completely forget there used to be an entry within the gossip state
>> for the given IP address. If each node can prune its own gossip state and
>> broadcast that to the rest of the nodes, we should eliminate any race
>> conditions that may exist where at least one node still remembers the
>> given IP address.
>>
>> As soon as all nodes come to agreement that they don't remember the
>> deprecated node, the cosmetic issue will no longer be a concern in any
>> system.log, nodetool describecluster command, nor nodetool gossipinfo
>> output.
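A minimal sketch of the "repeat it in quick succession" idea from the article (untested here; the IP is this thread's dead node, and the jmxterm jar name and JMX port are assumptions, not something this thread confirms):

  # Fire several assassinate attempts back to back (nodetool assassinate
  # exists from Cassandra 2.2 on).
  for i in 1 2 3 4 5; do
    nodetool assassinate 192.168.1.18
  done

  # Or, the same Gossiper operation through jmxterm, if nodetool
  # assassinate is not available:
  echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint 192.168.1.18" \
    | java -jar jmxterm-1.0.2-uber.jar -l localhost:7199 -n

As the article says, the point is to flood gossip so every node prunes and rebroadcasts the removal at roughly the same time; the exact count and timing are not critical.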
>>
>> -----Original Message-----
>> From: Kenneth Brotman [mailto:kenbrotman@yahoo.com.INVALID]
>> Sent: Thursday, April 04, 2019 10:40 AM
>> To: user@cassandra.apache.org
>> Subject: RE: Assassinate fails
>>
>> Alex,
>>
>> Did you remove the option JVM_OPTS="$JVM_OPTS
>> -Dcassandra.replace_address=address_of_dead_node" after the node started,
>> and then restart the node again?
>>
>> Are you sure there isn't a typo in the file?
>>
>> Ken
>>
>> -----Original Message-----
>> From: Kenneth Brotman [mailto:kenbrotman@yahoo.com.INVALID]
>> Sent: Thursday, April 04, 2019 10:31 AM
>> To: user@cassandra.apache.org
>> Subject: RE: Assassinate fails
>>
>> I see; system_auth is a separate keyspace.
>>
>> -----Original Message-----
>> From: Jon Haddad [mailto:jon@jonhaddad.com]
>> Sent: Thursday, April 04, 2019 10:17 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Assassinate fails
>>
>> No, it can't. As Alain (and I) have said, since the system keyspace
>> is local strategy, it's not replicated, and thus can't be repaired.
>>
>> On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman wrote:
>> >
>> > Right, could be a similar issue, same type of fix though.
>> >
>> > -----Original Message-----
>> > From: Jon Haddad [mailto:jon@jonhaddad.com]
>> > Sent: Thursday, April 04, 2019 9:52 AM
>> > To: user@cassandra.apache.org
>> > Subject: Re: Assassinate fails
>> >
>> > System != system_auth.
>> >
>> > On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman wrote:
>> > >
>> > > From Mastering Cassandra:
>> > >
>> > > Forcing read repairs at consistency ALL
>> > >
>> > > This type of repair isn't really part of the Apache Cassandra repair
>> > > paradigm at all. When it was discovered that a read repair will
>> > > trigger 100% of the time when a query is run at ALL consistency, this
>> > > method of repair started to gain popularity in the community. In some
>> > > cases, this method of forcing data consistency provided better
>> > > results than normal, scheduled repairs.
>> > >
>> > > Let's assume, for a second, that an application team is having a hard
>> > > time logging into a node in a new data center. You try to cqlsh out
>> > > to these nodes, and notice that you are also experiencing
>> > > intermittent failures, leading you to suspect that the system_auth
>> > > tables might be missing a replica or two. On one node you do manage
>> > > to connect successfully using cqlsh. One quick way to fix consistency
>> > > on the system_auth tables is to set consistency to ALL, and run an
>> > > unbound SELECT on every table, tickling each record:
>> > >
>> > > use system_auth ;
>> > > consistency ALL;
>> > > consistency level set to ALL.
>> > >
>> > > SELECT COUNT(*) FROM resource_role_permissons_index ;
>> > > SELECT COUNT(*) FROM role_permissions ;
>> > > SELECT COUNT(*) FROM role_members ;
>> > > SELECT COUNT(*) FROM roles;
>> > >
>> > > This problem is often seen when logging in with the default cassandra
>> > > user. Within cqlsh, there is code that forces the default cassandra
>> > > user to connect by querying system_auth at QUORUM consistency. This
>> > > can be problematic in larger clusters, and is another reason why you
>> > > should never use the default cassandra user.
>> > >
>> > > -----Original Message-----
>> > > From: Jon Haddad [mailto:jon@jonhaddad.com]
>> > > Sent: Thursday, April 04, 2019 9:21 AM
>> > > To: user@cassandra.apache.org
>> > > Subject: Re: Assassinate fails
>> > >
>> > > Ken,
>> > >
>> > > Alain is right about the system tables. What you're describing only
>> > > works on non-local tables. Changing the CL doesn't help with
>> > > keyspaces that use LocalStrategy. Here's the definition of the system
>> > > keyspace:
>> > >
>> > > CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
>> > > AND durable_writes = true;
>> > >
>> > > Jon
>> > >
>> > > On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman wrote:
>> > > >
>> > > > The trick below I got from the book Mastering Cassandra. You have
>> > > > to set the consistency to ALL for it to work. I thought you guys
>> > > > knew that one.
>> > > >
>> > > > From: Alain RODRIGUEZ [mailto:arodrime@gmail.com]
>> > > > Sent: Thursday, April 04, 2019 8:46 AM
>> > > > To: user@cassandra.apache.org
>> > > > Subject: Re: Assassinate fails
>> > > >
>> > > > Hi Alex,
>> > > >
>> > > > About the previous advice:
>> > > >
>> > > > You might have inconsistent data in your system tables. Try
>> > > > setting the consistency level to ALL, then do a read query of the
>> > > > system tables to force a repair.
>> > > >
>> > > > System tables use 'LocalStrategy', thus I don't think any repair
>> > > > would happen for the system.* tables, regardless of the consistency
>> > > > level you use. It should not harm, but I really think it won't
>> > > > help.
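To see Jon's point on your own cluster, you can compare the replication of the two keyspaces directly. A small sketch (untested, cqlsh defaults assumed) for Cassandra 3.x, where system_schema exists:

  cqlsh -e "SELECT keyspace_name, replication
            FROM system_schema.keyspaces
            WHERE keyspace_name IN ('system', 'system_auth');"

  # Expected shape of the result:
  #   system      -> {'class': '...LocalStrategy'}   (per-node data, cannot be repaired)
  #   system_auth -> {'class': '...SimpleStrategy', 'replication_factor': '1'} by default
  #
  # The CONSISTENCY ALL read trick quoted above can only help keyspaces of
  # the second kind, and only once system_auth's replication factor has
  # been raised above 1.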