Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1B9EA9F3A for ; Tue, 12 Jun 2012 10:26:27 +0000 (UTC) Received: (qmail 95189 invoked by uid 500); 12 Jun 2012 10:26:25 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 94774 invoked by uid 500); 12 Jun 2012 10:26:23 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 94743 invoked by uid 99); 12 Jun 2012 10:26:23 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 12 Jun 2012 10:26:22 +0000 X-ASF-Spam-Status: No, hits=1.0 required=5.0 tests=FSL_RCVD_USER,SPF_HELO_PASS,SPF_SOFTFAIL X-Spam-Check-By: apache.org Received-SPF: softfail (nike.apache.org: transitioning domain of nicolas.lalevee@hibnet.org does not designate 216.86.168.183 as permitted sender) Received: from [216.86.168.183] (HELO mxout-08.mxes.net) (216.86.168.183) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 12 Jun 2012 10:26:15 +0000 Received: from [10.0.0.238] (unknown [92.103.90.130]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by smtp.mxes.net (Postfix) with ESMTPSA id 44D1250A5D for ; Tue, 12 Jun 2012 06:25:52 -0400 (EDT) Content-Type: text/plain; charset=iso-8859-1 Mime-Version: 1.0 (Apple Message framework v1084) Subject: Re: Dead node still being pinged From: =?iso-8859-1?Q?Nicolas_Lalev=E9e?= In-Reply-To: <373F7247-8670-4DBC-8FF3-37B3F9141092@thelastpickle.com> Date: Tue, 12 Jun 2012 12:25:51 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <7B568B55-6A44-4E06-B205-2B374C13365E@hibnet.org> References: <373F7247-8670-4DBC-8FF3-37B3F9141092@thelastpickle.com> To: user@cassandra.apache.org X-Mailer: Apple Mail (2.1084) Le 12 juin 2012 =E0 11:03, aaron morton a =E9crit : > Try purging the hints for 10.10.0.24 using the HintedHandOffManager = MBean. As far as I could tell, there were no hinted hand off to be delivered. = Nevertheless I have called "deleteHintsForEndpoint" on every node for = the two expected to be out nodes. Nothing changed, I still see packet being send to these old nodes. I looked closer to ResponsePendingTasks of MessagingService. Actually = the numbers change, between 0 and about 4. So tasks are ending but new = ones come just after. Nicolas >=20 > Cheers >=20 > ----------------- > Aaron Morton > Freelance Developer > @aaronmorton > http://www.thelastpickle.com >=20 > On 12/06/2012, at 3:33 AM, Nicolas Lalev=E9e wrote: >=20 >> finally, thanks to the groovy jmx builder, it was not that hard. >>=20 >>=20 >> Le 11 juin 2012 =E0 12:12, Samuel CARRIERE a =E9crit : >>=20 >>> If I were you, I would connect (through JMX, with jconsole) to one = of the nodes that is sending messages to an old node, and would have a = look at these MBean :=20 >>> - org.apache.net.FailureDetector : does SimpleStates looks good ? = (or do you see an IP of an old node) >>=20 >> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, = /10.10.0.25:UP, /10.10.0.27:UP] >>=20 >>> - org.apache.net.MessagingService : do you see one of the old IP = in one of the attributes ? >>=20 >> data-5: >> CommandCompletedTasks: >> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2] >> CommandPendingTasks: >> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0] >> ResponseCompletedTasks: >> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, = 10.10.0.24:1495] >> ResponsePendingTasks: >> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0] >>=20 >> data-6: >> CommandCompletedTasks: >> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102] >> CommandPendingTasks: >> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0] >> ResponseCompletedTasks: >> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, = 10.10.0.25:6367692] >> ResponsePendingTasks: >> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0] >>=20 >> data-7: >> CommandCompletedTasks: >> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168] >> CommandPendingTasks: >> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0] >> ResponseCompletedTasks: >> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, = 10.10.0.25:6094954] >> ResponsePendingTasks: >> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0] >>=20 >>> - org.apache.net.StreamingService : do you see an old IP in = StreamSources or StreamDestinations ? >>=20 >> nothing streaming on the 3 nodes. >> nodetool netstats confirmed that. >>=20 >>> - org.apache.internal.HintedHandoff : are there non-zero = ActiveCount, CurrentlyBlockedTasks, PendingTasks, TotalBlockedTask ? >>=20 >> On the 3 nodes, all at 0. >>=20 >> I don't know much what I'm looking at, but it seems that some = ResponsePendingTasks needs to end. >>=20 >> Nicolas >>=20 >>>=20 >>> Samuel=20 >>>=20 >>>=20 >>>=20 >>> Nicolas Lalev=E9e >>> 08/06/2012 21:03 >>> Veuillez r=E9pondre =E0 >>> user@cassandra.apache.org >>>=20 >>> A >>> user@cassandra.apache.org >>> cc >>> Objet >>> Re: Dead node still being pinged >>>=20 >>>=20 >>>=20 >>>=20 >>>=20 >>>=20 >>> Le 8 juin 2012 =E0 20:02, Samuel CARRIERE a =E9crit : >>>=20 >>>> I'm in the train but just a guess : maybe it's hinted handoff. A = look in the logs of the new nodes could confirm that : look for the IP = of an old node and maybe you'll find hinted handoff related messages. >>>=20 >>> I grepped on every node about every old node, I got nothing since = the "crash". >>>=20 >>> If it can be of some help, here is some grepped log of the crash: >>>=20 >>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 = 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is = down and will not receive data for re-replication of /10.10.0.22 >>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 = 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is = down and will not receive data for re-replication of /10.10.0.22 >>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 = 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is = down and will not receive data for re-replication of /10.10.0.22 >>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 = 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is = down and will not receive data for re-replication of /10.10.0.22 >>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 = 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is = down and will not receive data for re-replication of /10.10.0.22 >>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 = Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead. >>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 = Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead. >>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 = HintedHandOffManager.java (line 179) Deleting any stored hints for = /10.10.0.24 >>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 = StorageService.java (line 1157) Removing token = 127605887595351923798765477786913079296 for /10.10.0.24 >>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 = Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead. >>>=20 >>>=20 >>> Maybe its the way I have removed nodes ? AFAIR I didn't used the = decommission command. For each node I got the node down and then issue a = remove token command. >>> Here is what I can find in the log about when I removed one of them: >>>=20 >>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 = Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead. >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 = Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead. >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 = HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before = hint delivery, aborting >>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 = Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead. >>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 = HintedHandOffManager.java (line 179) Deleting any stored hints for = /10.10.0.24 >>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 = StorageService.java (line 1157) Removing token = 145835300108973619103103718265651724288 for /10.10.0.24 >>>=20 >>>=20 >>> Nicolas >>>=20 >>>=20 >>>>=20 >>>>=20 >>>> ----- Message d'origine ----- >>>> De : Nicolas Lalev=E9e [nicolas.lalevee@hibnet.org] >>>> Envoy=E9 : 08/06/2012 19:26 ZE2 >>>> =C0 : user@cassandra.apache.org >>>> Objet : Re: Dead node still being pinged >>>>=20 >>>>=20 >>>>=20 >>>> Le 8 juin 2012 =E0 15:17, Samuel CARRIERE a =E9crit : >>>>=20 >>>>> What does nodetool ring says ? (Ask every node) >>>>=20 >>>> currently, each of new node see only the tokens of the new nodes. >>>>=20 >>>>> Have you checked that the list of seeds in every yaml is correct ? >>>>=20 >>>> yes, it is correct, every of my new node point to the first of my = new node >>>>=20 >>>>> What version of cassandra are you using ? >>>>=20 >>>> Sorry I should have wrote this in my first mail. >>>> I use the 1.0.9 >>>>=20 >>>> Nicolas >>>>=20 >>>>>=20 >>>>> Samuel >>>>>=20 >>>>>=20 >>>>>=20 >>>>> Nicolas Lalev=E9e >>>>> 08/06/2012 14:10 >>>>> Veuillez r=E9pondre =E0 >>>>> user@cassandra.apache.org >>>>>=20 >>>>> A >>>>> user@cassandra.apache.org >>>>> cc >>>>> Objet >>>>> Dead node still being pinged >>>>>=20 >>>>>=20 >>>>>=20 >>>>>=20 >>>>>=20 >>>>> I had a configuration where I had 4 nodes, data-1,4. We then = bought 3 bigger machines, data-5,7. And we moved all data from data-1,4 = to data-5,7. >>>>> To move all the data without interruption of service, I added one = new node at a time. And then I removed one by one the old machines via a = "remove token". >>>>>=20 >>>>> Everything was working fine. Until there was an expected load on = our cluster, the machine started to swap and become unresponsive. We = fixed the unexpected load and the three new machines were restarted. = After that the new cassandra machines were stating that some old token = were not assigned, namely from data-2 and data-4. To fix this I issued = again some "remove token" commands. >>>>>=20 >>>>> Everything seems to be back to normal, but on the network I still = see some packet from the new cluster to the old machines. On the port = 7000. >>>>> How I can tell cassandra to completely forget about the old = machines ? >>>>>=20 >>>>> Nicolas >>>>>=20 >>>>>=20 >>>>=20 >>>=20 >>>=20 >>=20 >=20