Subject: Re: Dead node still being pinged
From: Nicolas Lalevée <nicolas.lalevee@hibnet.org>
Date: Tue, 12 Jun 2012 12:25:51 +0200
To: user@cassandra.apache.org

On 12 June 2012 at 11:03, aaron morton wrote:

> Try purging the hints for 10.10.0.24 using the HintedHandOffManager MBean.

As far as I could tell, there were no hinted handoffs waiting to be delivered. Nevertheless, I called "deleteHintsForEndpoint" on every node for the two nodes that are supposed to be gone.
Nothing changed: I still see packets being sent to these old nodes.

I looked more closely at the ResponsePendingTasks of MessagingService. The numbers actually change, between 0 and about 4. So tasks are completing, but new ones arrive right after.
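
To make that check repeatable, here is a minimal Python sketch that cross-references the two attribute values as pasted from JMX. It assumes the text layout shown in this thread for SimpleStates and ResponsePendingTasks; the function names are hypothetical, and it only parses strings, it does not connect to JMX itself.

```python
import re


def parse_simple_states(text):
    """Parse '[/10.10.0.22:DOWN, /10.10.0.26:UP]' into {ip: state}."""
    return dict(re.findall(r"/?([\d.]+):(UP|DOWN)", text))


def parse_task_counts(text):
    """Parse '[10.10.0.22:4, 10.10.0.26:0]' into {ip: count}."""
    return {ip: int(n) for ip, n in re.findall(r"([\d.]+):(\d+)", text)}


def pending_to_down_nodes(simple_states, pending_tasks):
    """Return {ip: pending} for endpoints that are DOWN yet have pending tasks."""
    states = parse_simple_states(simple_states)
    counts = parse_task_counts(pending_tasks)
    return {
        ip: n
        for ip, n in counts.items()
        if states.get(ip) == "DOWN" and n > 0
    }
```

Fed with the SimpleStates value and the data-7 ResponsePendingTasks value quoted below, it reports only the two old nodes, 10.10.0.22 and 10.10.0.24.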

Nicolas

>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 12/06/2012, at 3:33 AM, Nicolas Lalevée wrote:
>
>> Finally, thanks to the Groovy JMX builder, it was not that hard.
>>
>>
>> On 11 June 2012 at 12:12, Samuel CARRIERE wrote:
>>
>>> If I were you, I would connect (through JMX, with jconsole) to one of the nodes that is sending messages to an old node, and have a look at these MBeans:
>>>   - org.apache.cassandra.net.FailureDetector: does SimpleStates look good? (or do you see the IP of an old node?)
>>
>> SimpleStates:[/10.10.0.22:DOWN, /10.10.0.24:DOWN, /10.10.0.26:UP, /10.10.0.25:UP, /10.10.0.27:UP]
>>
>>>   - org.apache.cassandra.net.MessagingService: do you see one of the old IPs in one of the attributes?
>>
>> data-5:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.26:6147307, 10.10.0.27:6084684, 10.10.0.24:2]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1487, 10.10.0.26:6187204, 10.10.0.27:6062890, 10.10.0.24:1495]
>> ResponsePendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.27:0, 10.10.0.24:0]
>>
>> data-6:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.27:6064992, 10.10.0.24:2, 10.10.0.25:6308102]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:0, 10.10.0.25:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1463, 10.10.0.27:6067943, 10.10.0.24:1474, 10.10.0.25:6367692]
>> ResponsePendingTasks:
>> [10.10.0.22:0, 10.10.0.27:0, 10.10.0.24:2, 10.10.0.25:0]
>>
>> data-7:
>> CommandCompletedTasks:
>> [10.10.0.22:2, 10.10.0.26:6043653, 10.10.0.24:2, 10.10.0.25:5964168]
>> CommandPendingTasks:
>> [10.10.0.22:0, 10.10.0.26:0, 10.10.0.24:0, 10.10.0.25:0]
>> ResponseCompletedTasks:
>> [10.10.0.22:1424, 10.10.0.26:6090251, 10.10.0.24:1431, 10.10.0.25:6094954]
>> ResponsePendingTasks:
>> [10.10.0.22:4, 10.10.0.26:0, 10.10.0.24:1, 10.10.0.25:0]
>>
>>>   - org.apache.cassandra.net.StreamingService: do you see an old IP in StreamSources or StreamDestinations?
>>
>> nothing streaming on the 3 nodes.
>> nodetool netstats confirmed that.
>>
>>>   - org.apache.cassandra.internal.HintedHandoff: are there non-zero ActiveCount, CurrentlyBlockedTasks, PendingTasks, TotalBlockedTasks?
>>
>> On the 3 nodes, all at 0.
>>
>> I don't really know what I'm looking at, but it seems that some ResponsePendingTasks need to finish.
>>
>> Nicolas
>>
>>>
>>> Samuel
>>>
>>> From: Nicolas Lalevée <nicolas.lalevee@hibnet.org>
>>> Date: 08/06/2012 21:03
>>> Reply-To: user@cassandra.apache.org
>>> To: user@cassandra.apache.org
>>> Subject: Re: Dead node still being pinged
>>>
>>> On 8 June 2012 at 20:02, Samuel CARRIERE wrote:
>>>
>>>> I'm on the train, but just a guess: maybe it's hinted handoff. A look in the logs of the new nodes could confirm that: look for the IP of an old node and maybe you'll find hinted handoff related messages.
>>>
>>> I grepped the logs on every node for every old node, and found nothing since the "crash".
>>>
>>> If it can be of some help, here are some grepped log lines from the crash:
>>>
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 StorageService.java (line 1157) Removing token 127605887595351923798765477786913079296 for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>>
>>>
>>> Maybe it's the way I removed the nodes? AFAIR I didn't use the decommission command. For each node, I took the node down and then issued a "remove token" command.
>>> Here is what I can find in the log from when I removed one of them:
>>>
>>> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
>>> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
>>> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 StorageService.java (line 1157) Removing token 145835300108973619103103718265651724288 for /10.10.0.24
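
Grepped excerpts like the ones above get long quickly, so a small summary per source file can help spot which component still mentions an old endpoint. The following is only a sketch: it assumes the line layout visible in these excerpts (level, thread in brackets, timestamp, source file, line number, message), and `summarize` is a hypothetical helper name, not something shipped with Cassandra.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this thread:
# LEVEL [thread] YYYY-MM-DD HH:MM:SS,mmm Source.java (line N) message
LOG_LINE = re.compile(
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<source>\S+\.java) \(line \d+\)\s+(?P<message>.*)"
)


def summarize(lines, endpoint):
    """Count log lines whose message mentions `endpoint`, per (source, level)."""
    # The lookahead keeps '/10.10.0.24' from also matching '/10.10.0.241'.
    wanted = re.compile(re.escape(endpoint) + r"(?!\d)")
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)  # search() tolerates a 'system.log.1: ' grep prefix
        if m and wanted.search(m.group("message")):
            counts[(m.group("source"), m.group("level"))] += 1
    return counts
```

Calling `summarize(open("system.log.1"), "/10.10.0.24")` would then give one count per source file and level, which makes it easier to see whether anything other than Gossiper and HintedHandOffManager still refers to the removed node.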
>>>
>>>
>>> Nicolas
>>>
>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Nicolas Lalevée [nicolas.lalevee@hibnet.org]
>>>> Sent: 08/06/2012 19:26 ZE2
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: Dead node still being pinged
>>>>
>>>>
>>>>
>>>> On 8 June 2012 at 15:17, Samuel CARRIERE wrote:
>>>>
>>>>> What does nodetool ring say? (Ask every node)
>>>>
>>>> Currently, each new node sees only the tokens of the new nodes.
>>>>
>>>>> Have you checked that the list of seeds in every yaml is correct?
>>>>
>>>> Yes, it is correct: every one of my new nodes points to the first of my new nodes.
>>>>
>>>>> What version of Cassandra are you using?
>>>>
>>>> Sorry, I should have written this in my first mail.
>>>> I use 1.0.9.
>>>>
>>>> Nicolas
>>>>
>>>>>
>>>>> Samuel
>>>>>
>>>>> From: Nicolas Lalevée <nicolas.lalevee@hibnet.org>
>>>>> Date: 08/06/2012 14:10
>>>>> Reply-To: user@cassandra.apache.org
>>>>> To: user@cassandra.apache.org
>>>>> Subject: Dead node still being pinged
>>>>>
>>>>> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3 bigger machines, data-5 to data-7, and moved all the data from the old nodes to the new ones.
>>>>> To move all the data without interrupting service, I added one new node at a time, and then removed the old machines one by one via a "remove token".
>>>>>
>>>>> Everything was working fine until there was an unexpected load on our cluster: the machines started to swap and became unresponsive. We fixed the unexpected load and the three new machines were restarted. After that, the new Cassandra machines reported that some old tokens were not assigned, namely those of data-2 and data-4. To fix this, I again issued some "remove token" commands.
>>>>>
>>>>> Everything seems to be back to normal, but on the network I still see some packets going from the new cluster to the old machines, on port 7000.
>>>>> How can I tell Cassandra to completely forget about the old machines?
>>>>>
>>>>> Nicolas
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>