cassandra-user mailing list archives

From Nicolas Lalevée <nicolas.lale...@hibnet.org>
Subject Re: Dead node still being pinged
Date Mon, 11 Jun 2012 14:46:09 GMT

On 11 June 2012, at 12:12, Samuel CARRIERE wrote:

> 
> Well, I don't see anything special in the logs. "Remove token" seems to have done its job: according to the logs, old stored hints have been deleted.
> 
> If I were you, I would connect (through JMX, with jconsole) to one of the nodes that is sending messages to an old node, and have a look at these MBeans:
>    - org.apache.net.FailureDetector: does SimpleStates look good? (or do you see the IP of an old node)
>    - org.apache.net.MessagingService: do you see one of the old IPs in one of the attributes?
>    - org.apache.net.StreamingService: do you see an old IP in StreamSources or StreamDestinations?
>    - org.apache.internal.HintedHandoff: are there non-zero ActiveCount, CurrentlyBlockedTasks, PendingTasks, TotalBlockedTask?

I feared I had to do such lookups... JMX sucks when there is some ssh tunneling to do. I'll find time to look into those. Thanks.

By the way, here is some possibly interesting info (same on every node):
root@data-5 ~ # nodetool -h data-local gossipinfo
/10.10.0.27
  LOAD:2.34205351889E11
  SCHEMA:21099fc0-978c-11e1-0000-bc70eee231ef
  RPC_ADDRESS:10.10.0.27
  STATUS:NORMAL,113427455640312814857969558651062452224
  RELEASE_VERSION:1.0.9
/10.10.0.26
  LOAD:2.64617657147E11
  SCHEMA:21099fc0-978c-11e1-0000-bc70eee231ef
  RPC_ADDRESS:10.10.0.26
  STATUS:NORMAL,56713727820156407428984779325531226112
  RELEASE_VERSION:1.0.9
/10.10.0.25
  LOAD:2.34154095981E11
  SCHEMA:21099fc0-978c-11e1-0000-bc70eee231ef
  RPC_ADDRESS:10.10.0.25
  STATUS:NORMAL,0
  RELEASE_VERSION:1.0.9
/10.10.0.24
  STATUS:removed,127605887595351923798765477786913079296,1336530323263
  REMOVAL_COORDINATOR:REMOVER,0
/10.10.0.22
  STATUS:removed,42535295865117307932921825928971026432,1336529659203
  REMOVAL_COORDINATOR:REMOVER,113427455640312814857969558651062452224
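
For anyone checking the same thing on several nodes, here is a minimal Python sketch that parses gossipinfo output of the shape shown above and lists the endpoints that gossip still carries in a removed state. The function names and the abbreviated sample are illustrative, and the parsing assumes the 1.0.9-era layout seen here (an endpoint line starting with "/", followed by KEY:value lines); other versions may print it differently.

```python
def parse_gossipinfo(text):
    """Parse `nodetool gossipinfo` text into {endpoint_ip: {KEY: value}}.

    Assumes the layout shown in this thread: an endpoint line such as
    "/10.10.0.24" followed by indented "KEY:value" lines.
    """
    endpoints = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            current = line[1:]
            endpoints[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            endpoints[current][key] = value
    return endpoints


def removed_endpoints(endpoints):
    """Return {ip: token} for endpoints whose STATUS begins with 'removed'.

    Any fields after the token are left uninterpreted.
    """
    result = {}
    for ip, attrs in endpoints.items():
        fields = attrs.get("STATUS", "").split(",")
        if fields[0] == "removed" and len(fields) > 1:
            result[ip] = fields[1]
    return result


# Abbreviated sample copied from the output above.
SAMPLE = """\
/10.10.0.25
  LOAD:2.34154095981E11
  RPC_ADDRESS:10.10.0.25
  STATUS:NORMAL,0
  RELEASE_VERSION:1.0.9
/10.10.0.24
  STATUS:removed,127605887595351923798765477786913079296,1336530323263
  REMOVAL_COORDINATOR:REMOVER,0
/10.10.0.22
  STATUS:removed,42535295865117307932921825928971026432,1336529659203
  REMOVAL_COORDINATOR:REMOVER,113427455640312814857969558651062452224
"""

if __name__ == "__main__":
    for ip, token in removed_endpoints(parse_gossipinfo(SAMPLE)).items():
        print(ip, token)
```

Feeding it the captured output from each node makes it easy to compare which old IPs each node still remembers.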


Nicolas


> 
> Samuel 
> 
> 
> 
> Nicolas Lalevée <nicolas.lalevee@hibnet.org>
> 08/06/2012 21:03
> Please reply to
> user@cassandra.apache.org
> 
> To
> user@cassandra.apache.org
> cc
> Subject
> Re: Dead node still being pinged
> 
> 
> 
> 
> 
> 
> On 8 June 2012, at 20:02, Samuel CARRIERE wrote:
> 
> > I'm on the train, but just a guess: maybe it's hinted handoff. A look in the logs of the new nodes could confirm that: look for the IP of an old node and maybe you'll find messages related to hinted handoff.
> 
> I grepped on every node for every old node, and I got nothing since the "crash".
> 
> If it can be of some help, here are some grepped log lines from the crash:
> 
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,241 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,242 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,243 StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will not receive data for re-replication of /10.10.0.22
> system.log.1: INFO [GossipStage:1] 2012-05-06 00:44:33,822 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-06 04:25:23,895 StorageService.java (line 1157) Removing token 127605887595351923798765477786913079296 for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> 
> 
> Maybe it's the way I removed the nodes? AFAIR I didn't use the decommission command. For each node, I took the node down and then issued a remove token command.
> Here is what I can find in the log from when I removed one of them:
> 
> system.log.1: INFO [GossipTasks:1] 2012-05-02 17:21:10,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:21:21,496 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [GossipStage:1] 2012-05-02 17:21:59,307 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:31:20,336 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:41:06,177 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 17:51:18,148 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:00:31,709 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:11:02,521 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:20:38,282 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:31:09,513 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:40:31,565 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 18:51:10,566 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:00:32,197 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:11:17,018 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [HintedHandoff:1] 2012-05-02 19:21:21,759 HintedHandOffManager.java (line 292) Endpoint /10.10.0.24 died before hint delivery, aborting
> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 Gossiper.java (line 818) InetAddress /10.10.0.24 is now dead.
> system.log.1: INFO [OptionalTasks:1] 2012-05-02 20:05:57,281 HintedHandOffManager.java (line 179) Deleting any stored hints for /10.10.0.24
> system.log.1: INFO [GossipStage:1] 2012-05-02 20:05:57,281 StorageService.java (line 1157) Removing token 145835300108973619103103718265651724288 for /10.10.0.24
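
The grep over old-node IPs described above can be sketched in Python as follows. This is only an illustration: the regular expression assumes the log line layout seen in these excerpts (level, thread in brackets, timestamp, source file, line number, message) without the "system.log.1:" prefix that grep adds, and the names `LOG_LINE`, `mentions` and `summarize` are made up for the example.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this thread, e.g.
#  INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) ...
LOG_LINE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\s+"
    r"(?P<source>\S+)\s+\(line \d+\)\s+(?P<msg>.*)$"
)


def mentions(lines, ip):
    """Yield (timestamp, source, message) for lines whose message names /ip."""
    # The lookahead keeps 10.10.0.2 from matching inside 10.10.0.24.
    needle = re.compile(r"/" + re.escape(ip) + r"(?!\d)")
    for line in lines:
        m = LOG_LINE.match(line)
        if m and needle.search(m.group("msg")):
            yield m.group("ts"), m.group("source"), m.group("msg")


def summarize(lines, ip):
    """Return (hit count per source file, latest timestamp or None)."""
    hits = list(mentions(lines, ip))
    by_source = Counter(source for _, source, _ in hits)
    last_seen = max((ts for ts, _, _ in hits), default=None)
    return by_source, last_seen


# A few lines taken from the excerpts above.
SAMPLE = [
    " WARN [RMI TCP Connection(1037)-10.10.0.26] 2012-05-06 00:39:30,241 "
    "StorageService.java (line 2417) Endpoint /10.10.0.24 is down and will "
    "not receive data for re-replication of /10.10.0.22",
    " INFO [GossipStage:1] 2012-05-06 04:25:23,894 Gossiper.java (line 818) "
    "InetAddress /10.10.0.24 is now dead.",
    " INFO [OptionalTasks:1] 2012-05-06 04:25:23,895 HintedHandOffManager.java "
    "(line 179) Deleting any stored hints for /10.10.0.24",
    " INFO [GossipStage:1] 2012-05-09 04:26:25,015 Gossiper.java (line 818) "
    "InetAddress /10.10.0.24 is now dead.",
]

if __name__ == "__main__":
    counts, last = summarize(SAMPLE, "10.10.0.24")
    print(dict(counts), last)
```

Running `summarize` per old IP against each node's system.log shows which component last mentioned the dead endpoint and when.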
> 
> 
> Nicolas
> 
> 
> > 
> > 
> > ----- Original message -----
> > From: Nicolas Lalevée [nicolas.lalevee@hibnet.org]
> > Sent: 08/06/2012 19:26 ZE2
> > To: user@cassandra.apache.org
> > Subject: Re: Dead node still being pinged
> > 
> > 
> > 
> > On 8 June 2012, at 15:17, Samuel CARRIERE wrote:
> > 
> >> What does nodetool ring say? (Ask every node)
> > 
> > Currently, each of the new nodes sees only the tokens of the new nodes.
> > 
> >> Have you checked that the list of seeds in every yaml is correct?
> > 
> > Yes, it is correct: every one of my new nodes points to the first of my new nodes.
> > 
> >> What version of cassandra are you using?
> > 
> > Sorry, I should have written this in my first mail.
> > I use 1.0.9.
> > 
> > Nicolas
> > 
> >> 
> >> Samuel
> >> 
> >> 
> >> 
> >> Nicolas Lalevée <nicolas.lalevee@hibnet.org>
> >> 08/06/2012 14:10
> >> Please reply to
> >> user@cassandra.apache.org
> >> 
> >> To
> >> user@cassandra.apache.org
> >> cc
> >> Subject
> >> Dead node still being pinged
> >> 
> >> 
> >> 
> >> 
> >> 
> >> I had a configuration with 4 nodes, data-1 to data-4. We then bought 3 bigger machines, data-5 to data-7, and we moved all data from the old nodes to the new ones.
> >> To move all the data without interruption of service, I added one new node at a time. Then I removed the old machines one by one via a "remove token".
> >> 
> >> Everything was working fine until there was an unexpected load on our cluster: the machines started to swap and became unresponsive. We fixed the unexpected load and the three new machines were restarted. After that, the new cassandra machines were stating that some old tokens were not assigned, namely those from data-2 and data-4. To fix this I again issued some "remove token" commands.
> >> 
> >> Everything seems to be back to normal, but on the network I still see some packets from the new cluster to the old machines, on port 7000.
> >> How can I tell cassandra to completely forget about the old machines?
> >> 
> >> Nicolas
> >> 
> >> 
> > 
> 
> 

