cassandra-commits mailing list archives

From "Mariusz Gronczewski (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5154) Gossip sends removed node which causes restarted nodes to constantly create new threads
Date Tue, 15 Jan 2013 14:28:21 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553836#comment-13553836
] 

Mariusz Gronczewski commented on CASSANDRA-5154:
------------------------------------------------

I ran a simple experiment: I started 3 dev servers, decommissioned one, restarted Cassandra
on one of the remaining nodes, and then moved the clock on both remaining nodes into the future.

Steps to reproduce:
* start 3 Cassandra nodes
* on node 3, run 'nodetool decommission'
* wait a week, or move the clock a week into the future
* restart node 2

restarted node:
{code}
[15:15:18]dev41:~☠ /etc/init.d/cassandra restart
Shutdown Cassandra:  .  . done
Starting Cassandra: OK
[15:15:24]dev41:~☠ nodetool gossipinfo
/10.0.100.51
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,56713727820156407428984779325531226112
  RELEASE_VERSION:1.1.7
  LOAD:2.671287941E9
/10.0.100.52
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  RPC_ADDRESS:0.0.0.0
  STATUS:LEFT,149795726937939425132952703051633828837,1358518463257
  RELEASE_VERSION:1.1.7
  LOAD:6.23579072E8
/10.0.100.50
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,0
  RELEASE_VERSION:1.1.7
  LOAD:1.369775166E9

[15:15:40]dev41:~☠ date
Tue Jan 15 15:16:07 CET 2013
[15:16:20]dev41:~☠ date 02011516
Fri Feb  1 15:16:00 CET 2013
[15:16:00]dev41:~☠ nodetool gossipinfo
/10.0.100.51
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,56713727820156407428984779325531226112
  RELEASE_VERSION:1.1.7
  LOAD:2.671287941E9
/10.0.100.50
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,0
  RELEASE_VERSION:1.1.7
  LOAD:1.369775166E9
{code}
On the restarted node, the decommissioned node (10.0.100.52) disappears from gossip after the
time change and, as on our production cluster, the number of threads slowly increases.
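
The third field of the STATUS:LEFT value above looks like a timestamp in epoch milliseconds, possibly an expiry time for the gossip entry. That reading is an assumption based on its magnitude and is not confirmed in this comment. A minimal Python sketch, using a hypothetical helper named `left_expiry`, decodes the field so it can be compared with the two clock settings from the transcript:

```python
from datetime import datetime, timezone

def left_expiry(status_value):
    """Decode the third field of a 'LEFT,<token>,<millis>' value as a UTC datetime.

    Assumption: the third field is a timestamp in epoch milliseconds.
    """
    state, _token, millis = status_value.split(",")
    if state != "LEFT":
        raise ValueError("not a LEFT status: " + status_value)
    # Integer division drops the millisecond part and avoids float rounding.
    return datetime.fromtimestamp(int(millis) // 1000, tz=timezone.utc)

expiry = left_expiry("LEFT,149795726937939425132952703051633828837,1358518463257")

# Clock values taken from the `date` output above, converted from CET (UTC+1) to UTC.
before_change = datetime(2013, 1, 15, 14, 16, 7, tzinfo=timezone.utc)
after_change = datetime(2013, 2, 1, 14, 16, 0, tzinfo=timezone.utc)

print(expiry.isoformat())
print(before_change < expiry < after_change)
```

The value decodes to 2013-01-18 14:14:23 UTC, which lies between the real clock and the moved clock. If the field is indeed an expiry time, that would be consistent with the entry vanishing on the restarted node once the clock jumps past it.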

On the non-restarted node, 10.0.100.52 stays in gossip:
{code}
[15:15:06]dev40:~☠ nodetool gossipinfo
/10.0.100.51
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  LOAD:2.816511381E9
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,56713727820156407428984779325531226112
/10.0.100.52
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  LOAD:6.23579072E8
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  STATUS:LEFT,149795726937939425132952703051633828837,1358518463257
/10.0.100.50
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  LOAD:1.369775166E9
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,0

[15:15:29]dev40:~☠  date 02011516
Fri Feb  1 15:16:00 CET 2013
[15:16:00]dev40:~☠ nodetool gossipinfo
/10.0.100.51
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  LOAD:2.671287941E9
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,56713727820156407428984779325531226112
/10.0.100.52
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  LOAD:6.23579072E8
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  STATUS:LEFT,149795726937939425132952703051633828837,1358518463257
/10.0.100.50
  SCHEMA:38349a62-3e49-3b99-84af-f675dbdc3137
  LOAD:1.369775166E9
  RELEASE_VERSION:1.1.7
  RPC_ADDRESS:0.0.0.0
  STATUS:NORMAL,0
{code}
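
The comparison between the two nodes can be automated. The sketch below is illustrative only: `parse_gossipinfo` and `missing_endpoints` are hypothetical helper names, and the parser assumes the `nodetool gossipinfo` layout shown above (an endpoint line starting with `/`, followed by indented `KEY:value` lines). The sample input is abbreviated from the outputs above.

```python
def parse_gossipinfo(text):
    """Parse gossipinfo-style text into {endpoint: {key: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if line.startswith("/"):
            # Endpoint header, e.g. "/10.0.100.52"
            current = line.strip().lstrip("/")
            nodes[current] = {}
        elif line.strip() and current is not None:
            # Indented "KEY:value" line belonging to the current endpoint
            key, _, value = line.strip().partition(":")
            nodes[current][key] = value
    return nodes

def missing_endpoints(reference, other):
    """Endpoints present in `reference` but absent from `other`, sorted."""
    return sorted(set(reference) - set(other))

# Abbreviated from the outputs above (only STATUS lines kept).
non_restarted = parse_gossipinfo("""/10.0.100.51
  STATUS:NORMAL,56713727820156407428984779325531226112
/10.0.100.52
  STATUS:LEFT,149795726937939425132952703051633828837,1358518463257
/10.0.100.50
  STATUS:NORMAL,0
""")
restarted = parse_gossipinfo("""/10.0.100.51
  STATUS:NORMAL,56713727820156407428984779325531226112
/10.0.100.50
  STATUS:NORMAL,0
""")

print(missing_endpoints(non_restarted, restarted))
```

Run against the two post-time-change outputs, this reports 10.0.100.52 as the only endpoint that the non-restarted node still gossips about and the restarted node does not.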

                
> Gossip sends removed node which causes restarted nodes to constantly create new threads
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5154
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5154
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.7
>         Environment: centos 6, JVM 1.6.0_37
>            Reporter: Mariusz Gronczewski
>
> Our Cassandra cluster had 14 nodes but it was mostly idle, so about 2 weeks ago we removed
> 3 of them (via standard decommission) and moved tokens to balance load.
> Since then no node had been restarted, but last week, after restarting 2 of them, we observed
> that both of them spawn threads (WRITE-/1.2.3.4, where 1.2.3.4 is the IP of one of the removed
> nodes) until they hit the limit (800 on our system), and then Cassandra dies. Nodes that were
> not restarted do not do that. There are no outgoing connections to those dead nodes.
> I noticed the dead nodes are still in nodetool gossipinfo on non-restarted nodes but not
> on restarted ones, so it seems they are not properly removed from gossip.
> Would a rolling restart fix this, or is a full cluster stop-start required?
> Trace from the hanging threads:
> {code}
>  "WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8194000 nid=0x2fb2 waiting on condition [0x00007f6020de0000]
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for <0x00000007536a1160> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
> 	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
> 	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:104)
> {code}
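
To watch the thread growth described in the issue, the WRITE-/<ip> threads in a thread dump can be counted per destination. This is a rough sketch only: `count_write_threads` is a hypothetical helper, it assumes jstack-style header lines that begin with the quoted thread name, and every line in the sample dump except the first is made up for illustration.

```python
import re
from collections import Counter

# A thread header line starts with the thread name in double quotes.
THREAD_HEADER = re.compile(r'^\s*"([^"]+)"')

def count_write_threads(dump):
    """Count 'WRITE-/<ip>' threads per IP in a jstack-style thread dump."""
    counts = Counter()
    for line in dump.splitlines():
        match = THREAD_HEADER.match(line)
        if match and match.group(1).startswith("WRITE-/"):
            counts[match.group(1)[len("WRITE-/"):]] += 1
    return counts

# First header is from the trace above; the other headers are invented sample data.
sample = '''"WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8194000 nid=0x2fb2 waiting on condition [0x00007f6020de0000]
   java.lang.Thread.State: WAITING (parking)
"WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8195000 nid=0x2fb3 waiting on condition [0x00007f6020de1000]
"main" prio=10 tid=0x00007f5fe8000000 nid=0x2f00 runnable [0x00007f6020000000]
'''

print(count_write_threads(sample))
```

A count for a decommissioned node's IP that keeps rising between successive dumps would match the behaviour reported above.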

