cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thibaut (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-3226) Single faulty node brings down entire cluster. No reads/writes possible
Date Sun, 18 Sep 2011 08:48:08 GMT
Single faulty node brings down entire cluster. No reads/writes possible
-----------------------------------------------------------------------

                 Key: CASSANDRA-3226
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3226
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.8.5
         Environment: linux
            Reporter: Thibaut


No client is able to read anything from the entire cluster anymore. This occured a few times
so far, but I can't reproduce it.

Looks like an OOM directly after starting up the node? Restarting the node "solves" the issue.
I also have to kill the node with -9 because normal kill won't kill the node.

Healty nodes:

*.13:
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a     15237         416868
Responses                       n/a         0         126721

*.14.
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a     15387         437325
Responses                       n/a         0         131066
*.15:
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a     15530         368771
Responses                       n/a         0         145168


etc... The pending commands at the healty nodes all increase.


Faulty node before restart:

/software/cassandra/bin/nodetool -h localhost info
Token            : f33
Gossip active    : true
Load             : 130.67 GB
Generation No    : 1316197687
Uptime (seconds) : 137051
Heap Memory (MB) : 3580.22 / 3614.00
Data Center      : datacenter1
Rack             : rack1
Exceptions       : 108

/software/cassandra/bin/nodetool -h localhost netstats
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       29696566
Responses                       n/a       560       26650981

Log excerpt:

INFO [GossipStage:3] 2011-09-18 09:16:46,254 Gossiper.java (line 713) Node /192.168.0.11 has
restarted, now UP again
 INFO [GossipStage:3] 2011-09-18 09:16:46,255 Gossiper.java (line 681) InetAddress /192.168.0.11
is now UP
 INFO [GossipStage:3] 2011-09-18 09:16:46,255 StorageService.java (line 815) Node /192.168.0.11
state jump to normal
 INFO [GossipStage:3] 2011-09-18 09:16:46,257 StorageService.java (line 815) Node /192.168.0.11
state jump to normal
 INFO [GossipStage:3] 2011-09-18 09:16:54,984 StorageService.java (line 815) Node /192.168.0.6
state jump to normal
 INFO [GossipStage:3] 2011-09-18 09:16:54,984 Gossiper.java (line 681) InetAddress /192.168.0.6
is now UP
 INFO [GossipStage:3] 2011-09-18 09:16:56,262 StorageService.java (line 815) Node /192.168.0.18
state jump to normal
 INFO [GossipStage:3] 2011-09-18 09:16:56,263 Gossiper.java (line 681) InetAddress /192.168.0.18
is now UP
 INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 713) Node /192.168.0.1 has
restarted, now UP again
 INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 681) InetAddress /192.168.0.1
is now UP
 INFO [GossipStage:3] 2011-09-18 09:17:06,272 StorageService.java (line 815) Node /192.168.0.1
state jump to normal
 INFO [HintedHandoff:1] 2011-09-18 09:20:49,846 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.8
 INFO [HintedHandoff:1] 2011-09-18 09:20:49,847 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.8
 INFO [HintedHandoff:1] 2011-09-18 09:21:45,430 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.7
 INFO [HintedHandoff:1] 2011-09-18 09:21:45,696 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.7
 INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.20
 INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.20
 INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.9
 INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.9
 INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.3
 INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.3
 INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.17
 INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.17
 INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.11
 INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.11
 INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.6
 INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.6
ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line 139) Fatal exception
in thread Thread[Thread-549,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
 INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.3
 INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.3
 INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.17
 INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.17
 INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.11
 INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.11
 INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.6
 INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.6
ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line 139) Fatal exception
in thread Thread[Thread-549,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
 INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.18
 INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.18
ERROR [Thread-554] 2011-09-18 09:24:43,853 AbstractCassandraDaemon.java (line 139) Fatal exception
in thread Thread[Thread-554,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
ERROR [Thread-542] 2011-09-18 09:24:43,924 AbstractCassandraDaemon.java (line 139) Fatal exception
in thread Thread[Thread-542,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
 INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 323) Started
hinted handoff for endpoint /192.168.0.1
 INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 379) Finished
hinted handoff of 0 rows to endpoint /192.168.0.1
ERROR [Thread-535] 2011-09-18 09:25:03,995 AbstractCassandraDaemon.java (line 139) Fatal exception
in thread Thread[Thread-535,5,main]






If I restart the node (kill -9), everything works fine again. No OOM!.

/software/scripts# /software/cassandra/bin/nodetool -h localhost netstats
Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a         0         131351
Responses                       n/a         0         190697

/software/scripts# /software/cassandra/bin/nodetool -h localhost info
Token            : f33
Gossip active    : true
Load             : 88.54 GB
Generation No    : 1316334645
Uptime (seconds) : 494
Heap Memory (MB) : 1582.95 / 3614.00
Data Center      : datacenter1
Rack             : rack1
Exceptions       : 0

/software/scripts# /software/cassandra/bin/nodetool -h localhost ring
Address         DC          Rack        Status State   Load            Owns    Token
                                                                               ffffffffffffffff
192.168.0.1     datacenter1 rack1       Up     Normal  85.2 GB         0.02%   0cc
192.168.0.2     datacenter1 rack1       Up     Normal  86.94 GB        0.02%   199
192.168.0.3     datacenter1 rack1       Up     Normal  85.24 GB        0.02%   266
192.168.0.4     datacenter1 rack1       Up     Normal  86.38 GB        0.02%   333
192.168.0.5     datacenter1 rack1       Up     Normal  86.96 GB        0.02%   400
192.168.0.6     datacenter1 rack1       Up     Normal  86.17 GB        0.02%   4cc
192.168.0.7     datacenter1 rack1       Up     Normal  83.88 GB        0.02%   599
192.168.0.8     datacenter1 rack1       Up     Normal  84.42 GB        0.02%   666
192.168.0.9     datacenter1 rack1       Up     Normal  85.06 GB        0.02%   733
192.168.0.10    datacenter1 rack1       Up     Normal  83.08 GB        0.02%   7ff
192.168.0.11    datacenter1 rack1       Up     Normal  86.22 GB        0.02%   8cc
192.168.0.12    datacenter1 rack1       Up     Normal  85.94 GB        0.02%   999
192.168.0.13    datacenter1 rack1       Up     Normal  85.01 GB        0.02%   a66
192.168.0.14    datacenter1 rack1       Up     Normal  86.5 GB         0.02%   b33
192.168.0.15    datacenter1 rack1       Up     Normal  83.33 GB        0.02%   c00
192.168.0.16    datacenter1 rack1       Up     Normal  84.41 GB        0.02%   ccc
192.168.0.17    datacenter1 rack1       Up     Normal  86.97 GB        28.51%  d99
192.168.0.18    datacenter1 rack1       Up     Normal  112.63 GB       41.88%  e66
192.168.0.19    datacenter1 rack1       Up     Normal  88.56 GB        29.27%  f33
192.168.0.20    datacenter1 rack1       Up     Normal  85.83 GB        0.02%   ffffffffffffffff


Interestingly, after restart, the node load (from nodetool info) is reduced). Any ideas? The
node doesn't seem to have any hardware memory issues.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message