cassandra-commits mailing list archives

From "Peter Schuller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3226) Single faulty node brings down entire cluster. No reads/writes possible
Date Sun, 18 Sep 2011 09:46:09 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107404#comment-13107404 ]

Peter Schuller commented on CASSANDRA-3226:
-------------------------------------------

What is your replication factor, and what consistency level are you reading/writing at?
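
To illustrate why these two values matter, here is a minimal sketch of how many replicas must respond for a request to succeed in a single datacenter. The function names and the replication factor of 3 are hypothetical (the reporter has not stated the actual settings); only the ONE/QUORUM/ALL arithmetic is standard.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Replica responses needed for one request (single-datacenter levels only)."""
    cl = consistency_level.upper()
    if cl == "ONE":
        return 1
    if cl == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if cl == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def can_serve(consistency_level: str, replication_factor: int, responsive_replicas: int) -> bool:
    """True if enough replicas are responsive to satisfy the consistency level."""
    return responsive_replicas >= required_replicas(consistency_level, replication_factor)


# Hypothetical cluster settings: RF=3 with one unresponsive replica.
rf = 3
for cl in ("ONE", "QUORUM", "ALL"):
    print(cl, can_serve(cl, rf, responsive_replicas=rf - 1))
```

Under these assumed settings a single bad replica would only block requests at ALL, which is why the actual replication factor and consistency level are needed to interpret a cluster-wide outage.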

When you say "no client is able to read": (1) By what mechanism, and what is the error? Is
it an UnavailableException, for example? (2) How long does this persist? Is it more than about
15-30 seconds?

I assume that the load in your ring output accurately represents the balancing of the cluster
due to use of ordered partitioner, and that the 0.02% token space ownership is to be ignored.

Also, I'm not really sure what you mean about OOMs. If you get an OOM, the node should restart
as a result of exiting, so "have to restart it" doesn't seem to make sense. What do you mean
by "looks like an OOM"? What did you actually observe that makes you believe it's an OOM? Have
you seen an OOM stacktrace somewhere?
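
One observation that is available in the quoted report is the "Heap Memory (MB)" line of the nodetool info output. As a rough sketch (the helper name is hypothetical, and a nearly full heap is only a hint of memory pressure, not proof of an OOM), the used fraction before and after the restart can be computed like this:

```python
import re

# Matches a line such as "Heap Memory (MB) : 3580.22 / 3614.00".
HEAP_LINE = re.compile(r"Heap Memory \(MB\)\s*:\s*([\d.]+)\s*/\s*([\d.]+)")


def heap_usage_fraction(nodetool_info_output: str) -> float:
    """Return used/total heap parsed from `nodetool info` output."""
    match = HEAP_LINE.search(nodetool_info_output)
    if match is None:
        raise ValueError("no 'Heap Memory (MB)' line found")
    used, total = float(match.group(1)), float(match.group(2))
    return used / total


# Values copied from the report quoted below.
before_restart = "Heap Memory (MB) : 3580.22 / 3614.00"
after_restart = "Heap Memory (MB) : 1582.95 / 3614.00"

print(f"before restart: {heap_usage_fraction(before_restart):.1%}")  # 99.1%
print(f"after restart:  {heap_usage_fraction(after_restart):.1%}")   # 43.8%
```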


> Single faulty node brings down entire cluster. No reads/writes possible
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-3226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3226
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.5
>         Environment: linux
>            Reporter: Thibaut
>         Attachments: jstack
>
>
> No client is able to read anything from the entire cluster anymore. This has occurred a few times so far, but I can't reproduce it.
> Looks like an OOM directly after starting up the node? Restarting the node "solves" the issue. I also have to kill the node with -9 because a normal kill won't kill the node.
> Healthy nodes:
> *.13:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a     15237         416868
> Responses                       n/a         0         126721
> *.14:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a     15387         437325
> Responses                       n/a         0         131066
> *.15:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a     15530         368771
> Responses                       n/a         0         145168
> etc... The pending commands at the healthy nodes all increase.
> Faulty node before restart:
> /software/cassandra/bin/nodetool -h localhost info
> Token            : f33
> Gossip active    : true
> Load             : 130.67 GB
> Generation No    : 1316197687
> Uptime (seconds) : 137051
> Heap Memory (MB) : 3580.22 / 3614.00
> Data Center      : datacenter1
> Rack             : rack1
> Exceptions       : 108
> /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0       29696566
> Responses                       n/a       560       26650981
> Log excerpt:
> INFO [GossipStage:3] 2011-09-18 09:16:46,254 Gossiper.java (line 713) Node /192.168.0.11
has restarted, now UP again
>  INFO [GossipStage:3] 2011-09-18 09:16:46,255 Gossiper.java (line 681) InetAddress /192.168.0.11
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:16:46,255 StorageService.java (line 815) Node /192.168.0.11
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:46,257 StorageService.java (line 815) Node /192.168.0.11
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:54,984 StorageService.java (line 815) Node /192.168.0.6
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:54,984 Gossiper.java (line 681) InetAddress /192.168.0.6
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:16:56,262 StorageService.java (line 815) Node /192.168.0.18
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:56,263 Gossiper.java (line 681) InetAddress /192.168.0.18
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 713) Node /192.168.0.1
has restarted, now UP again
>  INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 681) InetAddress /192.168.0.1
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:17:06,272 StorageService.java (line 815) Node /192.168.0.1
state jump to normal
>  INFO [HintedHandoff:1] 2011-09-18 09:20:49,846 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.8
>  INFO [HintedHandoff:1] 2011-09-18 09:20:49,847 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.8
>  INFO [HintedHandoff:1] 2011-09-18 09:21:45,430 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.7
>  INFO [HintedHandoff:1] 2011-09-18 09:21:45,696 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.7
>  INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.20
>  INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.20
>  INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.9
>  INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.9
>  INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.3
>  INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.3
>  INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.17
>  INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.17
>  INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.11
>  INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.11
>  INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.6
>  INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.6
> ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-549,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
>  INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.18
>  INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.18
> ERROR [Thread-554] 2011-09-18 09:24:43,853 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-554,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> ERROR [Thread-542] 2011-09-18 09:24:43,924 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-542,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
>  INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.1
>  INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.1
> ERROR [Thread-535] 2011-09-18 09:25:03,995 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-535,5,main]
> If I restart the node (kill -9), everything works fine again. No OOM!
> /software/scripts# /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0         131351
> Responses                       n/a         0         190697
> /software/scripts# /software/cassandra/bin/nodetool -h localhost info
> Token            : f33
> Gossip active    : true
> Load             : 88.54 GB
> Generation No    : 1316334645
> Uptime (seconds) : 494
> Heap Memory (MB) : 1582.95 / 3614.00
> Data Center      : datacenter1
> Rack             : rack1
> Exceptions       : 0
> /software/scripts# /software/cassandra/bin/nodetool -h localhost ring
> Address         DC          Rack        Status State   Load            Owns    Token
>                                                                                ffffffffffffffff
> 192.168.0.1     datacenter1 rack1       Up     Normal  85.2 GB         0.02%   0cc
> 192.168.0.2     datacenter1 rack1       Up     Normal  86.94 GB        0.02%   199
> 192.168.0.3     datacenter1 rack1       Up     Normal  85.24 GB        0.02%   266
> 192.168.0.4     datacenter1 rack1       Up     Normal  86.38 GB        0.02%   333
> 192.168.0.5     datacenter1 rack1       Up     Normal  86.96 GB        0.02%   400
> 192.168.0.6     datacenter1 rack1       Up     Normal  86.17 GB        0.02%   4cc
> 192.168.0.7     datacenter1 rack1       Up     Normal  83.88 GB        0.02%   599
> 192.168.0.8     datacenter1 rack1       Up     Normal  84.42 GB        0.02%   666
> 192.168.0.9     datacenter1 rack1       Up     Normal  85.06 GB        0.02%   733
> 192.168.0.10    datacenter1 rack1       Up     Normal  83.08 GB        0.02%   7ff
> 192.168.0.11    datacenter1 rack1       Up     Normal  86.22 GB        0.02%   8cc
> 192.168.0.12    datacenter1 rack1       Up     Normal  85.94 GB        0.02%   999
> 192.168.0.13    datacenter1 rack1       Up     Normal  85.01 GB        0.02%   a66
> 192.168.0.14    datacenter1 rack1       Up     Normal  86.5 GB         0.02%   b33
> 192.168.0.15    datacenter1 rack1       Up     Normal  83.33 GB        0.02%   c00
> 192.168.0.16    datacenter1 rack1       Up     Normal  84.41 GB        0.02%   ccc
> 192.168.0.17    datacenter1 rack1       Up     Normal  86.97 GB        28.51%  d99
> 192.168.0.18    datacenter1 rack1       Up     Normal  112.63 GB       41.88%  e66
> 192.168.0.19    datacenter1 rack1       Up     Normal  88.56 GB        29.27%  f33
> 192.168.0.20    datacenter1 rack1       Up     Normal  85.83 GB        0.02%   ffffffffffffffff
> Interestingly, after the restart, the node load (from nodetool info) is reduced. Any ideas? The node doesn't seem to have any hardware memory issues.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
