cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thibaut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3226) Single faulty node brings down entire cluster. No reads/writes possible
Date Mon, 19 Sep 2011 13:53:09 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107824#comment-13107824
] 

Thibaut commented on CASSANDRA-3226:
------------------------------------

Do I have to enable read repair for the failover to work? Aren't host scores being updated
if a command/request doesn't get a valid response? 

I checked the mbean of the dynamic snitch and all hosts have sound values (now where everything
is working). Stopping one node also doesn't trigger this problem, and requests are being routed
to different hosts. The problem only occurs e.g. if the node is in above state, and does seem
to sent back invalid responses? (The handled requests counter increased very fast).

> Single faulty node brings down entire cluster. No reads/writes possible
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-3226
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3226
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.5
>         Environment: linux
>            Reporter: Thibaut
>         Attachments: jstack
>
>
> No client is able to read anything from the entire cluster anymore. This occured a few
times so far, but I can't reproduce it.
> Looks like an OOM directly after starting up the node? Restarting the node "solves" the
issue. I also have to kill the node with -9 because normal kill won't kill the node.
> Healty nodes:
> *.13:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a     15237         416868
> Responses                       n/a         0         126721
> *.14.
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a     15387         437325
> Responses                       n/a         0         131066
> *.15:
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a     15530         368771
> Responses                       n/a         0         145168
> etc... The pending commands at the healty nodes all increase.
> Faulty node before restart:
> /software/cassandra/bin/nodetool -h localhost info
> Token            : f33
> Gossip active    : true
> Load             : 130.67 GB
> Generation No    : 1316197687
> Uptime (seconds) : 137051
> Heap Memory (MB) : 3580.22 / 3614.00
> Data Center      : datacenter1
> Rack             : rack1
> Exceptions       : 108
> /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0       29696566
> Responses                       n/a       560       26650981
> Log excerpt:
> INFO [GossipStage:3] 2011-09-18 09:16:46,254 Gossiper.java (line 713) Node /192.168.0.11
has restarted, now UP again
>  INFO [GossipStage:3] 2011-09-18 09:16:46,255 Gossiper.java (line 681) InetAddress /192.168.0.11
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:16:46,255 StorageService.java (line 815) Node /192.168.0.11
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:46,257 StorageService.java (line 815) Node /192.168.0.11
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:54,984 StorageService.java (line 815) Node /192.168.0.6
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:54,984 Gossiper.java (line 681) InetAddress /192.168.0.6
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:16:56,262 StorageService.java (line 815) Node /192.168.0.18
state jump to normal
>  INFO [GossipStage:3] 2011-09-18 09:16:56,263 Gossiper.java (line 681) InetAddress /192.168.0.18
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 713) Node /192.168.0.1
has restarted, now UP again
>  INFO [GossipStage:3] 2011-09-18 09:17:06,272 Gossiper.java (line 681) InetAddress /192.168.0.1
is now UP
>  INFO [GossipStage:3] 2011-09-18 09:17:06,272 StorageService.java (line 815) Node /192.168.0.1
state jump to normal
>  INFO [HintedHandoff:1] 2011-09-18 09:20:49,846 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.8
>  INFO [HintedHandoff:1] 2011-09-18 09:20:49,847 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.8
>  INFO [HintedHandoff:1] 2011-09-18 09:21:45,430 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.7
>  INFO [HintedHandoff:1] 2011-09-18 09:21:45,696 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.7
>  INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.20
>  INFO [HintedHandoff:1] 2011-09-18 09:21:52,432 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.20
>  INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.9
>  INFO [HintedHandoff:1] 2011-09-18 09:22:12,469 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.9
>  INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.3
>  INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.3
>  INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.17
>  INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.17
>  INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.11
>  INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.11
>  INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.6
>  INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.6
> ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-549,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
>  INFO [HintedHandoff:1] 2011-09-18 09:23:05,202 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.3
>  INFO [HintedHandoff:1] 2011-09-18 09:23:05,203 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.3
>  INFO [HintedHandoff:1] 2011-09-18 09:23:08,611 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.17
>  INFO [HintedHandoff:1] 2011-09-18 09:23:08,612 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.17
>  INFO [HintedHandoff:1] 2011-09-18 09:23:22,687 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.11
>  INFO [HintedHandoff:1] 2011-09-18 09:23:22,688 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.11
>  INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.6
>  INFO [HintedHandoff:1] 2011-09-18 09:24:13,051 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.6
> ERROR [Thread-549] 2011-09-18 09:24:26,806 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-549,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
>  INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.18
>  INFO [HintedHandoff:1] 2011-09-18 09:24:34,488 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.18
> ERROR [Thread-554] 2011-09-18 09:24:43,853 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-554,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
> ERROR [Thread-542] 2011-09-18 09:24:43,924 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-542,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:133)
>  INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 323)
Started hinted handoff for endpoint /192.168.0.1
>  INFO [HintedHandoff:1] 2011-09-18 09:25:03,601 HintedHandOffManager.java (line 379)
Finished hinted handoff of 0 rows to endpoint /192.168.0.1
> ERROR [Thread-535] 2011-09-18 09:25:03,995 AbstractCassandraDaemon.java (line 139) Fatal
exception in thread Thread[Thread-535,5,main]
> If I restart the node (kill -9), everything works fine again. No OOM!.
> /software/scripts# /software/cassandra/bin/nodetool -h localhost netstats
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0         131351
> Responses                       n/a         0         190697
> /software/scripts# /software/cassandra/bin/nodetool -h localhost info
> Token            : f33
> Gossip active    : true
> Load             : 88.54 GB
> Generation No    : 1316334645
> Uptime (seconds) : 494
> Heap Memory (MB) : 1582.95 / 3614.00
> Data Center      : datacenter1
> Rack             : rack1
> Exceptions       : 0
> /software/scripts# /software/cassandra/bin/nodetool -h localhost ring
> Address         DC          Rack        Status State   Load            Owns    Token
>                                                                                ffffffffffffffff
> 192.168.0.1     datacenter1 rack1       Up     Normal  85.2 GB         0.02%   0cc
> 192.168.0.2     datacenter1 rack1       Up     Normal  86.94 GB        0.02%   199
> 192.168.0.3     datacenter1 rack1       Up     Normal  85.24 GB        0.02%   266
> 192.168.0.4     datacenter1 rack1       Up     Normal  86.38 GB        0.02%   333
> 192.168.0.5     datacenter1 rack1       Up     Normal  86.96 GB        0.02%   400
> 192.168.0.6     datacenter1 rack1       Up     Normal  86.17 GB        0.02%   4cc
> 192.168.0.7     datacenter1 rack1       Up     Normal  83.88 GB        0.02%   599
> 192.168.0.8     datacenter1 rack1       Up     Normal  84.42 GB        0.02%   666
> 192.168.0.9     datacenter1 rack1       Up     Normal  85.06 GB        0.02%   733
> 192.168.0.10    datacenter1 rack1       Up     Normal  83.08 GB        0.02%   7ff
> 192.168.0.11    datacenter1 rack1       Up     Normal  86.22 GB        0.02%   8cc
> 192.168.0.12    datacenter1 rack1       Up     Normal  85.94 GB        0.02%   999
> 192.168.0.13    datacenter1 rack1       Up     Normal  85.01 GB        0.02%   a66
> 192.168.0.14    datacenter1 rack1       Up     Normal  86.5 GB         0.02%   b33
> 192.168.0.15    datacenter1 rack1       Up     Normal  83.33 GB        0.02%   c00
> 192.168.0.16    datacenter1 rack1       Up     Normal  84.41 GB        0.02%   ccc
> 192.168.0.17    datacenter1 rack1       Up     Normal  86.97 GB        28.51%  d99
> 192.168.0.18    datacenter1 rack1       Up     Normal  112.63 GB       41.88%  e66
> 192.168.0.19    datacenter1 rack1       Up     Normal  88.56 GB        29.27%  f33
> 192.168.0.20    datacenter1 rack1       Up     Normal  85.83 GB        0.02%   ffffffffffffffff
> Interestingly, after restart, the node load (from nodetool info) is reduced). Any ideas?
The node doesn't seem to have any hardware memory issues.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message