cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefan Podkowinski (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (CASSANDRA-13058) dtest failure in hintedhandoff_test.TestHintedHandoff.hintedhandoff_decom_test
Date Mon, 09 Jan 2017 11:26:58 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-13058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Stefan Podkowinski updated CASSANDRA-13058:
-------------------------------------------
    Comment: was deleted

(was: Edit: looks like I did something wrong while reverting the mentioned commit. It does
not effect the test result. Sorry for that.
I still want to leave the rest of my comment here as it may indicate that this could be indeed
related to some buggy gossip state/failure detection handling.

-This looks like a regression caused by CASSANDRA-12192 to me-. -The test seems to work fine
without the commit.-

Take a look at this dtest log:

{{./run_dtests.py --vnodes false --nose-options -v  hintedhandoff_test.py:TestHintedHandoff.hintedhandoff_decom_test}}

{noformat}
Node 4:
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:16,748 StorageService.java:1437 - DRAINING:
starting drain process                           
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:20,011 StorageService.java:1437 - DRAINED
                                                   

Node 1:
INFO  [GossipStage:1] 2017-01-04 15:06:16,756 Gossiper.java:1036 - InetAddress /127.0.0.4
is now DOWN                                           
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket
close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket
close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket
close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 FailureDetector.java:325 - Forcing conviction
of /127.0.0.4                                       
DEBUG [GossipStage:1] 2017-01-04 15:06:16,758 Gossiper.java:370 - /127.0.0.4 marked as alive:
false                                             
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:16,758 OutboundTcpConnection.java:374
- Socket to /127.0.0.4 closed        
DEBUG [MessagingService-Outgoing-/127.0.0.4-Small] 2017-01-04 15:06:16,759 OutboundTcpConnection.java:374
- Socket to /127.0.0.4 closed
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,883 OutboundTcpConnection.java:388
- Attempting to connect to /127.0.0.4
INFO  [HANDSHAKE-/127.0.0.4] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:510 - Handshaking
version with /127.0.0.4
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:482
- Done connecting to /127.0.0.4
DEBUG [Native-Transport-Requests-3] 2017-01-04 15:06:20,825 FailureDetector.java:252 - /127.0.0.4
is ALIVE
INFO  [RMI TCP Connection(2)-127.0.0.1] 2017-01-04 15:06:52,930 StorageService.java:1435 -
LEAVING: streaming hints to other nodes              
INFO  [HintsDispatcher:2] 2017-01-04 15:06:52,955 HintsDispatchExecutor.java:152 - Transferring
all hints to 054006b5-412c-4d4e-a28a-7610574de79     d                                   
                                                                                         
                 
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:53,035 OutboundTcpConnection.java:495
- Unable to connect to /127.0.0.4
java.net.ConnectException: Connection refused
▸  at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_112]
▸  at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_112]
▸  at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_112]
▸  at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648) ~[na:1.8.0_112]
▸  at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146)
~[main/:na]
▸  at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132)
~[main/:na]
▸  at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:397)
[main/:na]
▸  at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:234)
[main/:na]
{noformat}


I've added some additional log messages, so don't get confused when you can't find some of
them in the code. What's happening here is that the gossiper will down node1 and will trigger
{{OutboundTcpConnectionPool.reset()}} via {{MessagingService.convict()}}. This will cause
a reconnect shortly, as the message should now always be retried from the backlog again. 

Unfortunately I can't see which message is affected here. But my assumption is that node4
keeps accepting messages even for a short time after gossip change propagation and will return
a response to node1, which will flag node4 alive again in the process. But I'm not exactly
sure, as the node status related code is not easy to understand with all the side effects
and unclear state handling. :(

/cc [~thobbs]
)

> dtest failure in hintedhandoff_test.TestHintedHandoff.hintedhandoff_decom_test
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13058
>             Project: Cassandra
>          Issue Type: Test
>          Components: Testing
>            Reporter: Sean McCarthy
>            Priority: Blocker
>              Labels: dtest, test-failure
>             Fix For: 3.10
>
>         Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, node2_debug.log,
node2_gc.log, node3.log, node3_debug.log, node3_gc.log, node4.log, node4_debug.log, node4_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.X_novnode_dtest/16/testReport/hintedhandoff_test/TestHintedHandoff/hintedhandoff_decom_test/
> {code}
> Error Message
> Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', ['decommission']] exited with
non-zero status; exit status: 2; 
> stderr: error: Error while decommissioning node: Failed to transfer all hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/hintedhandoff_test.py", line 167, in hintedhandoff_decom_test
>     node1.decommission()
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1314, in decommission
>     self.nodetool("decommission")
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 783, in nodetool
>     return handle_external_tool_process(p, ['nodetool', '-h', 'localhost', '-p', str(self.jmx_port),
cmd.split()])
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1993, in handle_external_tool_process
>     raise ToolError(cmd_args, rc, out, err)
> {code}{code}
> java.lang.RuntimeException: Error while decommissioning node: Failed to transfer all
hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
> 	at org.apache.cassandra.service.StorageService.decommission(StorageService.java:3924)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
> 	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
> 	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
> 	at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
> 	at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
> 	at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
> 	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
> 	at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
> 	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466)
> 	at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
> 	at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307)
> 	at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399)
> 	at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:828)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323)
> 	at sun.rmi.transport.Transport$1.run(Transport.java:200)
> 	at sun.rmi.transport.Transport$1.run(Transport.java:197)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
> 	at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$241(TCPTransport.java:683)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$284/1694175644.run(Unknown
Source)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message