cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dinesh Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-14362) Node unable to join cluster in trunk
Date Sun, 08 Apr 2018 11:08:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429722#comment-16429722
] 

Dinesh Joshi commented on CASSANDRA-14362:
------------------------------------------

[~jasobrown] / [~aweisberg] I think I know what the issue is. [~jasobrown] not only changed
the {{OutboundConnectionIdentifier}} to prefer the local address of the node but also removed
the binding to a specific local IP address during {{NettyFactory#createOutboundBootstrap}}
setup. This had an unintended consequence. When you leave the {{Socket}}'s source IP (local
address) unset, the kernel will choose one for you. This normally is not an issue. However,
during the dtest run all nodes are local and rely on 127.0.0.x IPs. When the C* nodes try
to setup the streaming connections, they get tripped up due to this change.

Say node1 (127.0.0.1) & node2 (127.0.0.2) are started up by the dtest (I know in reality
its 3 nodes but for the sake of simplicity I'm limiting it to two nodes). node1 attempts to
create an outbound connection for streaming with node2, the outbound socket should actually
be {{127.0.0.1:ANY_PORT -> 127.0.0.2:INTERNODE_PORT}}. However, per this patch, we leave
the source IP unbound. When the connection is established and node1 sends the {{FirstHandshakeMessage}},
it has a source IP set to set to an arbitrarily chose IP by the kernel - in this case it was
{{127.0.0.2}} (when it should actually be {{127.0.0.1}}).

The receiving node (node2) gets this message {{InboundHandshakeHandler#handleStart}}, decodes
it fine but when it goes to {{InboundHandshakeHandler#setupStreamingPipeline}}, it creates
a {{StreamingInboundHandler}} with remote IP set to {{127.0.0.2}}. This messes up the streaming
pipeline because now node2 is waiting on {{127.0.0.2}} (itself) for data. Hence we find it
stuck in the bootstrap phase -
{noformat}
"Stream-Deserializer-127.0.0.2:7000-4004ef3b" #133 daemon prio=5 os_prio=35 tid=0x00007ff58c601ca0
nid=0xdc03 waiting on condition [0x0000700007613000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at java.lang.Thread.sleep(Thread.java:340)
	at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
	at com.google.common.util.concurrent.Uninterruptibles.sleepUninterruptibly(Uninterruptibles.java:285)
	at org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:174)
	at java.lang.Thread.run(Thread.java:748)

"main" #1 prio=5 os_prio=35 tid=0x00007ff58c706000 nid=0x2803 waiting on condition [0x0000700004e5e000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000007a11a3f98> (a org.apache.cassandra.streaming.StreamResultFuture)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:497)
	at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1519)
	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:949)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:651)
	- locked <0x00000007a40f80b8> (a org.apache.cassandra.service.StorageService)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:586)
	- locked <0x00000007a40f80b8> (a org.apache.cassandra.service.StorageService)
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:367)
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:590)
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:679)
{noformat}
I also noted this error in the logs which further points to the wrong IP for the peer being
used -
{noformat}
ERROR [Stream-Deserializer-127.0.0.2:53830-7b781083] 2018-04-06 13:19:25,863 StreamingInboundHandler.java:210
- [Stream channel: 7b781083] stream operation from 127.0.0.2:53830 failed
java.lang.IllegalArgumentException: Unknown peer requested: 127.0.0.2:7000
        at org.apache.cassandra.streaming.StreamCoordinator.getHostData(StreamCoordinator.java:242)
        at org.apache.cassandra.streaming.StreamCoordinator.getSessionById(StreamCoordinator.java:170)
        at org.apache.cassandra.streaming.StreamResultFuture.getSession(StreamResultFuture.java:237)
        at org.apache.cassandra.streaming.StreamManager.findSession(StreamManager.java:194)
        at org.apache.cassandra.streaming.StreamManager.findSession(StreamManager.java:185)
        at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:41)
        at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55)
        at org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:177)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
I added a bit of debugging code and turned {{TRACE}} logging on and confirmed that this was
indeed the case.

Below is the output of node2's debug.log in the 3 node cluster that dtest run created. node1,
2 & 3 were assigned IPs {{127.0.0.1, 127.0.0.2, 127.0.0.3}} respectively. Note that node2's
actual IP is never used as we've left it unbound so it seems the OS kernel prefers the IP
that it thinks is closest to the remote IP so channel local & remote as the showing up
as the same IP.
{noformat}
03:07 $ tail -F /var/folders/5f/_29kl4f524l7gp__02cxby_c0000gn/T/dtest-wu9pai7a/test/node2/logs/debug.log
 | grep handshake
TRACE [MessagingService-NettyOutbound-Thread-4-1] 2018-04-08 03:07:55,360 OutboundHandshakeHandler.java:107
- starting handshake with peer 127.0.0.1:7000, msg = FirstHandshakeMessage - messaging version:
12, mode: MESSAGING, compress: false, channel local = /127.0.0.1:65020 remote = /127.0.0.1:7000
TRACE [MessagingService-NettyOutbound-Thread-4-3] 2018-04-08 03:09:21,427 OutboundHandshakeHandler.java:107
- starting handshake with peer 127.0.0.3:7000, msg = FirstHandshakeMessage - messaging version:
12, mode: MESSAGING, compress: false, channel local = /127.0.0.3:65091 remote = /127.0.0.3:7000
{noformat}
I only modified the trace log line in {{OutboundHandshakeHandler#channelActive}} as below
-
{noformat}
        logger.trace("starting handshake with peer {}, msg = {}, channel local = {} remote
= {}", connectionId.connectionAddress(),
                     msg, ctx.channel().localAddress(), ctx.channel().remoteAddress());
{noformat}

> Node unable to join cluster in trunk
> ------------------------------------
>
>                 Key: CASSANDRA-14362
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14362
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Lerh Chuan Low
>            Assignee: Jason Brown
>            Priority: Major
>             Fix For: 4.0
>
>
> I'm not sure if anybody has tried using {{trunk}} and starting an actual EC2 cluster,
but with a 3 node setup that works on 3.11, it fails to work on {{trunk}}. I mistakenly stumbled
into it while testing my own code changes. 
> My setup is as follows:
> broadcast_rpc_address: publicIP
> broadcast_address: publicIP
> listen_address: omitted. Ends up as privateIP. 
> Works on 3.11 just fine. 
> On {{trunk}} though, it never works. My node is never able to join the cluster:
> {code}
> Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO  [main] 2018-04-03 05:57:05,895
RangeStreamer.java:195 - Bootstrap: range (-128373781239966537,-122439194129870521] exists
on 52.88.241.181:7000 for keyspace system_traces
> Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO  [main] 2018-04-03 05:57:05,895
RangeStreamer.java:195 - Bootstrap: range (6968670424536541270,6973888347502882935] exists
on 52.88.241.181:7000 for keyspace system_traces
> Apr 03 05:57:42 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:57:42,298
FailureDetector.java:324 - Not marking nodes down due to local pause of 26215173446ns >
5000000000ns
> Apr 03 05:57:53 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:57:53,035
FailureDetector.java:324 - Not marking nodes down due to local pause of 10736485907ns >
5000000000ns
> Apr 03 05:58:30 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:58:30,790
Gossiper.java:814 - Gossip stage has 28 pending tasks; skipping status check (no nodes will
be marked down)
> Apr 03 05:58:33 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:58:33,060
Gossiper.java:814 - Gossip stage has 20 pending tasks; skipping status check (no nodes will
be marked down)
> Apr 03 06:04:33 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 06:04:33,826
FailureDetector.java:324 - Not marking nodes down due to local pause of 400790432954ns >
5000000000ns
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 06:04:49,133
Gossiper.java:814 - Gossip stage has 2 pending tasks; skipping status check (no nodes will
be marked down)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ERROR [StreamConnectionEstablisher:1]
2018-04-03 06:04:49,138 StreamSession.java:524 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6]
Streaming error occurred on session with peer 52.88.241.181:7000
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: java.io.IOException: failed to connect
to 52.88.241.181:7000 (STREAM) for streaming data
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:98)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:57)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.createChannel(NettyStreamingMessageSender.java:183)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.setupControlMessageChannel(NettyStreamingMessageSender.java:165)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.sendMessage(NettyStreamingMessageSender.java:222)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.initialize(NettyStreamingMessageSender.java:146)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:271)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:273)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.lang.Thread.run(Thread.java:748)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: Caused by: io.netty.channel.unix.Errors$NativeIoException:
bind(..) failed: Cannot assign requested address
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.unix.Errors.newIOException(Errors.java:117)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.unix.Socket.bind(Socket.java:266)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.AbstractEpollStreamChannel.doConnect(AbstractEpollStreamChannel.java:729)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.EpollSocketChannel.doConnect(EpollSocketChannel.java:184)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.connect(AbstractEpollStreamChannel.java:797)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1274)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:999)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:260)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:254)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:312)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         ... 1 common frames omitted
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: INFO  [StreamConnectionEstablisher:1]
2018-04-03 06:04:49,140 StreamResultFuture.java:197 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6]
Session with 52.88.241.181:7000 is complete
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ERROR [StreamConnectionEstablisher:1]
2018-04-03 06:04:49,146 StreamSession.java:524 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6]
Streaming error occurred on session with peer 52.88.241.181:7000
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: java.lang.RuntimeException: stream has
been closed, cannot send Prepare SYN (3 requests,  0 files}
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.sendMessage(NettyStreamingMessageSender.java:209)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamSession.onInitializationComplete(StreamSession.java:495)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:272)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:273)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.lang.Thread.run(Thread.java:748)
> {code}
> Remote debugging it, it eventually seems to come to be due to trying to call {{socket.bind}}
in {{doConnect}} in {{AbstractEpollStreamChannel}}. Local address is <publicIP>:0. I
think port 0 is fine because it just means pick the next ephemeral port available, but there's
no way of binding to the public IP. 
> I made a commit https://github.com/apache/cassandra/commit/00222ddf0694f1c35c2bdd84fd7407174c3fc57a
that seems to have fixed it so I can continue on with my purposes, and that is have {{StreamSession}}
use listen address instead of broadcast address which may be public. In the event listen address
is public I guess that means if it was able to bind to that interface to begin with then {{socket.bind}}
should also work. 
> Not sure if this is the right way though, and it seems to have been introduced by the
Streaming rework to netty. https://issues.apache.org/jira/browse/CASSANDRA-12229. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


Mime
View raw message