cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Brown (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-14362) Node unable to join cluster in trunk
Date Thu, 05 Apr 2018 01:59:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426241#comment-16426241
] 

Jason Brown edited comment on CASSANDRA-14362 at 4/5/18 1:58 AM:
-----------------------------------------------------------------

So it looks like when working on CASSANDRA-8457/CASSANDRA-12229, I introduced a regression
of CASSANDRA-12673. The short of it is we no longer bind the local side of a outbound connection
as of CASSANDRA-12673, yet I mistakenly brought that back (see [{{NettyFactory#createOutboundBootstrap()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/NettyFactory.java#L337]).

I believe the most correct fix is just not setting the local address to bind to for outbound
connections, as that's bascially what CASSANDRA-12673 does. Removing the local bind should
resolve [~Lerh Low]'s error as well as eliminate the regression against CASSANDRA-12673. However,
we should also incorporate [~Lerh Low]'s fix as well, to clarify the intent of which address
we want to 'reference' (but not actually use) on the local side.
||14362||
|[branch|https://github.com/jasobrown/cassandra/tree/14362]|
|[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14362]|

tests are running now. 

[~aweisberg] do you mind reviewing?

UPDATE: dtests seem to be pretty unhappy. I'll investigate what's wrong and ping this ticket
with updates.


was (Author: jasobrown):
So it looks like when working on CASSANDRA-8457/CASSANDRA-12229, I introduced a regression
of CASSANDRA-12673. The short of it is we no longer bind the local side of a outbound connection
as of CASSANDRA-12673, yet I mistakenly brought that back (see [{{NettyFactory#createOutboundBootstrap()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/async/NettyFactory.java#L337]).

I believe the most correct fix is just not setting the local address to bind to for outbound
connections, as that's bascially what CASSANDRA-12673 does. Removing the local bind should
resolve [~Lerh Low]'s error as well as eliminate the regression against CASSANDRA-12673. However,
we should also incorporate [~Lerh Low]'s fix as well, to clarify the intent of which address
we want to 'reference' (but not actually use) on the local side.
||14362||
|[branch|https://github.com/jasobrown/cassandra/tree/14362]|
|[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14362]|

tests are running now. 

[~aweisberg] do you mind reviewing?

> Node unable to join cluster in trunk
> ------------------------------------
>
>                 Key: CASSANDRA-14362
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14362
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Lerh Chuan Low
>            Assignee: Jason Brown
>            Priority: Major
>             Fix For: 4.0
>
>
> I'm not sure if anybody has tried using {{trunk}} and starting an actual EC2 cluster,
but with a 3 node setup that works on 3.11, it fails to work on {{trunk}}. I mistakenly stumbled
into it while testing my own code changes. 
> My setup is as follows:
> broadcast_rpc_address: publicIP
> broadcast_address: publicIP
> listen_address: omitted. Ends up as privateIP. 
> Works on 3.11 just fine. 
> On {{trunk}} though, it never works. My node is never able to join the cluster:
> {code}
> Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO  [main] 2018-04-03 05:57:05,895
RangeStreamer.java:195 - Bootstrap: range (-128373781239966537,-122439194129870521] exists
on 52.88.241.181:7000 for keyspace system_traces
> Apr 03 05:57:06 ip-10-0-47-122 cassandra[13914]: INFO  [main] 2018-04-03 05:57:05,895
RangeStreamer.java:195 - Bootstrap: range (6968670424536541270,6973888347502882935] exists
on 52.88.241.181:7000 for keyspace system_traces
> Apr 03 05:57:42 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:57:42,298
FailureDetector.java:324 - Not marking nodes down due to local pause of 26215173446ns >
5000000000ns
> Apr 03 05:57:53 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:57:53,035
FailureDetector.java:324 - Not marking nodes down due to local pause of 10736485907ns >
5000000000ns
> Apr 03 05:58:30 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:58:30,790
Gossiper.java:814 - Gossip stage has 28 pending tasks; skipping status check (no nodes will
be marked down)
> Apr 03 05:58:33 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 05:58:33,060
Gossiper.java:814 - Gossip stage has 20 pending tasks; skipping status check (no nodes will
be marked down)
> Apr 03 06:04:33 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 06:04:33,826
FailureDetector.java:324 - Not marking nodes down due to local pause of 400790432954ns >
5000000000ns
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: WARN  [GossipTasks:1] 2018-04-03 06:04:49,133
Gossiper.java:814 - Gossip stage has 2 pending tasks; skipping status check (no nodes will
be marked down)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ERROR [StreamConnectionEstablisher:1]
2018-04-03 06:04:49,138 StreamSession.java:524 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6]
Streaming error occurred on session with peer 52.88.241.181:7000
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: java.io.IOException: failed to connect
to 52.88.241.181:7000 (STREAM) for streaming data
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:98)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.DefaultConnectionFactory.createConnection(DefaultConnectionFactory.java:57)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.createChannel(NettyStreamingMessageSender.java:183)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.setupControlMessageChannel(NettyStreamingMessageSender.java:165)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.sendMessage(NettyStreamingMessageSender.java:222)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.initialize(NettyStreamingMessageSender.java:146)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:271)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:273)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.lang.Thread.run(Thread.java:748)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: Caused by: io.netty.channel.unix.Errors$NativeIoException:
bind(..) failed: Cannot assign requested address
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.unix.Errors.newIOException(Errors.java:117)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.unix.Socket.bind(Socket.java:266)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.AbstractEpollStreamChannel.doConnect(AbstractEpollStreamChannel.java:729)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.EpollSocketChannel.doConnect(EpollSocketChannel.java:184)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.connect(AbstractEpollStreamChannel.java:797)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1274)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:999)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:260)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:254)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:312)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         ... 1 common frames omitted
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: INFO  [StreamConnectionEstablisher:1]
2018-04-03 06:04:49,140 StreamResultFuture.java:197 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6]
Session with 52.88.241.181:7000 is complete
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: ERROR [StreamConnectionEstablisher:1]
2018-04-03 06:04:49,146 StreamSession.java:524 - [Stream #d4cd6420-3703-11e8-a6a5-e51ddc10cfe6]
Streaming error occurred on session with peer 52.88.241.181:7000
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]: java.lang.RuntimeException: stream has
been closed, cannot send Prepare SYN (3 requests,  0 files}
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.async.NettyStreamingMessageSender.sendMessage(NettyStreamingMessageSender.java:209)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamSession.onInitializationComplete(StreamSession.java:495)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamSession.start(StreamSession.java:272)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.streaming.StreamCoordinator$StreamSessionConnector.run(StreamCoordinator.java:273)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> Apr 03 06:04:49 ip-10-0-47-122 cassandra[13914]:         at java.lang.Thread.run(Thread.java:748)
> {code}
> Remote debugging it, it eventually seems to come to be due to trying to call {{socket.bind}}
in {{doConnect}} in {{AbstractEpollStreamChannel}}. Local address is <publicIP>:0. I
think port 0 is fine because it just means pick the next ephemeral port available, but there's
no way of binding to the public IP. 
> I made a commit https://github.com/apache/cassandra/commit/00222ddf0694f1c35c2bdd84fd7407174c3fc57a
that seems to have fixed it so I can continue on with my purposes, and that is have {{StreamSession}}
use listen address instead of broadcast address which may be public. In the event listen address
is public I guess that means if it was able to bind to that interface to begin with then {{socket.bind}}
should also work. 
> Not sure if this is the right way though, and it seems to have been introduced by the
Streaming rework to netty. https://issues.apache.org/jira/browse/CASSANDRA-12229. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org


Mime
View raw message