incubator-cassandra-user mailing list archives

From Skye Book <skye.b...@gmail.com>
Subject Re: Nodes not added to existing cluster
Date Mon, 18 Nov 2013 07:36:00 GMT
Hi there,

I’m bringing this thread back because it’s something I thought was solved, but it is
apparently not fixed on my end.

To recap, I’m having trouble getting a node to join a cluster.  The configuration looks
correct with the Ec2MultiRegionSnitch, but new nodes are unable to handshake with the seeds.

- Security group has ports 22 and 1024-65535 open
- Nodes are configured with password authentication using CassandraAuthorizer
- internode_authenticator is commented out in configuration
- rpc_address is set to the instance’s private address
- listen_address is set to the instance’s private address
- broadcast_address is set to the instance's public address
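
One way to rule out the network layer for the settings above is to probe each seed from the new node. The following is a minimal sketch and not part of the original report: `can_reach` and `probe_seeds` are hypothetical helpers, and port 7000 is assumed because it is Cassandra's default storage_port (7001 when internode encryption is enabled). A successful TCP connect only shows that the port is open; it does not prove the version handshake will succeed.

```python
import socket


def can_reach(host, port=7000, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Port 7000 is assumed here because it is Cassandra's default
    storage_port; adjust it if cassandra.yaml says otherwise.
    """
    try:
        # create_connection resolves the host and performs the TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def probe_seeds(seeds, port=7000, timeout=3.0):
    """Map each seed address to whether its storage port accepted a connection."""
    return {seed: can_reach(seed, port, timeout) for seed in seeds}
```

Calling `probe_seeds([...])` on the new node with the seeds' real public addresses returns a dict of address to True/False. Any False entry would point at the security group or the address configuration rather than at Cassandra itself.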

As was suggested earlier, I’ve enabled TRACE logging for OutboundTcpConnection and get the
following dumped into system.log when the new node is started up without itself in the seed
list (if its own IP is in the list it just creates a new single node cluster).  I’ve gisted
the results here: https://gist.github.com/skyebook/be5ee75a000a1e6d65d0
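
Since a node that finds itself in its own seed list starts a single-node cluster, it can help to check the seed list mechanically before starting the node. This is a small sketch under assumptions: `seeds_containing_self` is a hypothetical helper, and it expects the comma-separated string from the `seeds` parameter in cassandra.yaml plus the node's own private and public addresses.

```python
def seeds_containing_self(seeds_value, own_addresses):
    """Return the seed entries that match one of this node's own addresses.

    seeds_value is the comma-separated string from the `seeds` parameter
    in cassandra.yaml; own_addresses holds the node's private and public IPs.
    """
    # Split on commas and drop surrounding whitespace and empty entries.
    seeds = [s.strip() for s in seeds_value.split(",") if s.strip()]
    own = {a.strip() for a in own_addresses}
    return [s for s in seeds if s in own]
```

An empty list means the node is not seeding itself. A non-empty list names the entries to remove before attempting to bootstrap.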

It looks like the handshake fails completely: the new node seems unable to get any
information from the other nodes, as evidenced by:
OutboundTcpConnection.java (line 386) Handshaking version with /NODE_1_PUBLIC_IP
OutboundTcpConnection.java (line 386) Handshaking version with /NODE_2_PUBLIC_IP
OutboundTcpConnection.java (line 333) Target max version is -2147483648; no version information yet, will retry
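
The value -2147483648 is -2^31, the smallest 32-bit signed integer (Java's Integer.MIN_VALUE). Together with the "no version information yet" text, that suggests a sentinel rather than a real protocol version, though this reading is an assumption drawn from the log wording. A rough sketch for scanning system.log for such lines follows. The function names are hypothetical, and the regular expression is modelled only on the message text quoted above.

```python
import re

# Value printed in the log above; numerically equal to Java's Integer.MIN_VALUE.
NO_VERSION = -2**31

# Pattern modelled on the quoted message text, not on Cassandra's source.
_TARGET_VERSION = re.compile(r"Target max version is (-?\d+)")


def target_versions(log_text):
    """Return every 'Target max version' value found in the log text."""
    return [int(m.group(1)) for m in _TARGET_VERSION.finditer(log_text)]


def handshake_got_version(log_text):
    """True if at least one handshake reported something other than the sentinel."""
    return any(v != NO_VERSION for v in target_versions(log_text))
```

If `handshake_got_version` stays False for the whole log, no peer ever answered the version exchange, which is consistent with the seeds being unreachable rather than with a version mismatch.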

Thanks in advance for any light you all might be able to shed on what’s going on.

On Sep 26, 2013, at 9:03 PM, Aaron Morton <aaron@thelastpickle.com> wrote:

>>  INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
>>  INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
> If you can turn up logging to TRACE for org.apache.cassandra.net.OutboundTcpConnection
> it will include the full error.
> 
>> The two addresses that it is unable to handshake with are the other two addresses
>> of nodes in the cluster I'm unable to join.
> Are you mixing versions?
> 
> 
> Cheers
> 
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
> 
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
> 
> On 26/09/2013, at 5:13 PM, Skye Book <skye.book@gmail.com> wrote:
> 
>> Hi Aaron, thanks for the clarification.
>> 
>> As might be expected, having the broadcast_address fixed hasn't fixed anything.
>> What I did find after writing my last email is that output.log is littered with these:
>> 
>>  INFO 05:03:49,015 Cannot handshake version with /aa.bb.cc.dd
>>  INFO 05:03:49,017 Handshaking version with /aa.bb.cc.dd
>>  INFO 05:03:49,803 Cannot handshake version with /ww.xx.yy.zz
>>  INFO 05:03:49,805 Handshaking version with /ww.xx.yy.zz
>> 
>> The two addresses that it is unable to handshake with are the other two addresses
>> of nodes in the cluster I'm unable to join.  I started thinking that maybe EC2 was having
>> an unadvertised problem communicating between AZ's, but bringing up nodes in both of the other
>> availability zones resulted in the same wrong behavior.
>> 
>> I've gist'd my cassandra.yaml; it's pretty standard and hasn't caused an issue in
>> the past for me.  https://gist.github.com/skyebook/ec9364cdcec02e803ffc
>> 
>> Skye Book
>> http://skyebook.net -- @sbook
>> 
>> On Sep 26, 2013, at 12:34 AM, Aaron Morton <aaron@thelastpickle.com> wrote:
>> 
>>>>  I am curious, though, how any of this worked in the first place spread across
>>>> three AZ's without that being set?
>>> broadcast_address is only needed when you are going cross region (IIRC it's the
>>> Ec2MultiRegionSnitch that sets it).
>>> 
>>> As Rob said, make sure the seed list includes one of the other nodes and that
>>> the cluster_name is set.
>>> 
>>> Cheers
>>> 
>>> -----------------
>>> Aaron Morton
>>> New Zealand
>>> @aaronmorton
>>> 
>>> Co-Founder & Principal Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>> 
>>> On 26/09/2013, at 8:12 AM, Skye Book <skye.book@gmail.com> wrote:
>>> 
>>>> Thank you, both Michael and Robert, for your suggestions.  I actually saw
>>>> CASSANDRA-5768, but we were running on 2.0.0, in which it seems this was fixed.
>>>> 
>>>> That said, I noticed that my Chef scripts were failing to set the broadcast_address
>>>> correctly, which I'm guessing is the cause of the problem. I'm fixing that and trying a
>>>> redeploy.  I am curious, though, how any of this worked in the first place spread across
>>>> three AZ's without that being set?
>>>> 
>>>> -Skye
>>>> 
>>>> On Sep 25, 2013, at 3:56 PM, Robert Coli <rcoli@eventbrite.com> wrote:
>>>> 
>>>>> On Wed, Sep 25, 2013 at 12:41 PM, Skye Book <skye.book@gmail.com> wrote:
>>>>> I have a three node cluster using the EC2 Multi-Region Snitch currently
>>>>> operating only in US-EAST.  On having a node go down this morning, I started a new node with
>>>>> an identical configuration, except for the seed list, the listen address and the rpc address.
>>>>> The new node comes up and creates its own cluster rather than joining the pre-existing ring.
>>>>> I've tried creating a node both before and after using `nodetool remove` for the bad node,
>>>>> each time with the same result.
>>>>> 
>>>>> What version of Cassandra?
>>>>> 
>>>>> This particular confusing behavior is fixed upstream, in a version you
>>>>> should not deploy to production yet. Take some solace, however, that you may be the last Cassandra
>>>>> administrator to die for a broken code path!
>>>>> 
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-5768
>>>>> 
>>>>> Does anyone have any suggestions for where to look that might put me
>>>>> on the right track?
>>>>> 
>>>>> It must be that your seed list is wrong in some way, or your node state
>>>>> is wrong. If you're trying to bootstrap a node, note that you can't bootstrap a node when
>>>>> it is in its own seed list.
>>>>> 
>>>>> If you have installed Cassandra via debian package, there is a possibility
>>>>> that your node has started before you explicitly started it. If so, it might have invalid
>>>>> node state.
>>>>> 
>>>>> Have you tried wiping the data directory and trying again?
>>>>> 
>>>>> What is your seed list? Are you sure the new node can reach the seeds
>>>>> on the network layer?
>>>>> 
>>>>> =Rob
>>>> 
>>> 
>> 
> 

