From: Edward Sargisson
Delivered-To: mailing list user@cassandra.apache.org
Date: Fri, 3 Aug 2012 15:25:06 -0700
Subject: Node doesn't rejoin ring after restart
Message-ID: <501C4FC2.9050502@globalrelay.net>

Hi all,

I'm testing our procedures for handling some Cassandra failure scenarios and there's something I don't understand.

I'm testing on a 3-node cluster with a replication_factor of 3.
I stopped one of the nodes for 5 or so minutes and ran some application tests. Everything was fine.

Then I started Cassandra on that node again and it refuses to rejoin the ring. It can see itself as up but not the other nodes. The other nodes can see themselves but don't see it as up.

I deliberately haven't followed any of the token replacement methods outlined in the docs. I'm working on the assumption that a short outage on one node shouldn't require extraordinary action.

Nor do I want to have to stop every node before bringing them up one by one.

What am I missing? Am I forced into those time-consuming methods every time I want to restart?

Thoughts?

Cheers,
Edward

--
Edward Sargisson
senior java developer
Global Relay
edward.sargisson@globalrelay.net