From: David McNelis
To: user@cassandra.apache.org
Date: Fri, 24 Jun 2011 08:20:19 -0500
Subject: Re: Restarting cluster

Running on CentOS.

We had a massive power failure and our UPS wasn't up to 48 hours without power...

In this situation the IP addresses have all stayed the same.  I can still connect to the "other" node from the CLI, so I don't think it's an issue where the iptables settings weren't saved and started blocking traffic.

In terms of the log files, the only related lines are:

 INFO [main] 2011-06-24 07:48:44,750 StorageService.java (line 382) Loading persisted ring state
 INFO [main] 2011-06-24 07:48:44,757 StorageService.java (line 418) Starting up server gossip

When I turn on debugging and restart the non-seed node I get this line:

DEBUG [WRITE-/192.168.80.XXX] 2011-06-24 08:04:48,798 OutboundTcpConnection.java (line 161) attempting to connect to /192.168.80.XXX

But no errors after it.

On Fri, Jun 24, 2011 at 7:58 AM, Sasha Dolgy wrote:
> Normally, no.  What you've done is fine.  What is the environment?
>
> On amazon EC2 for example, the instance could have crashed, a new one
> is brought online and has a different internal IP ...
>
> in the cassandra/logs/system.log are there any messages on the 2nd
> node and how it relates to the seed node?
>
> On Fri, Jun 24, 2011 at 2:49 PM, David McNelis wrote:
> > I am running 0.8.0 on CentOS.  I have 2 nodes in my cluster, one is a
> > seed, the other is autobootstrapped.
> > After having an unexpected shutdown of both of the physical machines I
> > am trying to restart the cluster.  I first started the seed node, it
> > went through the normal startup process and finished without error.
> > Once that was complete I started the second node, again no errors in
> > the log as it was starting, it started the gossip server, etc.
> > However when I look at the ring using nodetool, both machines show their
> > own status as up, then show the other machine as Down with a state of
> > Normal and a load of ?.
> > I have tried restarting the individual nodes in different
> > orders, waiting a while after restarting a node, but still the 'other'
> > node always has a status of "down".  nodetool repair [keyspace] did not
> > make any difference either and nodetool join just told me that the
> > nodes were already a part of the ring.
> > I can't imagine this is how it *should* be behaving... is there a piece
> > I'm missing in terms of getting one node to recognize the other as
> > being Up?

-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*