From: David McNelis
To: user@cassandra.apache.org
Date: Fri, 24 Jun 2011 09:50:26 -0500
Subject: Re: Restarting cluster

It was port 7000 that was my issue. I was thinking everything was going
over 9160, and hadn't made sure that port 7000 was open.

Thanks Sasha and Jonathan.

On Fri, Jun 24, 2011 at 8:42 AM, Jonathan Ellis wrote:
> Did you try netcat to verify that you can get to the internal port on
> machine X from machine Y?
>
> On Fri, Jun 24, 2011 at 8:20 AM, David McNelis wrote:
> > Running on CentOS.
> > We had a massive power failure and our UPS wasn't up to 48 hours
> > without power...
> > In this situation the IP addresses have all stayed the same. I can
> > still connect to the "other" node from the cli, so I don't think it's
> > an issue where the iptables settings weren't saved and started
> > blocking traffic.
> > In terms of the log files, the only related lines are:
> > INFO [main] 2011-06-24 07:48:44,750 StorageService.java (line 382) Loading persisted ring state
> > INFO [main] 2011-06-24 07:48:44,757 StorageService.java (line 418) Starting up server gossip
> > When I turn on debugging and restart the non-seed node I get this line:
> > DEBUG [WRITE-/192.168.80.XXX] 2011-06-24 08:04:48,798 OutboundTcpConnection.java (line 161) attempting to connect to /192.168.80.XXX
> > But no errors after it.
> >
> > On Fri, Jun 24, 2011 at 7:58 AM, Sasha Dolgy wrote:
> >>
> >> Normally, no. What you've done is fine. What is the environment?
> >>
> >> On amazon EC2 for example, the instance could have crashed, a new one
> >> is brought online and has a different internal IP ...
> >>
> >> in the cassandra/logs/system.log are there any messages on the 2nd
> >> node and how it relates to the seed node?
> >>
> >> On Fri, Jun 24, 2011 at 2:49 PM, David McNelis wrote:
> >> > I am running 0.8.0 on CentOS. I have 2 nodes in my cluster, one
> >> > is a seed, the other is autobootstrapped.
> >> > After having an unexpected shutdown of both of the physical
> >> > machines I am trying to restart the cluster. I first started the
> >> > seed node, it went through the normal startup process and finished
> >> > without error. Once that was complete I started the second node,
> >> > again no errors in the log as it was starting, it started the
> >> > gossip server, etc.
> >> > However when I look at the ring using nodetool, both machines show
> >> > their own status as Up, then show the other machine as Down with a
> >> > state of Normal and a load of ?. I have tried restarting the
> >> > individual nodes in different orders, waiting a while after
> >> > restarting a node, but still the 'other' node always has a status
> >> > of "Down". nodetool repair [keyspace] did not make any difference
> >> > either and nodetool join just told me that the nodes were already
> >> > a part of the ring.
> >> > I can't imagine this is how it *should* be behaving... is there a
> >> > piece I'm missing in terms of getting one node to recognize the
> >> > other as being Up?
> >
> > --
> > David McNelis
> > Lead Software Engineer
> > Agentis Energy
> > www.agentisenergy.com
> > o: 630.359.6395
> > c: 219.384.5143
> > A Smart Grid technology company focused on helping consumers of energy
> > control an often under-managed resource.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

--
David McNelis
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143
A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.

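For anyone who hits the same symptom (each node sees itself as Up and the other as Down), the netcat check suggested in the thread can also be done with a few lines of Python. This is only a minimal sketch, not anything shipped with Cassandra: the function names port_open and check_node are made up for illustration, and 7000 (inter-node storage/gossip) and 9160 (Thrift client, used by the cli) are assumed to be the default ports. If storage_port or rpc_port was changed in cassandra.yaml, adjust the values accordingly.

```python
import socket

# Assumed defaults: 7000 is the inter-node (storage/gossip) port and
# 9160 is the Thrift client port used by the cli. Verify against
# storage_port and rpc_port in your own cassandra.yaml.
CASSANDRA_PORTS = {"storage/gossip": 7000, "thrift client": 9160}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host, ports=CASSANDRA_PORTS, timeout=3.0):
    """Map each named port to whether it is reachable from this machine."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the other node's address as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for name, ok in check_node(host).items():
        print(f"{host} {name}: {'reachable' if ok else 'NOT reachable'}")
```

Run it on each node against the other node's address. If 9160 is reachable but 7000 is not, the cli will connect fine while gossip between the nodes cannot, which matches what was seen here.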