Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <BANLkTinapLhp7oY8Fe05Wgy-oQVLr_ogJw@mail.gmail.com>
References: <BANLkTikK_HLwFRZHZWn5gBxXrdhMichPbg@mail.gmail.com>
	<BANLkTinapLhp7oY8Fe05Wgy-oQVLr_ogJw@mail.gmail.com>
Date: Fri, 24 Jun 2011 08:20:19 -0500
Message-ID: <BANLkTi=9DMC8Yby_67JYs5kR2BHzjLrQVA@mail.gmail.com>
Subject: Re: Restarting cluster
From: David McNelis <dmcnelis@agentisenergy.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=bcaec554070e2c00ea04a6751010

--bcaec554070e2c00ea04a6751010
Content-Type: text/plain; charset=ISO-8859-1

Running on CentOS.

We had a massive power failure and our UPS wasn't up to 48 hours without
power...

In this situation the IP addresses have all stayed the same.  I can still
connect to the "other" node from the cli, so I don't think it's an issue
where the iptables settings weren't saved and started blocking traffic.
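
One caveat on that check: the cli talks to the Thrift rpc_port (9160 by
default), while the nodes gossip with each other over storage_port (7000 by
default), so a working cli connection does not by itself rule out a firewall
problem on the gossip port.  A minimal sketch for testing both ports from
each node, assuming the stock cassandra.yaml port values and Python 3 on the
machines (the function names and the example address are hypothetical, not
anything shipped with Cassandra):

```python
import socket

# Assumed defaults from a stock cassandra.yaml; adjust if yours differ.
STORAGE_PORT = 7000  # storage_port: inter-node traffic, including gossip
RPC_PORT = 9160      # rpc_port: Thrift clients such as cassandra-cli


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host, ports=(STORAGE_PORT, RPC_PORT)):
    """Map each port to whether it accepted a TCP connection from this machine."""
    return {port: port_open(host, port) for port in ports}


# Example with a hypothetical address: run on one node against the other.
# print(check_node("192.0.2.10"))
```

If the rpc port reports True but the storage port reports False in either
direction, that would point back at the firewall rather than at Cassandra
itself.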

The only related lines in the log files are:

 INFO [main] 2011-06-24 07:48:44,750 StorageService.java (line 382) Loading persisted ring state
 INFO [main] 2011-06-24 07:48:44,757 StorageService.java (line 418) Starting up server gossip

When I turn on debug logging and restart the non-seed node I get this line:

DEBUG [WRITE-/192.168.80.XXX] 2011-06-24 08:04:48,798 OutboundTcpConnection.java (line 161) attempting to connect to /192.168.80.XXX

But no errors after it.


On Fri, Jun 24, 2011 at 7:58 AM, Sasha Dolgy <sdolgy@gmail.com> wrote:

> Normally, no.  What you've done is fine.  What is the environment?
>
> On amazon EC2 for example, the instance could have crashed, a new one
> is brought online and has a different internal IP ...
>
> in the cassandra/logs/system.log are there any messages on the 2nd
> node and how it relates to the seed node?
>
> On Fri, Jun 24, 2011 at 2:49 PM, David McNelis
> <dmcnelis@agentisenergy.com> wrote:
> > I am running 0.8.0 on CentOS.  I have 2 nodes in my cluster; one is a
> > seed, the other is autobootstrapped.
> > After an unexpected shutdown of both physical machines I am trying to
> > restart the cluster.  I first started the seed node; it went through
> > the normal startup process and finished without error.  Once that was
> > complete I started the second node, again with no errors in the log as
> > it was starting; it started the gossip server, etc.
> > However, when I look at the ring using nodetool, both machines show
> > their own status as up, then show the other machine as Down with a
> > state of Normal and a load of ?.  I have tried restarting the
> > individual nodes in different orders and waiting a while after
> > restarting a node, but the 'other' node always has a status of "down".
> > nodetool repair [keyspace] did not make any difference either, and
> > nodetool join just told me that the nodes were already a part of the
> > ring.
> > I can't imagine this is how it *should* be behaving... is there a piece
> > I'm missing in terms of getting one node to recognize the other as
> > being Up?
>


-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
o: 630.359.6395
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*

--bcaec554070e2c00ea04a6751010--