Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTi=izNb3QBgbfY=h_9WvtW9tYXSuMtMq7TfaNMqq@mail.gmail.com>
References: <AANLkTimFNXsgPgOzVebnJnqtWC+f8VULxU2o4Z8=z=V9@mail.gmail.com>
 <AANLkTim7LFoRUMMXRo6wbqm5OTvk0GvdhRZFcdj=gK6_@mail.gmail.com>
 <AANLkTikHJmPcKsfNEhUd2Kx21t5zktQM1yBqKA9EP-t=@mail.gmail.com>
 <AANLkTi=J=J72N3=oeTuCRWOxEgZ-Z254kayQ0=hxxXCS@mail.gmail.com>
 <AANLkTikp4qQOZv3=UxuOis3wsMCtoOsoMW=MbEf3f=jW@mail.gmail.com>
 <AANLkTikmmB2QkXf651QFOVNZ08E0uWVf3Y2vzv30wzqa@mail.gmail.com>
 <AANLkTik-NGktOTPs7wAd5MCxoVYmnbF9cnqHCMTVOggc@mail.gmail.com>
 <AANLkTimHx6Q4o-JP_Mo-ruhqWUvZGPAPR8=iZdJiD9+8@mail.gmail.com>
 <AANLkTinso84ESyOG+i+H1dBvB89ZO2yJu6ingx81C7sq@mail.gmail.com>
 <AANLkTi=9=BHfHjskGA8+wzY193xwgvcYGz4+eB9MjeEo@mail.gmail.com>
 <AANLkTik6BShZWRgehaNMHL9GB7ZkW6emHg2p3noPQ7O7@mail.gmail.com>
 <AANLkTikxb5joHxr1TavBiB3UCPmq4VeUXbpr0RLK=SVa@mail.gmail.com>
 <AANLkTikSNzJ=LLReXGeVFtbeVYxSjpFXOR_DF+Etnh08@mail.gmail.com>
 <AANLkTik-WAe71tKcTPm_8TRD-HSLKa2hNV8heaNCg0q-@mail.gmail.com>
 <AANLkTimL+HnhWYJWTJJo4XzFH0b4YvUEpQhA_8L9EeZk@mail.gmail.com>
 <AANLkTi=izNb3QBgbfY=h_9WvtW9tYXSuMtMq7TfaNMqq@mail.gmail.com>
From: Ran Tavory <rantav@gmail.com>
Date: Wed, 5 Jan 2011 17:05:04 +0200
Message-ID: <AANLkTi=vy+Jg8oYoY8nYnyx=eK=L7UL=dMGv65R=wbo5@mail.gmail.com>
Subject: Re: Bootstrapping taking long
To: user@cassandra.apache.org
Cc: Marco Supino <Marco@outbrain.com>
Content-Type: multipart/alternative; boundary=00c09fa9c4ff93bafa04991ab7df

--00c09fa9c4ff93bafa04991ab7df
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

In storage-conf.xml I see this comment [1], from which I understand that the
recommended way to bootstrap a new node is to set AutoBootstrap=true and
leave the node out of its own seeds list.
Moreover, I did try setting AutoBootstrap=true with the node in its own
seeds list, but it would not bootstrap. I don't recall the exact message, but
it was something like "I found myself in the seeds list, therefore I'm not
going to bootstrap even though AutoBootstrap is true".

[1]
  <!--
   ~ Turn on to make new [non-seed] nodes automatically migrate the right data
   ~ to themselves.  (If no InitialToken is specified, they will pick one
   ~ such that they will get half the range of the most-loaded node.)
   ~ If a node starts up without bootstrapping, it will mark itself bootstrapped
   ~ so that you can't subsequently accidently bootstrap a node with
   ~ data on it.  (You can reset this by wiping your data and commitlog
   ~ directories.)
   ~
   ~ Off by default so that new clusters and upgraders from 0.4 don't
   ~ bootstrap immediately.  You should turn this on when you start adding
   ~ new nodes to a cluster that already has data on it.  (If you are upgrading
   ~ from 0.4, start your cluster with it off once before changing it to true.
   ~ Otherwise, no data will be lost but you will incur a lot of unnecessary
   ~ I/O before your cluster starts up.)
  -->
  <AutoBootstrap>false</AutoBootstrap>
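
To make my reading of that comment concrete, here is a minimal Python sketch.
It is not Cassandra's actual startup code: the decision rule is only my
interpretation of the comment and of the behavior I saw, and the ListenAddress
and Seeds/Seed element names, the sample addresses, and the bootstrap_plan
helper are assumptions made for illustration. It also compares addresses as
plain strings, which ignores hostname vs. IP differences.

```python
import xml.etree.ElementTree as ET


def bootstrap_plan(conf_xml: str) -> str:
    """Describe what a node would be expected to do at startup.

    Hypothetical helper: the rule below is an interpretation of the
    AutoBootstrap comment, not logic taken from the Cassandra source.
    """
    root = ET.fromstring(conf_xml)
    # Element names are assumed to match a 0.6-style storage-conf.xml.
    auto = (root.findtext("AutoBootstrap") or "false").strip().lower() == "true"
    listen = (root.findtext("ListenAddress") or "").strip()
    seeds = [s.text.strip() for s in root.findall("Seeds/Seed") if s.text]

    if not auto:
        return "AutoBootstrap is false: node joins without bootstrapping"
    if listen in seeds:
        return "node lists itself as a seed: bootstrap is skipped"
    return "non-seed node with AutoBootstrap=true: node should bootstrap"


# Made-up configuration fragment; the addresses are placeholders.
SAMPLE = """
<Storage>
  <AutoBootstrap>true</AutoBootstrap>
  <ListenAddress>10.0.0.5</ListenAddress>
  <Seeds>
    <Seed>10.0.0.1</Seed>
    <Seed>10.0.0.5</Seed>
  </Seeds>
</Storage>
"""

print(bootstrap_plan(SAMPLE))
```

With the node's own address removed from the Seeds element, the same helper
reports that the node should bootstrap, which matches the first phase of the
procedure that worked for me.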

On Wed, Jan 5, 2011 at 4:58 PM, David Boxenhorn <david@lookin2.com> wrote:

> If "seed list should be the same across the cluster" that means that nodes
> *should* have themselves as a seed. If that doesn't work for Ran, then that
> is the first problem, no?
>
>
> On Wed, Jan 5, 2011 at 3:56 PM, Jake Luciani <jakers@gmail.com> wrote:
>
>> Well, your ring issues don't make sense to me; the seed list should be the
>> same across the cluster.
>> I'm just thinking of other things to try. Non-bootstrapped nodes should
>> join the ring instantly, but reads will fail if you aren't using quorum.
>>
>>
>> On Wed, Jan 5, 2011 at 8:51 AM, Ran Tavory <rantav@gmail.com> wrote:
>>
>>> I haven't tried repair.  Should I?
>>> On Jan 5, 2011 3:48 PM, "Jake Luciani" <jakers@gmail.com> wrote:
>>> > Have you tried not bootstrapping but setting the token and manually
>>> > calling repair?
>>> >
>>> > On Wed, Jan 5, 2011 at 7:07 AM, Ran Tavory <rantav@gmail.com> wrote:
>>> >
>>> >> My conclusion is lame: I tried this on several hosts and saw the same
>>> >> behavior. The only way I was able to join new nodes was to first start
>>> >> them when they are *not in* their own seeds list and, after they
>>> >> finish transferring the data, restart them with themselves *in*
>>> >> their own seeds list. After doing that the node would join the ring.
>>> >> This is either my misunderstanding or a bug, but the only place I
>>> >> found it documented stated that the new node should not be in its own
>>> >> seeds list.
>>> >> Version 0.6.6.
>>> >>
>>> >> On Wed, Jan 5, 2011 at 10:35 AM, David Boxenhorn <david@lookin2.com> wrote:
>>> >>
>>> >>> My nodes all have themselves in their list of seeds - always did - and
>>> >>> everything works. (You may ask why I did this. I don't know, I must have
>>> >>> copied it from an example somewhere.)
>>> >>>
>>> >>> On Wed, Jan 5, 2011 at 9:42 AM, Ran Tavory <rantav@gmail.com> wrote:
>>> >>>
>>> >>>> I was able to make the node join the ring, but I'm confused.
>>> >>>> When I first added the node, it was not in its own seeds list.
>>> >>>> AFAIK this is how it's supposed to be. It was able to transfer all
>>> >>>> the data to itself from the other nodes, but then it stayed in the
>>> >>>> bootstrapping state.
>>> >>>> So what I did (and I don't know why it works) was add this node to the
>>> >>>> seeds list in its own storage-conf.xml file and restart the server;
>>> >>>> then I finally saw it in the ring...
>>> >>>> If I had added the node to its own seeds list when first joining
>>> >>>> it, it would not have joined the ring, but doing it in two phases
>>> >>>> did work.
>>> >>>> So it's either my misunderstanding or a bug...
>>> >>>>
>>> >>>>
>>> >>>> On Wed, Jan 5, 2011 at 7:14 AM, Ran Tavory <rantav@gmail.com> wrote:
>>> >>>>
>>> >>>>> The new node does not see itself as part of the ring; it sees all the
>>> >>>>> others but itself, so from that perspective the view is consistent.
>>> >>>>> The only problem is that the node never finishes bootstrapping. It
>>> >>>>> stays in this state for hours (it's been 20 hours now...)
>>> >>>>>
>>> >>>>>
>>> >>>>> $ bin/nodetool -p 9004 -h localhost streams
>>> >>>>>> Mode: Bootstrapping
>>> >>>>>> Not sending any streams.
>>> >>>>>> Not receiving any streams.
>>> >>>>>
>>> >>>>>
>>> >>>>> On Wed, Jan 5, 2011 at 1:20 AM, Nate McCall <nate@riptano.com> wrote:
>>> >>>>>
>>> >>>>>> Does the new node have itself in the list of seeds, by any chance?
>>> >>>>>> This could cause some issues if so.
>>> >>>>>>
>>> >>>>>> On Tue, Jan 4, 2011 at 4:10 PM, Ran Tavory <rantav@gmail.com> wrote:
>>> >>>>>> > I'm still at a loss. I haven't been able to resolve this. I tried
>>> >>>>>> > adding another node at a different location on the ring, but this
>>> >>>>>> > node too remains stuck in the bootstrapping state for many hours
>>> >>>>>> > without any of the other nodes being busy with anti-compaction or
>>> >>>>>> > anything else. I don't know what's keeping it from finishing the
>>> >>>>>> > bootstrap: no CPU, no I/O, and the files were already streamed, so
>>> >>>>>> > what is it waiting for?
>>> >>>>>> > I read the release notes of 0.6.7 and 0.6.8 and there didn't seem
>>> >>>>>> > to be anything addressing a similar issue, so I figured there was
>>> >>>>>> > no point in upgrading. But let me know if you think there is.
>>> >>>>>> > Or any other advice...
>>> >>>>>> >
>>> >>>>>> > On Tuesday, January 4, 2011, Ran Tavory <rantav@gmail.com> wrote:
>>> >>>>>> >> Thanks Jake, but unfortunately the streams directory is empty, so
>>> >>>>>> >> I don't think that any of the nodes is anti-compacting data right
>>> >>>>>> >> now or has been in the past 5 hours. It seems that all the data was
>>> >>>>>> >> already transferred to the joining host, but the joining node,
>>> >>>>>> >> after having received the data, still remains in bootstrapping
>>> >>>>>> >> mode and does not join the cluster. I'm not sure that *all* data
>>> >>>>>> >> was transferred (perhaps other nodes need to transfer more data),
>>> >>>>>> >> but nothing is actually happening, so I assume all has been moved.
>>> >>>>>> >> Perhaps it's a configuration error on my part. Should I use
>>> >>>>>> >> AutoBootstrap=true? Is there anything else I should look out for
>>> >>>>>> >> in the configuration file, or elsewhere?
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >> On Tue, Jan 4, 2011 at 4:08 PM, Jake Luciani <jakers@gmail.com> wrote:
>>> >>>>>> >>
>>> >>>>>> >> In 0.6, locate the node doing anti-compaction and look in the
>>> >>>>>> >> "streams" subdirectory in the keyspace data dir to monitor the
>>> >>>>>> >> anti-compaction progress (it puts new SSTables for the
>>> >>>>>> >> bootstrapping node in there)
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >> On Tue, Jan 4, 2011 at 8:01 AM, Ran Tavory <rantav@gmail.com> wrote:
>>> >>>>>> >>
>>> >>>>>> >> Running nodetool decommission didn't help. Actually, the node
>>> >>>>>> >> refused to decommission itself (because it wasn't part of the
>>> >>>>>> >> ring). So I simply stopped the process, deleted all the data
>>> >>>>>> >> directories and started it again. It worked in the sense that the
>>> >>>>>> >> node bootstrapped again, but as before, after it had finished
>>> >>>>>> >> moving the data nothing happened for a long time (I'm still
>>> >>>>>> >> waiting, but nothing seems to be happening).
>>> >>>>>> >>
>>> >>>>>> >> Any hints on how to analyze a "stuck" bootstrapping node? Thanks.
>>> >>>>>> >>
>>> >>>>>> >> On Tue, Jan 4, 2011 at 1:51 PM, Ran Tavory <rantav@gmail.com> wrote:
>>> >>>>>> >> Thanks Shimi. Indeed, anticompaction was run on one of the other
>>> >>>>>> >> nodes from the same DC, but to my understanding it has already
>>> >>>>>> >> ended, a few hours ago...
>>> >>>>>> >>
>>> >>>>>> >> I see plenty of log messages such as [1], which ended a couple of
>>> >>>>>> >> hours ago, and I've seen the new node streaming and accepting the
>>> >>>>>> >> data from the node which performed the anticompaction. So far it
>>> >>>>>> >> was normal, so it seemed that the data is in its right place. But
>>> >>>>>> >> now the new node seems sort of stuck. None of the other nodes is
>>> >>>>>> >> anticompacting right now or has been anticompacting since then.
>>> >>>>>> >>
>>> >>>>>> >> The new node's CPU is close to zero and its iostats are almost
>>> >>>>>> >> zero, so I can't find another bottleneck that would keep it hanging.
>>> >>>>>> >> On IRC someone suggested I retry joining this node, e.g.
>>> >>>>>> >> decommission it and rejoin it again. I'll try it now...
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >> [1] INFO [COMPACTION-POOL:1] 2011-01-04 04:04:09,721 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>> >>>>>> >>
>>> >>>>>> >> INFO [COMPACTION-POOL:1] 2011-01-04 04:34:18,683 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3874-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3873-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvImpressions-3876-Data.db')]
>>> >>>>>> >>
>>> >>>>>> >> INFO [COMPACTION-POOL:1] 2011-01-04 04:34:19,132 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-951-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-976-Data.db'),org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvRatings-978-Data.db')]
>>> >>>>>> >>
>>> >>>>>> >> INFO [COMPACTION-POOL:1] 2011-01-04 04:34:26,486 CompactionManager.java (line 338) AntiCompacting [org.apache.cassandra.io.SSTableReader(path='/outbrain/cassandra/data/outbrain_kvdb/KvAds-6449-Data.db')]
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >> On Tue, Jan 4, 2011 at 12:45 PM, shimi <shimi.k@gmail.com> wrote:
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >> In my experience, most of the time it takes for a node to join the
>>> >>>>>> >> cluster is spent on anticompaction on the other nodes. The
>>> >>>>>> >> streaming part is very fast.
>>> >>>>>> >> Check the other nodes' logs to see if there is any node doing
>>> >>>>>> >> anticompaction. I don't remember how much data I had in the
>>> >>>>>> >> cluster when I needed to add/remove nodes. I do remember that it
>>> >>>>>> >> took a few hours.
>>> >>>>>> >>
>>> >>>>>> >> The node will join the ring only when it finishes the bootstrap.
>>> >>>>>> >> --
>>> >>>>>> >> /Ran
>>> >>>>>> >>
>>> >>>>>> >>
>>> >>>>>> >
>>> >>>>>> > --
>>> >>>>>> > /Ran
>>> >>>>>> >
>>> >>>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> /Ran
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> /Ran
>>> >>>>
>>> >>>>
>>> >>>
>>> >>
>>> >>
>>> >> --
>>> >> /Ran
>>> >>
>>> >>
>>>
>>
>>
>


-- 
/Ran

--00c09fa9c4ff93bafa04991ab7df--