cassandra-user mailing list archives

From Gary Dusbabek <gdusba...@gmail.com>
Subject Re: Bootstrap question
Date Wed, 21 Jul 2010 11:58:13 GMT
Anthony,

I think you're seeing the results of CASSANDRA-1221.  Each node has
two connections with its peers.  One connection is used for gossip,
the other for exchanging commands.  What you see with 1221 is the
command socket getting 'stuck' after a peer is convicted by gossip and
then recovers.  It doesn't happen every time, but it happens much of
the time, especially with streaming.  I was able to reproduce this at
will using loadbalance.  I never tried it under bootstrap (where the
bootstrapping IP was previously visible on the cluster), but it seems
very plausible there too.

Any chance you could apply the patch for 1221 and test?

Gary.

On Tue, Jul 20, 2010 at 16:45, Anthony Molinaro
<anthonym@alumni.caltech.edu> wrote:
> I see this in the old nodes
>
> DEBUG [WRITE-/10.220.198.15] 2010-07-20 21:15:50,366 OutboundTcpConnection.java (line 142) attempting to connect to /10.220.198.15
> INFO [GMFD:1] 2010-07-20 21:15:50,391 Gossiper.java (line 586) Node /10.220.198.15 is now part of the cluster
> INFO [GMFD:1] 2010-07-20 21:15:51,369 Gossiper.java (line 578) InetAddress /10.220.198.15 is now UP
> INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,369 HintedHandOffManager.java (line 153) Started hinted handoff for endPoint /10.220.198.15
> INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,371 HintedHandOffManager.java (line 210) Finished hinted handoff of 0 rows to endpoint /10.220.198.15
> DEBUG [GMFD:1] 2010-07-20 21:17:20,551 StorageService.java (line 512) Node /10.220.198.15 state bootstrapping, token 28356863910078205288614550619314017621
> DEBUG [GMFD:1] 2010-07-20 21:17:20,656 StorageService.java (line 746) Pending ranges:
> /10.220.198.15:(21604748163853165203168832909938143241,28356863910078205288614550619314017621]
> /10.220.198.15:(10637639655367601517656788464652024082,21604748163853165203168832909938143241]
>
> 10.220.198.15 is the new node
>
> The key ranges seem to be for the primary and replica ranges.
>
> So after that, I would expect some AntiCompaction to happen on some of the
> other nodes, but I don't see anything.
>
> Any clues from that output?
>
> I did not muck around with the Location tables.
>
> -Anthony
>
> On Mon, Jul 19, 2010 at 09:36:22PM -0500, Jonathan Ellis wrote:
>> What gets logged on the old nodes at debug, when you try to add a
>> single new machine after a full cluster restart?
>>
>> Removing Location would blow away the nodes' token information...  It
>> should be safe if you set the InitialToken to what it used to be on
>> each machine before bringing it up after nuking those.  Better
>> snapshot the system keyspace first, just in case.
>>
>> On Sun, Jul 18, 2010 at 2:01 PM, Anthony Molinaro
>> <anthonym@alumni.caltech.edu> wrote:
>> > Yeah, I tried all that already and it didn't seem to work, no new nodes
>> > will bootstrap, which makes me think there's some saved state somewhere,
>> > preventing a new node from bootstrapping.  I think maybe the Location
>> > sstables?  Is it safe to nuke those on all hosts and restart everything?
>> > (I just don't want to lose actual data).
>> >
>> > Thanks for the ideas,
>> >
>> > -Anthony
>> >
>> > On Sun, Jul 18, 2010 at 08:09:45PM +0300, shimi wrote:
>> >> If I have problems with never-ending bootstrapping I do the following. I try
>> >> each one; if it doesn't help I try the next. It might not be the right thing
>> >> to do but it worked for me.
>> >>
>> >> 1. Restart the bootstrapping node
>> >> 2. If I see streaming 0/xxxx I restart the node and all the streaming nodes
>> >> 3. Restart all the nodes
>> >> 4. If there is data in the bootstrapping node I delete it before I restart.
>> >>
>> >> Good luck
>> >> Shimi
>> >>
>> >> On Sun, Jul 18, 2010 at 12:21 AM, Anthony Molinaro <
>> >> anthonym@alumni.caltech.edu> wrote:
>> >>
>> >> > So still waiting for any sort of answer on this one.  The cluster still
>> >> > refuses to do anything when I bring up new nodes.  I shut down all the
>> >> > new nodes and am waiting.  I'm guessing that maybe the old nodes have
>> >> > some state which needs to get cleared out?  Is there anything I can do
>> >> > at this point?  Are there alternate strategies for bootstrapping I can
>> >> > try?  (For instance can I just scp all the sstables to all the new
>> >> > nodes and do a repair, would that actually work?).
>> >> >
>> >> > Anyone seen this sort of issue?  All this is with 0.6.3 so I assume
>> >> > eventually others will see this issue.
>> >> >
>> >> > -Anthony
>> >> >
>> >> > On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
>> >> > > Okay, so things were pretty messed up.  I shut down all the new nodes,
>> >> > > then the old nodes started doing the "half the ring is down" garbage which
>> >> > > pretty much requires a full restart of everything.  So I had to shut
>> >> > > everything down, then bring the seed back, then the rest of the nodes,
>> >> > > so they finally all agreed on the ring again.
>> >> > >
>> >> > > Then I started one of the new nodes, and have been watching the logs, so
>> >> > > far 2 hours since the "Bootstrapping" message appeared in the new
>> >> > > log and nothing has happened.  No anticompaction messages anywhere, there's
>> >> > > one node compacting, but it's on the other end of the ring, so nowhere near
>> >> > > that new node.  I'm wondering if it will ever get data at this point.
>> >> > >
>> >> > > Is there something else I should try?  The only thing I can think of
>> >> > > is deleting the system directory on the new node, and restarting, so
>> >> > > I'll try that and see if it does anything.
>> >> > >
>> >> > > -Anthony
>> >> > >
>> >> > > On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
>> >> > > > On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
>> >> > > > <anthonym@alumni.caltech.edu> wrote:
>> >> > > > > Is the fact that 2 new nodes are in the range messing it up?
>> >> > > >
>> >> > > > Probably.
>> >> > > >
>> >> > > > >  And if so
>> >> > > > > how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, then bringing
>> >> > > > > up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
>> >> > > >
>> >> > > > Yes.
>> >> > > >
>> >> > > > You might have to restart the old nodes too to clear out the confusion.
>> >> > > >
>> >> > > > --
>> >> > > > Jonathan Ellis
>> >> > > > Project Chair, Apache Cassandra
>> >> > > > co-founder of Riptano, the source for professional Cassandra support
>> >> > > > http://riptano.com
>> >> > >
>> >> > > --
>> >> > > ------------------------------------------------------------------------
>> >> > > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
>> >> >
>> >> > --
>> >> > ------------------------------------------------------------------------
>> >> > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
>> >> >
>> >
>> > --
>> > ------------------------------------------------------------------------
>> > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <anthonym@alumni.caltech.edu>
>
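
The two "Pending ranges" lines in the debug output quoted above use Cassandra's (left, right] range notation: the left token is excluded and the right token is included, and a range whose left token is not smaller than its right token wraps around the ring. Below is a minimal Python sketch, assuming that notation and the log line layout shown above, for parsing such lines and checking which pending range a token falls into. The names RANGE_RE, parse_pending_range and in_range are illustrative only and are not part of Cassandra.

```python
import re

# Matches a pending-range line such as "/10.220.198.15:(216...,283...]".
# The layout is assumed from the debug output quoted in this thread.
RANGE_RE = re.compile(r"^/(?P<ip>[\d.]+):\((?P<left>\d+),(?P<right>\d+)\]$")


def parse_pending_range(line):
    """Return (ip, left, right) for one pending-range line, or None if it does not match."""
    # Drop any leading mail-quote markers ("> ") and surrounding whitespace.
    cleaned = line.strip().lstrip("> ").strip()
    m = RANGE_RE.match(cleaned)
    if m is None:
        return None
    return m.group("ip"), int(m.group("left")), int(m.group("right"))


def in_range(token, left, right):
    """True if token lies in (left, right], treating left >= right as a wrap-around range."""
    if left < right:
        return left < token <= right
    # Wrapping range: everything after left, or up to and including right.
    return token > left or token <= right


if __name__ == "__main__":
    lines = [
        "> /10.220.198.15:(21604748163853165203168832909938143241,28356863910078205288614550619314017621]",
        "> /10.220.198.15:(10637639655367601517656788464652024082,21604748163853165203168832909938143241]",
    ]
    bootstrap_token = 28356863910078205288614550619314017621
    for line in lines:
        ip, left, right = parse_pending_range(line)
        print(ip, in_range(bootstrap_token, left, right))
```

With the two lines from the log, the bootstrapping token is the inclusive right endpoint of the first range, which is consistent with Anthony's reading that the first range is the new node's primary range and the second a replica range.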
