incubator-cassandra-user mailing list archives

From Anthony Molinaro <antho...@alumni.caltech.edu>
Subject Re: Bootstrap question
Date Wed, 21 Jul 2010 19:14:46 GMT
Sure, looks like that's in 0.6.4, so I'll probably just rebuild my server
based on the 0.6 branch, unless you want me to test just the patch for
1221?  Most likely won't get a chance to try until tomorrow, so let me
know.

Thanks,

-Anthony

On Wed, Jul 21, 2010 at 06:58:13AM -0500, Gary Dusbabek wrote:
> Anthony,
> 
> I think you're seeing the results of CASSANDRA-1221.  Each node has
> two connections with its peers.  One connection is used for gossip,
> the other for exchanging commands.  What you see with 1221 is the
> command socket getting 'stuck' after a peer is convicted by gossip and
> then recovers.  It doesn't happen every time, but it happens much of
> the time, especially with streaming.  I was able to reproduce this at
> will using loadbalance.  I never tried it under bootstrap (where the
> bootstrapping IP was previously visible on the cluster), but it seems
> very plausible.
> 
> Any chance you could apply the patch for 1221 and test?
> 
> Gary.
> 
> On Tue, Jul 20, 2010 at 16:45, Anthony Molinaro
> <anthonym@alumni.caltech.edu> wrote:
> > I see this in the old nodes
> >
> > DEBUG [WRITE-/10.220.198.15] 2010-07-20 21:15:50,366 OutboundTcpConnection.java (line 142) attempting to connect to /10.220.198.15
> > INFO [GMFD:1] 2010-07-20 21:15:50,391 Gossiper.java (line 586) Node /10.220.198.15 is now part of the cluster
> > INFO [GMFD:1] 2010-07-20 21:15:51,369 Gossiper.java (line 578) InetAddress /10.220.198.15 is now UP
> > INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,369 HintedHandOffManager.java (line 153) Started hinted handoff for endPoint /10.220.198.15
> > INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,371 HintedHandOffManager.java (line 210) Finished hinted handoff of 0 rows to endpoint /10.220.198.15
> > DEBUG [GMFD:1] 2010-07-20 21:17:20,551 StorageService.java (line 512) Node /10.220.198.15 state bootstrapping, token 28356863910078205288614550619314017621
> > DEBUG [GMFD:1] 2010-07-20 21:17:20,656 StorageService.java (line 746) Pending ranges:
> > /10.220.198.15:(21604748163853165203168832909938143241,28356863910078205288614550619314017621]
> > /10.220.198.15:(10637639655367601517656788464652024082,21604748163853165203168832909938143241]
> >
> > 10.220.198.15 is the new node
> >
> > The key ranges seem to be for the primary and replica ranges.
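
A minimal sketch of how the two pending ranges above relate to the new node's token, assuming integer tokens and half-open (left, right] ranges that may wrap around the ring. The helper name `in_range` is hypothetical and is not a Cassandra API; the token values are copied from the log output.

```python
# Token values copied from the "Pending ranges" log output in this message.
NEW_TOKEN = 28356863910078205288614550619314017621
PRIMARY = (21604748163853165203168832909938143241,
           28356863910078205288614550619314017621)
REPLICA = (10637639655367601517656788464652024082,
           21604748163853165203168832909938143241)


def in_range(token, left, right):
    """Return True if token lies in the ring interval (left, right].

    Hypothetical helper for illustration only. When left >= right the
    interval is treated as wrapping past the end of the ring.
    """
    if left < right:
        return left < token <= right
    return token > left or token <= right


if __name__ == "__main__":
    # The first range ends exactly at the new node's token.
    print(in_range(NEW_TOKEN, *PRIMARY))
    # The second range ends where the first one begins.
    print(PRIMARY[0] == REPLICA[1])
```

The first range ends at the new node's token, which is what a primary range would look like, and the second ends where the first begins, which is consistent with it being a replica range.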
> >
> > So after that, I would expect some AntiCompaction to happen on some of the
> > other nodes, but I don't see anything.
> >
> > Any clues from that output?
> >
> > I did not muck around with the Location tables.
> >
> > -Anthony
> >
> > On Mon, Jul 19, 2010 at 09:36:22PM -0500, Jonathan Ellis wrote:
> >> What gets logged on the old nodes at debug, when you try to add a
> >> single new machine after a full cluster restart?
> >>
> >> Removing Location would blow away the nodes' token information...  It
> >> should be safe if you set the InitialToken to what it used to be on
> >> each machine before bringing it up after nuking those.  Better
> >> snapshot the system keyspace first, just in case.
> >>
> >> On Sun, Jul 18, 2010 at 2:01 PM, Anthony Molinaro
> >> <anthonym@alumni.caltech.edu> wrote:
> >> > Yeah, I tried all that already and it didn't seem to work, no new nodes
> >> > will bootstrap, which makes me think there's some saved state somewhere,
> >> > preventing a new node from bootstrapping.  I think maybe the Location
> >> > sstables?  Is it safe to nuke those on all hosts and restart everything?
> >> > (I just don't want to lose actual data).
> >> >
> >> > Thanks for the ideas,
> >> >
> >> > -Anthony
> >> >
> >> > On Sun, Jul 18, 2010 at 08:09:45PM +0300, shimi wrote:
> >> >> If I have problems with never-ending bootstrapping, I do the following. I try
> >> >> each one; if it doesn't help, I try the next. It might not be the right thing
> >> >> to do, but it worked for me.
> >> >>
> >> >> 1. Restart the bootstrapping node.
> >> >> 2. If I see streaming 0/xxxx, I restart the node and all the streaming nodes.
> >> >> 3. Restart all the nodes.
> >> >> 4. If there is data in the bootstrapping node, I delete it before I restart.
> >> >>
> >> >> Good luck
> >> >> Shimi
> >> >>
> >> >> On Sun, Jul 18, 2010 at 12:21 AM, Anthony Molinaro <
> >> >> anthonym@alumni.caltech.edu> wrote:
> >> >>
> >> >> > So still waiting for any sort of answer on this one.  The cluster still
> >> >> > refuses to do anything when I bring up new nodes.  I shut down all the
> >> >> > new nodes and am waiting.  I'm guessing that maybe the old nodes have
> >> >> > some state which needs to get cleared out?  Is there anything I can do
> >> >> > at this point?  Are there alternate strategies for bootstrapping I can
> >> >> > try?  (For instance can I just scp all the sstables to all the new
> >> >> > nodes and do a repair, would that actually work?).
> >> >> >
> >> >> > Anyone seen this sort of issue?  All this is with 0.6.3 so I assume
> >> >> > eventually others will see this issue.
> >> >> >
> >> >> > -Anthony
> >> >> >
> >> >> > On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
> >> >> > > Okay, so things were pretty messed up.  I shut down all the new nodes,
> >> >> > > then the old nodes started doing the "half the ring is down" garbage, which
> >> >> > > pretty much requires a full restart of everything.  So I had to shut
> >> >> > > everything down, then bring the seed back, then the rest of the nodes,
> >> >> > > so they finally all agreed on the ring again.
> >> >> > >
> >> >> > > Then I started one of the new nodes, and have been watching the logs.  So
> >> >> > > far it has been 2 hours since the "Bootstrapping" message appeared in the
> >> >> > > new log and nothing has happened.  No anticompaction messages anywhere.  There's
> >> >> > > one node compacting, but it's on the other end of the ring, so nowhere near
> >> >> > > that new node.  I'm wondering if it will ever get data at this point.
> >> >> > >
> >> >> > > Is there something else I should try?  The only thing I can think of
> >> >> > > is deleting the system directory on the new node, and restarting, so
> >> >> > > I'll try that and see if it does anything.
> >> >> > >
> >> >> > > -Anthony
> >> >> > >
> >> >> > > On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
> >> >> > > > On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
> >> >> > > > <anthonym@alumni.caltech.edu> wrote:
> >> >> > > > > Is the fact that 2 new nodes are in the range messing it up?
> >> >> > > >
> >> >> > > > Probably.
> >> >> > > >
> >> >> > > > >  And if so
> >> >> > > > > how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, then bringing
> >> >> > > > > up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
> >> >> > > >
> >> >> > > > Yes.
> >> >> > > >
> >> >> > > > You might have to restart the old nodes too to clear out the confusion.
> >> >> > > >
> >> >> > > > --
> >> >> > > > Jonathan Ellis
> >> >> > > > Project Chair, Apache Cassandra
> >> >> > > > co-founder of Riptano, the source for professional Cassandra support
> >> >> > > > http://riptano.com
> >> >> > >
> >> >> > > --
> >> >> > > ------------------------------------------------------------------------
> >> >> > > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
> >> >> >
> >> >> > --
> >> >> > ------------------------------------------------------------------------
> >> >> > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
> >> >> >
> >> >
> >> > --
> >> > ------------------------------------------------------------------------
> >> > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
> >> >
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of Riptano, the source for professional Cassandra support
> >> http://riptano.com
> >
> > --
> > ------------------------------------------------------------------------
> > Anthony Molinaro                           <anthonym@alumni.caltech.edu>
> >

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@alumni.caltech.edu>
