zookeeper-user mailing list archives

From Flavio Junqueira <...@apache.org>
Subject Re: node 2 not rejoining cluster
Date Thu, 14 Apr 2016 09:54:25 GMT
Other than some kind of funky packet filtering rule, I'm not sure why you'd not be receiving
the ACKs.
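
A minimal sketch of one way to check basic TCP reachability from the stuck node to its peers, assuming Python is available on the host. The hostnames in `PEERS` are hypothetical placeholders, and the helper names `check_port` and `check_peers` are made up for illustration. Ports 2888 and 3888 are the ones discussed in this thread.

```python
import socket

# Hypothetical peer hostnames; replace with the hosts listed in your zoo.cfg.
PEERS = ["zk1.example.com", "zk3.example.com"]
# The two ports mentioned in this thread (3888 is where the missing ACKs were seen).
PORTS = [2888, 3888]


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_peers(peers, ports, timeout=3.0):
    """Map each (host, port) pair to whether a TCP connect succeeded."""
    return {(h, p): check_port(h, p, timeout) for h in peers for p in ports}


if __name__ == "__main__":
    for (host, port), ok in check_peers(PEERS, PORTS).items():
        print(f"{host}:{port} {'reachable' if ok else 'UNREACHABLE'}")
```

A successful connect only shows that the TCP handshake completes. As this thread shows, telnet can work on both ports while a packet capture still shows ACKs going missing, so a passing check does not rule out a filtering problem.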

I think that reconfiguring isn't the right way of addressing the problem. If you have some
underlying issue, configuration or even bad hardware, then adding more nodes will not fix
it. Even worse, it might lurk there for some time and come back to bite you later.

If you do lose a machine (e.g., permanent failure, decommission), then it does make sense
to reconfigure the ensemble.
 
-Flavio

  
> On 14 Apr 2016, at 01:12, s influxdb <elastic.l.k@gmail.com> wrote:
> 
> Thanks Flavio. 
> 
> Would you know why node 2 could not receive ACKs from the other 2 nodes?
> 
> What is the workaround in scenarios like this, where 1 node in a 3-node cluster is not
> responding?
> ** If we do a rolling restart, there is a possibility of downtime
> ** Add 2 more nodes to the configs and do a rolling restart
> ** Could you think of any way to fix node 2 so that it rejoins the cluster?
> 
> Would appreciate your reply.
> 
> 
> 
> On Tue, Apr 12, 2016 at 1:33 AM, Flavio Junqueira <fpj@apache.org> wrote:
> Good to hear you've been able to sort it out.
> 
> -Flavio
> 
> > On 12 Apr 2016, at 03:02, s influxdb <elastic.l.k@gmail.com> wrote:
> >
> > Created a parallel, independent ZooKeeper cluster on the same set of
> > machines with different ports, and that worked. This indicates the port was
> > the issue.
> >
> > On Mon, Apr 11, 2016 at 1:35 PM, s influxdb <elastic.l.k@gmail.com> wrote:
> >
> >> reboot of the server didn't help
> >>
> >> On Thu, Apr 7, 2016 at 6:50 PM, s influxdb <elastic.l.k@gmail.com> wrote:
> >>
> >>> I ran tcpdump on all the three nodes.
> >>> It looks like for every [PSH, ACK] there is a missing [ACK] from
> >>> the other nodes to this 2nd node on port 3888.
> >>>
> >>>
> >>> On Thu, Apr 7, 2016 at 1:29 PM, s influxdb <elastic.l.k@gmail.com> wrote:
> >>>
> >>>> Thanks Flavio for your quick replies.
> >>>> The zookeeper version is 3.4.6
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Apr 7, 2016 at 1:23 PM, Flavio P JUNQUEIRA <fpj@apache.org> wrote:
> >>>>
> >>>>> You need to determine why it is not receiving notification messages.
> >>>>> From the information you've given, it doesn't look like a zookeeper
> >>>>> code issue.
> >>>>>
> >>>>> BTW, which version are you using?
> >>>>>
> >>>>> -Flavio
> >>>>> On 7 Apr 2016 21:20, "s influxdb" <elastic.l.k@gmail.com> wrote:
> >>>>>
> >>>>>> Nothing on the iptables firewall.
> >>>>>>
> >>>>>> What options do I have to reconnect this node to the cluster?
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Apr 7, 2016 at 10:14 AM, s influxdb <elastic.l.k@gmail.com> wrote:
> >>>>>>
> >>>>>>> telnet works on 2888 and 3888 to the other nodes. Now I see
> >>>>>>> "java.net.SocketTimeoutException: connect timed out" messages in the
> >>>>>>> logs for node 2.
> >>>>>>>
> >>>>>>> On Thu, Apr 7, 2016 at 3:05 AM, Flavio Junqueira <fpj@apache.org> wrote:
> >>>>>>>
> >>>>>>>> I only see notifications from the node to itself. It says that it is
> >>>>>>>> connected to 1, but it doesn't seem to be receiving the notification
> >>>>>>>> from 1. It also doesn't seem to be receiving the connection request
> >>>>>>>> from 3.
> >>>>>>>>
> >>>>>>>> Last time I've seen something like this was due to iptables rules, but
> >>>>>>>> if it was working before and no configuration has changed, then I
> >>>>>>>> don't know what it could be.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>>> On 07 Apr 2016, at 05:43, s influxdb <elastic.l.k@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> this is the pastie
> >>>>>>>>> http://pastie.org/10788301
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 6, 2016 at 9:41 PM, s influxdb <elastic.l.k@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> One of our nodes hit an OOM (java.lang.OutOfMemoryError: unable to
> >>>>>>>>>> create new native thread) and then became unresponsive.
> >>>>>>>>>>
> >>>>>>>>>> We tried to add the node back to the cluster, but with no luck.
> >>>>>>>>>>
> >>>>>>>>>> It doesn't seem to receive any notification messages from the other
> >>>>>>>>>> nodes. It keeps "Sending notifications" in a loop.
> >>>>>>>>>>
> >>>>>>>>>> Please see attached the logs of the node that is out of rotation.
> >>>>>>>>>>
> >>>>>>>>>> Any inputs appreciated.
> >>>>>>>>>>
> >>>>>>>>>> Thanks

