zookeeper-user mailing list archives

From Deepak Jagtap <deepak.jag...@maxta.com>
Subject Re: zk server falling apart from quorum due to connection loss and couldn't connect back
Date Tue, 28 Jan 2014 20:44:08 GMT
Hi German,

I went through the zookeeper logs again and it looks like a zookeeper bug
to me.
Leader election was initiated and never completed, because one zookeeper
server went into a zombie (hung) state.
Please note that zookeeper was running on all the nodes when this happened.
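For anyone hitting a similar hang: the telnet check described further down
this thread (verifying whether the peer and election ports accept
connections) can be scripted. A minimal sketch, assuming the 169.254.1.x
addresses and the 2888/3888 ports from the zoo.cfg.dynamic quoted below;
adapt them to your ensemble:

```python
import socket

def check_port(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Peer (2888) and election (3888) ports from zoo.cfg.dynamic.
    servers = {"S1": "169.254.1.1", "S2": "169.254.1.2", "S3": "169.254.1.3"}
    for name, host in servers.items():
        for port in (2888, 3888):
            state = "open" if check_port(host, port) else "unreachable"
            print(f"{name} {host}:{port} -> {state}")
```

In the incident below, S2 would show both ports unreachable until the
server process was restarted.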

Thanks & Regards,
Deepak




On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <deepak.jagtap@maxta.com>wrote:

> Dropbox link for the log files:
> https://dl.dropboxusercontent.com/u/36429721/zklog.tgz
>
>
> On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <deepak.jagtap@maxta.com>wrote:
>
>> Jira has an attachment limit of 10MB, hence I uploaded the log files to
>> Dropbox.
>>
>> Please refer to events close to the "2014-01-07 10:34:01" timestamp on
>> all nodes.
>>
>> Thanks & Regards,
>>
>> Deepak
>>
>>
>> On Mon, Jan 27, 2014 at 12:34 PM, German Blanco <
>> german.blanco.blanco@gmail.com> wrote:
>>
>>> I don't see why it would be a problem for anybody.
>>> If this turns out not to be a problem in ZooKeeper, we can always close
>>> the bug.
>>>
>>>
>>> On Mon, Jan 27, 2014 at 8:33 PM, Deepak Jagtap <deepak.jagtap@maxta.com
>>> >wrote:
>>>
>>> > Hi German,
>>> >
>>> > Thanks for the follow-up!
>>> > I have log files for all the servers, and they are quite big (greater
>>> > than 25MB), hence I could not send them through mail.
>>> > Is it ok if I file a bug on this and upload the logs there?
>>> >
>>> > Thanks & Regards,
>>> > Deepak
>>> >
>>> >
>>> >
>>> > On Sun, Jan 26, 2014 at 1:53 AM, German Blanco <
>>> > german.blanco.blanco@gmail.com> wrote:
>>> >
>>> > > Hello Deepak,
>>> > >
>>> > > sorry for the slow response.
>>> > > I can't figure out what might be going on here without the log files.
>>> > > The traces you see in S2 do not indicate any problem, as far as I can
>>> > > see. It seems that you have a client running on S2 that tries to
>>> > > connect to that server. Since S2 hasn't been able to join a quorum,
>>> > > the server thread attending clients hasn't been started and the
>>> > > connection is rejected.
>>> > > Maybe you could start by uploading the traces around the connection
>>> > > loss between S2 and S3 (say, a couple of minutes before and after).
>>> > >
>>> > > Regards,
>>> > >
>>> > > German.
>>> > >
>>> > >
>>> > > On Thu, Jan 23, 2014 at 8:42 PM, Deepak Jagtap <
>>> deepak.jagtap@maxta.com
>>> > > >wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > zoo.cfg is :
>>> > > >
>>> > > > maxClientCnxns=50
>>> > > > # The number of milliseconds of each tick
>>> > > > tickTime=2000
>>> > > > # The number of ticks that the initial
>>> > > > # synchronization phase can take
>>> > > > initLimit=10
>>> > > > # The number of ticks that can pass between
>>> > > > # sending a request and getting an acknowledgement
>>> > > > syncLimit=5
>>> > > > # the directory where the snapshot is stored.
>>> > > > dataDir=/var/lib/zookeeper
>>> > > > # the port at which the clients will connect
>>> > > > clientPort=2181
>>> > > >
>>> > > > autopurge.snapRetainCount=3
>>> > > > autopurge.purgeInterval=1
>>> > > > dynamicConfigFile=/etc/maxta/zookeeper/conf/zoo.cfg.dynamic
>>> > > >
>>> > > >
>>> > > >
>>> > > > zoo.cfg.dynamic is:
>>> > > >
>>> > > > server.1=169.254.1.1:2888:3888:participant;0.0.0.0:2181
>>> > > > server.2=169.254.1.2:2888:3888:participant;0.0.0.0:2181
>>> > > > server.3=169.254.1.3:2888:3888:participant;0.0.0.0:2181
>>> > > > version=1
>>> > > >
>>> > > >
>>> > > > Thanks & Regards,
>>> > > > Deepak
>>> > > >
>>> > > >
>>> > > > On Thu, Jan 23, 2014 at 11:30 AM, German Blanco <
>>> > > > german.blanco.blanco@gmail.com> wrote:
>>> > > >
>>> > > > > Sorry, but the attachment didn't make it through.
>>> > > > > It might be safer to put the files somewhere on the web and send
>>> > > > > a link.
>>> > > > >
>>> > > > >
>>> > > > > On Thu, Jan 23, 2014 at 8:00 PM, Deepak Jagtap <
>>> > > deepak.jagtap@maxta.com
>>> > > > > >wrote:
>>> > > > >
>>> > > > > > Hi German,
>>> > > > > >
>>> > > > > > Please find zookeeper config files attached.
>>> > > > > >
>>> > > > > > Thanks & Regards,
>>> > > > > > Deepak
>>> > > > > >
>>> > > > > >
>>> > > > > > On Thu, Jan 23, 2014 at 12:59 AM, German Blanco <
>>> > > > > > german.blanco.blanco@gmail.com> wrote:
>>> > > > > >
>>> > > > > >> Hello!
>>> > > > > >>
>>> > > > > >> Could you please post your configuration files?
>>> > > > > >>
>>> > > > > >> Regards,
>>> > > > > >>
>>> > > > > >> German.
>>> > > > > >>
>>> > > > > >>
>>> > > > > >> On Thu, Jan 23, 2014 at 2:28 AM, Deepak Jagtap <
>>> > > > deepak.jagtap@maxta.com
>>> > > > > >> >wrote:
>>> > > > > >>
>>> > > > > >> > Hi All,
>>> > > > > >> >
>>> > > > > >> > We have deployed zookeeper version 3.5.0.1515976, with 3 zk
>>> > > > > >> > servers in the quorum.
>>> > > > > >> > The problem we are facing is that one zookeeper server in
>>> > > > > >> > the quorum falls apart, and never becomes part of the
>>> > > > > >> > cluster until we restart the zookeeper server on that node.
>>> > > > > >> >
>>> > > > > >> > Our interpretation from the zookeeper logs on all nodes is
>>> > > > > >> > as follows:
>>> > > > > >> > (For simplicity assume S1 => zk server 1, S2 => zk server 2,
>>> > > > > >> > S3 => zk server 3.)
>>> > > > > >> > Initially S3 is the leader while S1 and S2 are followers.
>>> > > > > >> >
>>> > > > > >> > S2 hits a 46 sec latency while fsyncing the write-ahead
>>> > > > > >> > log, which results in loss of connection with S3.
>>> > > > > >> > S3 in turn prints the following error message:
>>> > > > > >> >
>>> > > > > >> > Unexpected exception causing shutdown while sock still open
>>> > > > > >> > java.net.SocketTimeoutException: Read timed out
>>> > > > > >> > Stack trace
>>> > > > > >> > ******* GOODBYE /169.254.1.2:47647(S2) ********
>>> > > > > >> >
>>> > > > > >> > S2 in this case closes its connection with S3 (the leader)
>>> > > > > >> > and shuts down the follower, with the following log
>>> > > > > >> > messages:
>>> > > > > >> > Closing connection to leader, exception during packet send
>>> > > > > >> > java.net.SocketException: Socket close
>>> > > > > >> > Follower@194] - shutdown called
>>> > > > > >> > java.lang.Exception: shutdown Follower
>>> > > > > >> >
>>> > > > > >> > After this point S3 could never reestablish a connection
>>> > > > > >> > with S2, and the leader election mechanism keeps failing.
>>> > > > > >> > S3 now keeps printing the following message repeatedly:
>>> > > > > >> > Cannot open channel to 2 at election address /169.254.1.2:3888
>>> > > > > >> > java.net.ConnectException: Connection refused.
>>> > > > > >> >
>>> > > > > >> > While S3 is in this state, S2 repeatedly keeps printing the
>>> > > > > >> > following message:
>>> > > > > >> > INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181
>>> > > > > >> > :NIOServerCnxnFactory$AcceptThread@296] - Accepted socket
>>> > > > > >> > connection from /127.0.0.1:60667
>>> > > > > >> > Exception causing close of session 0x0: ZooKeeperServer not
>>> > > > > >> > running
>>> > > > > >> > Closed socket connection for client /127.0.0.1:60667 (no
>>> > > > > >> > session established for client)
>>> > > > > >> >
>>> > > > > >> > Leader election never completes successfully, causing S2 to
>>> > > > > >> > fall apart from the quorum.
>>> > > > > >> > S2 was out of quorum for almost 1 week.
>>> > > > > >> >
>>> > > > > >> > While debugging this issue, we found that both the election
>>> > > > > >> > and peer connection ports on S2 cannot be reached via
>>> > > > > >> > telnet from any of the nodes (S1, S2, S3). Network
>>> > > > > >> > connectivity is not the issue. Later, we restarted the ZK
>>> > > > > >> > server on S2 (service zookeeper-server restart) -- now we
>>> > > > > >> > could telnet to both ports, and S2 joined the ensemble
>>> > > > > >> > after a leader election attempt.
>>> > > > > >> > Any idea what might be forcing S2 into a situation where it
>>> > > > > >> > won't accept any connections on the leader election and
>>> > > > > >> > peer connection ports?
>>> > > > > >> >
>>> > > > > >> > Should I file a jira on this and upload all the log files
>>> > > > > >> > (close to 250MB each) while submitting it?
>>> > > > > >> >
>>> > > > > >> > Thanks & Regards,
>>> > > > > >> > Deepak
>>> > > > > >>
>>> > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
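As a rough sanity check (not an authoritative diagnosis), the timeouts
implied by the zoo.cfg quoted above can be worked out from tickTime,
initLimit and syncLimit. The 46-second fsync stall comfortably exceeds
them, so the leader's read timeout itself is expected behaviour; the open
question is why S2 never rejoined afterwards:

```python
# Values from the zoo.cfg quoted earlier in this thread.
tick_ms = 2000      # tickTime: length of one tick, in ms
init_limit = 10     # initLimit: ticks allowed for initial sync with the leader
sync_limit = 5      # syncLimit: ticks allowed between a request and its ack

init_timeout_ms = init_limit * tick_ms   # 20,000 ms
sync_timeout_ms = sync_limit * tick_ms   # 10,000 ms

fsync_stall_ms = 46_000  # latency S2 reported while fsyncing its write-ahead log

print(f"initLimit timeout: {init_timeout_ms} ms")
print(f"syncLimit timeout: {sync_timeout_ms} ms")
print(f"fsync stall exceeds sync timeout: {fsync_stall_ms > sync_timeout_ms}")
```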
