zookeeper-user mailing list archives

From Andor Molnar <an...@cloudera.com.INVALID>
Subject Re: Observer went down with Read timed out exception
Date Wed, 04 Jul 2018 07:24:46 GMT
Unfortunately I cannot think of anything other than what Norbert already
mentioned. If the followers were stable, a problem in the DC-to-DC link could
explain why all the observers went down at the same moment. If the leader had
been overloaded, the followers would have gone down along with the
observers.
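
For context on where "Read timed out" comes from: as far as I know, in the
3.4 line a learner (follower or observer) sets the read timeout on its leader
socket to tickTime * syncLimit once it has synced. The sketch below only does
that arithmetic on zoo.cfg-style text. The class and method names are
hypothetical and not part of ZooKeeper, and the formula is my reading of the
3.4 code, so please verify it against your version.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not a ZooKeeper API.
final class LearnerReadTimeout {

    // Assumption: learner read timeout (ms) = tickTime (ms) * syncLimit (ticks).
    static int readTimeoutMs(int tickTimeMs, int syncLimitTicks) {
        return Math.multiplyExact(tickTimeMs, syncLimitTicks);
    }

    // Reads tickTime and syncLimit from zoo.cfg-style "key=value" text.
    static int readTimeoutMs(String zooCfgText) {
        Properties cfg = new Properties();
        try {
            cfg.load(new StringReader(zooCfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return readTimeoutMs(requiredInt(cfg, "tickTime"),
                             requiredInt(cfg, "syncLimit"));
    }

    // Fails loudly instead of guessing a default for a missing key.
    private static int requiredInt(Properties cfg, String key) {
        String value = cfg.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing " + key);
        }
        return Integer.parseInt(value.trim());
    }
}
```

With example values tickTime=2000 and syncLimit=5 (not taken from Ram's
config) this gives 10000 ms, so an observer that receives nothing from the
leader for about ten seconds would hit this exception.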

If none of these cases happened, I'm afraid I cannot help more. I'm not
aware of a similar, existing issue. Maybe more senior devs can comment.

However, your version is quite old. Most production clusters are running
3.4.6 or 3.4.9 as far as I know. You might want to upgrade to the latest
stable version, which is 3.4.12 at the moment. 3.4.13 will be out soon as
well.
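
If you want to check a whole fleet against a baseline, a minimal sketch
follows. It assumes the reported string has the shape quoted in this thread
("3.4.5-1392090, built on ...") and compares the numeric parts, because a
plain string comparison would rank 3.4.5 above 3.4.12. The class and method
names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a ZooKeeper API.
final class ZkVersionCheck {

    // Leading "major.minor.patch"; anything after it (revision, build date) is ignored.
    private static final Pattern VERSION =
            Pattern.compile("^(\\d+)\\.(\\d+)\\.(\\d+)");

    static int[] parse(String reported) {
        Matcher m = VERSION.matcher(reported.trim());
        if (!m.find()) {
            throw new IllegalArgumentException("unrecognized version: " + reported);
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    // True if 'reported' is strictly older than 'baseline', compared numerically.
    static boolean isOlderThan(String reported, String baseline) {
        int[] a = parse(reported);
        int[] b = parse(baseline);
        for (int i = 0; i < 3; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i];
            }
        }
        return false;
    }
}
```

For example, isOlderThan("3.4.5-1392090, built on 09/30/2012 17:52 GMT",
"3.4.12") returns true.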

Regards,
Andor




On Tue, Jul 3, 2018 at 8:13 PM, rammohan ganapavarapu <
rammohanganap@gmail.com> wrote:

> Andor,
>
> The ZK version that I use is zk_version 3.4.5-1392090, built on 09/30/2012
> 17:52 GMT.
> No auth or encryption config.
> None of my network graphs show any dip or unusual pattern, that's why I am
> thinking there may not be any n/w issue. I have those nodes in the cloud, so
> I am checking with them to see if there is any n/w issue between regions.
>
> Thanks,
> Ram
>
>
> On Tue, Jul 3, 2018 at 6:29 AM Andor Molnar <andor@cloudera.com.invalid>
> wrote:
>
> > Hi Rammohan,
> >
> > Would you please elaborate on the details of your cluster setup?
> > Which ZooKeeper version do you use?
> > Do you use authentication / encryption?
> > Would you please attach config files and log files of other nodes like
> > leader and followers?
> >
> > How did you make sure that there was no network problem at the time when
> > issue happened?
> > Would you please attach graphs / diagrams on the network traffic,
> > including latency and bandwidth usage between the affected data centers?
> >
> > Regards,
> > Andor
> >
> >
> >
> >
> > On Tue, Jul 3, 2018 at 2:56 PM, rammohan ganapavarapu <
> > rammohanganap@gmail.com> wrote:
> >
> > > Yes, I am sure there are no network issues. If the leader were busy in GC,
> > > followers in the same DC would have been shut down as well, right? But
> > > that wasn't the case.
> > >
> > > On Tue, Jul 3, 2018, 1:56 AM Norbert Kalmar <nkalmar@cloudera.com.invalid>
> > > wrote:
> > >
> > > > Hi Ram,
> > > >
> > > > Are you sure there was no network error? To me, this looks like it
> > > > could be due to failed heartbeats (as shutdown was called after the
> > > > timeout).
> > > >
> > > > It is also possible the leader was busy (maybe garbage collection
> > > > caused a pause?) - especially if you store big(ish) chunks of data in
> > > > ZooKeeper. (There is a plan to integrate JVMPauseMonitor into ZooKeeper
> > > > for this reason, actually.)
> > > >
> > > > Regards,
> > > > Norbert
> > > >
> > > > On Mon, Jul 2, 2018 at 9:13 PM rammohan ganapavarapu <
> > > > rammohanganap@gmail.com> wrote:
> > > >
> > > > > All,
> > > > >
> > > > > > I have a multi data-center ldap cluster setup, with the other
> > > > > > data-center running all observers. All of a sudden all the observer
> > > > > > threads went down with the following message. Any idea why they went
> > > > > > down? We don't see any network related issues between data-centers.
> > > > >
> > > > >
> > > > > 2018-06-29 05:32:59,036 [myid:222] - WARN
> > > > > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@79] -
> Exception
> > > when
> > > > > observing the leader
> > > > > java.net.SocketTimeoutException: Read timed out
> > > > > at java.net.SocketInputStream.socketRead0(Native Method)
> > > > > at java.net.SocketInputStream.socketRead(SocketInputStream.
> java:116)
> > > > > at java.net.SocketInputStream.read(SocketInputStream.java:170)
> > > > > at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > > > > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> > > > > at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> > > > > at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > > > > at org.apache.jute.BinaryInputArchive.readInt(
> > > BinaryInputArchive.java:63)
> > > > > at
> > > > >
> > > > >
> > > > org.apache.zookeeper.server.quorum.QuorumPacket.
> > > deserialize(QuorumPacket.java:83)
> > > > > at
> > > > >
> > > > org.apache.jute.BinaryInputArchive.readRecord(
> > > BinaryInputArchive.java:108)
> > > > > at
> > > > org.apache.zookeeper.server.quorum.Learner.readPacket(
> Learner.java:152)
> > > > > at
> > > > >
> > > > org.apache.zookeeper.server.quorum.Observer.observeLeader(
> > > Observer.java:75)
> > > > > at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> > > QuorumPeer.java:727)
> > > > > 2018-06-29 05:32:59,244 [myid:222] - INFO
> > > > > [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@137] -
> shutdown
> > > > called
> > > > > java.lang.Exception: shutdown Observer
> > > > > at
> > > > org.apache.zookeeper.server.quorum.Observer.shutdown(
> Observer.java:137)
> > > > > at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> > > QuorumPeer.java:731)
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Ram
> > > > >
> > > >
> > >
> >
>
