zookeeper-user mailing list archives

From rammohan ganapavarapu <rammohanga...@gmail.com>
Subject Re: Observer went down with Read timed out exception
Date Wed, 04 Jul 2018 17:22:41 GMT
Andor,

Thanks for your time. I am waiting for a stable 3.5 release before upgrading.
The log says the read timed out; what kind of packet or data is the observer
reading from the leader?

Ram

On Wed, Jul 4, 2018, 12:24 AM Andor Molnar <andor@cloudera.com.invalid>
wrote:

> Unfortunately I cannot imagine anything other than what Norbert already
> mentioned. If the followers were stable, a problem in the DC-DC link could
> explain why all the observers have gone in a moment. If it had been a
> problem with leader overloading, even the followers would have gone with
> the observers too.
>
> If none of these cases happened, I'm afraid I cannot help more. I'm not
> aware of a similar, existing issue. Maybe more senior devs can comment.
>
> However, your version is quite old. Most production clusters are running
> 3.4.6 or 3.4.9 as far as I know. You might want to upgrade to the
> latest stable version, which is 3.4.12 at the moment. 3.4.13 will be
> out soon as well.
>
> Regards,
> Andor
>
>
>
>
> On Tue, Jul 3, 2018 at 8:13 PM, rammohan ganapavarapu <
> rammohanganap@gmail.com> wrote:
>
> > Andor,
> >
> > The ZK version I use is zk_version 3.4.5-1392090, built on 09/30/2012
> > 17:52 GMT.
> > No authentication or encryption is configured.
> > None of my network graphs show any dip or unusual pattern, which is why I
> > think there may not be any network issue. My nodes are in the cloud, so I am
> > checking with the provider to see if there was any network issue between
> > regions.
> >
> > Thanks,
> > Ram
> >
> >
> > On Tue, Jul 3, 2018 at 6:29 AM Andor Molnar <andor@cloudera.com.invalid>
> > wrote:
> >
> > > Hi Rammohan,
> > >
> > > Would you please elaborate on the details of your cluster setup?
> > > Which ZooKeeper version do you use?
> > > Do you use authentication / encryption?
> > > Would you please attach config files and log files of other nodes like
> > > leader and followers?
> > >
> > > How did you make sure that there was no network problem at the time when
> > > issue happened?
> > > Would you please attach graphs / diagrams on the network traffic including
> > > latency and bandwidth usage between the affected data centers?
> > >
> > > Regards,
> > > Andor
> > >
> > >
> > >
> > >
> > > On Tue, Jul 3, 2018 at 2:56 PM, rammohan ganapavarapu <
> > > rammohanganap@gmail.com> wrote:
> > >
> > > > Yes, I am sure there were no network issues. If the leader had been busy
> > > > in GC, the followers in the same DC would have shut down as well, right?
> > > > But that wasn't the case.
> > > >
> > > > On Tue, Jul 3, 2018, 1:56 AM Norbert Kalmar <nkalmar@cloudera.com.invalid>
> > > > wrote:
> > > >
> > > > > Hi Ram,
> > > > >
> > > > > Are you sure there were no network errors? To me, this looks like it
> > > > > could be due to failed heartbeats (as shutdown was called after the
> > > > > timeout).
> > > > >
> > > > > It is also possible the leader was busy (maybe a garbage collection
> > > > > pause?), especially if you store big(ish) chunks of data in ZooKeeper.
> > > > > (There is actually a plan to integrate JVMPauseMonitor into ZooKeeper
> > > > > for this reason.)
> > > > >
> > > > > Regards,
> > > > > Norbert
> > > > >
> > > > > On Mon, Jul 2, 2018 at 9:13 PM rammohan ganapavarapu <
> > > > > rammohanganap@gmail.com> wrote:
> > > > >
> > > > > > All,
> > > > > >
> > > > > > I have a multi-data-center ldap cluster setup in which the other
> > > > > > data center contains only observers. All of a sudden, all the
> > > > > > observer threads went down with the following message. Any idea why
> > > > > > they went down? We don't see any network-related issues between the
> > > > > > data centers.
> > > > > >
> > > > > >
> > > > > > 2018-06-29 05:32:59,036 [myid:222] - WARN [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@79] - Exception when observing the leader
> > > > > > java.net.SocketTimeoutException: Read timed out
> > > > > > at java.net.SocketInputStream.socketRead0(Native Method)
> > > > > > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> > > > > > at java.net.SocketInputStream.read(SocketInputStream.java:170)
> > > > > > at java.net.SocketInputStream.read(SocketInputStream.java:141)
> > > > > > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> > > > > > at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> > > > > > at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > > > > > at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > > > > > at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> > > > > > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> > > > > > at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> > > > > > at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:75)
> > > > > > at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:727)
> > > > > > 2018-06-29 05:32:59,244 [myid:222] - INFO [QuorumPeer[myid=222]/0:0:0:0:0:0:0:0:2181:Observer@137] - shutdown called
> > > > > > java.lang.Exception: shutdown Observer
> > > > > > at org.apache.zookeeper.server.quorum.Observer.shutdown(Observer.java:137)
> > > > > > at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:731)
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Ram
> > > > > >
> > > > >
> > > >
> > >
> >
>
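
On the question of what was being read: the stack trace in the original report shows the exception raised inside DataInputStream.readInt, called from QuorumPacket.deserialize via Learner.readPacket. That suggests the observer was blocked waiting for the first bytes of the next packet from the leader and nothing arrived before the socket read timeout expired. The sketch below reproduces only that mechanism with plain JDK sockets. It is not ZooKeeper code: the class name, the method name, the silent local peer and the timeout value are all hypothetical, chosen for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of ZooKeeper.
class ReadTimeoutSketch {

    /**
     * Connects to a local peer that accepts the connection but never writes,
     * then blocks in readInt() until the socket read timeout expires.
     * Returns true if the read ended with SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket reader = new Socket(loopback, silentPeer.getLocalPort());
             Socket accepted = silentPeer.accept()) {
            // Assumed timeout value; a real deployment derives its own.
            reader.setSoTimeout(timeoutMs);
            DataInputStream in = new DataInputStream(reader.getInputStream());
            try {
                in.readInt();      // waits for 4 bytes that never arrive
                return false;      // data arrived, so no timeout
            } catch (SocketTimeoutException e) {
                return true;       // same exception type as in the log
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(500));
    }
}
```

In this sketch the reader has no way to tell whether the peer is paused, overloaded, or unreachable; any of those leaves the stream silent and produces the same exception, which is consistent with the network and GC hypotheses discussed in the thread.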
