zookeeper-user mailing list archives

From Aaron Zimmerman <azimmer...@sproutsocial.com>
Subject Re: entire cluster dies with EOFException
Date Mon, 14 Jul 2014 15:20:55 GMT
Closing the loop on this: it appears that upping the initLimit did resolve
the issue.  Thanks all for the help.
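
For anyone who hits the same thing: the change amounted to raising initLimit
in zoo.cfg, roughly along these lines (values are illustrative rather than
what we settled on; initLimit and syncLimit are counted in ticks, so the real
time budget is tickTime * limit):

    # zoo.cfg (illustrative values only)
    tickTime=2000
    # ticks a follower gets to connect to the leader and finish the initial
    # sync (snapshot transfer / txn log replay) before the leader gives up
    initLimit=30
    # ticks a follower may fall behind the leader once it is in sync
    syncLimit=10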

Thanks,

Aaron Zimmerman


On Tue, Jul 8, 2014 at 4:40 PM, Flavio Junqueira <
fpjunqueira@yahoo.com.invalid> wrote:

> Agreed, but we need that check because we expect bytes for the checksum
> computation right underneath. The bit that's odd is that we make the same
> check again below:
>
>         try {
>                 long crcValue = ia.readLong("crcvalue");
>                 byte[] bytes = Util.readTxnBytes(ia);
>                 // Since we preallocate, we define EOF to be an
>                 if (bytes == null || bytes.length==0) {
>                     throw new EOFException("Failed to read " + logFile);
>                 }
>                 // EOF or corrupted record
>                 // validate CRC
>                 Checksum crc = makeChecksumAlgorithm();
>                 crc.update(bytes, 0, bytes.length);
>                 if (crcValue != crc.getValue())
>                     throw new IOException(CRC_ERROR);
>                 if (bytes == null || bytes.length == 0)
>                     return false;
>                 hdr = new TxnHeader();
>                 record = SerializeUtils.deserializeTxn(bytes, hdr);
>             } catch (EOFException e) {
>
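> To make that concrete, here is a rough sketch of the same block with only the
> first check kept (an illustration only, not the actual ZooKeeper code or a
> proposed patch; once the EOFException is thrown, the second bytes check can
> never fire):
>
>         try {
>                 long crcValue = ia.readLong("crcvalue");
>                 byte[] bytes = Util.readTxnBytes(ia);
>                 // zero-length read == preallocated tail reached; treat as EOF
>                 if (bytes == null || bytes.length == 0) {
>                     throw new EOFException("Failed to read " + logFile);
>                 }
>                 // validate CRC
>                 Checksum crc = makeChecksumAlgorithm();
>                 crc.update(bytes, 0, bytes.length);
>                 if (crcValue != crc.getValue())
>                     throw new IOException(CRC_ERROR);
>                 hdr = new TxnHeader();
>                 record = SerializeUtils.deserializeTxn(bytes, hdr);
>             } catch (EOFException e) {
>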
> I'm moving this discussion to the jira, btw.
>
> -Flavio
>
> On 07 Jul 2014, at 22:03, Aaron Zimmerman <azimmerman@sproutsocial.com>
> wrote:
>
> > Flavio,
> >
> > Yes, that is the initial error, and then the nodes in the cluster are
> > restarted but fail to come back up with:
> >
> > 2014-07-04 12:58:52,734 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> > 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
> > 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
> > 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
> >
> > Thanks,
> >
> > AZ
> >
> >
> > On Mon, Jul 7, 2014 at 3:33 PM, Flavio Junqueira <
> > fpjunqueira@yahoo.com.invalid> wrote:
> >
> >> I'm a bit confused; the stack trace you reported was this one:
> >>
> >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> >> java.io.EOFException
> >>       at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>       at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>       at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>       at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> >>       at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> >>       at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >>       at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> >>
> >>
> >> That's in a different part of the code.
> >>
> >> -Flavio
> >>
> >> On 07 Jul 2014, at 18:50, Aaron Zimmerman <azimmerman@sproutsocial.com>
> >> wrote:
> >>
> >>> Util.readTxnBytes reads from the buffer and, if the length is 0, it returns
> >>> the zero-length array, seemingly indicating the end of the file.
> >>>
> >>> Then this is detected in FileTxnLog.java:671:
> >>>
> >>>               byte[] bytes = Util.readTxnBytes(ia);
> >>>               // Since we preallocate, we define EOF to be an
> >>>               if (bytes == null || bytes.length==0) {
> >>>                   throw new EOFException("Failed to read " + logFile);
> >>>               }
> >>>
> >>>
> >>> This exception is caught a few lines later, and the streams closed etc.
> >>>
> >>> So this seems to be not really an error condition, but a signal that the
> >>> entire file has been read? Is this exception a red herring?
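> >>>
> >>> To illustrate that reading (toy code only; the length-prefixed format and
> >>> the class and method names below are made up, and this is not the ZooKeeper
> >>> implementation), a reader over a zero-padded, preallocated file can use
> >>> EOFException purely as a loop-termination signal rather than as an error:
> >>>
> >>>     import java.io.DataInputStream;
> >>>     import java.io.EOFException;
> >>>     import java.io.FileInputStream;
> >>>     import java.io.IOException;
> >>>
> >>>     public class PreallocatedLogReader {
> >>>         // Counts length-prefixed records until the zero-filled,
> >>>         // preallocated tail (or the physical end of file) is reached.
> >>>         public static int countRecords(String logFile) throws IOException {
> >>>             int count = 0;
> >>>             try (DataInputStream in =
> >>>                     new DataInputStream(new FileInputStream(logFile))) {
> >>>                 while (true) {
> >>>                     try {
> >>>                         int len = in.readInt();   // record length header
> >>>                         if (len == 0) {
> >>>                             // zero length: we hit the preallocated region
> >>>                             throw new EOFException("Failed to read " + logFile);
> >>>                         }
> >>>                         byte[] bytes = new byte[len];
> >>>                         in.readFully(bytes);      // record payload
> >>>                         count++;
> >>>                     } catch (EOFException e) {
> >>>                         // expected termination signal, not corruption
> >>>                         break;
> >>>                     }
> >>>                 }
> >>>             }
> >>>             return count;
> >>>         }
> >>>     }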
> >>>
> >>>
> >>>
> >>>
> >>> On Mon, Jul 7, 2014 at 11:50 AM, Raúl Gutiérrez Segalés <rgs@itevenworks.net>
> >>> wrote:
> >>>
> >>>> On 7 July 2014 09:39, Aaron Zimmerman <azimmerman@sproutsocial.com> wrote:
> >>>>
> >>>>> What I don't understand is how the entire cluster could die in such a
> >>>>> situation.  I was able to load zookeeper locally using the snapshot and
> >>>>> 10g log file without apparent issue.
> >>>>
> >>>>
> >>>> Sure, but it's syncing up with other learners that becomes challenging
> >>>> when having either big snapshots or too many txnlogs, right?
> >>>>
> >>>>
> >>>>> I can see how large amounts of data could cause latency issues in syncing,
> >>>>> causing a single worker to die, but how would that explain the node's
> >>>>> inability to restart?  When the server replays the log file, does it have
> >>>>> to sync the transactions to other nodes while it does so?
> >>>>>
> >>>>
> >>>> Given that your txn churn is so big, by the time it finishes reading from
> >>>> disk it'll need to catch up with the quorum... how many txns have happened
> >>>> by that point? By the way, we use this patch:
> >>>>
> >>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1804
> >>>>
> >>>> to measure the transaction rate. Do you have any approximation of what your
> >>>> transaction rate might be?
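> >>>>
> >>>> If patching isn't convenient, one crude way to approximate it (an
> >>>> illustration only; the host, port and sample interval below are arbitrary,
> >>>> and the diff is only meaningful while the epoch doesn't change between
> >>>> samples) is to read the zxid from the 'srvr' four-letter command twice and
> >>>> diff its low 32 bits, which count transactions within the current epoch:
> >>>>
> >>>>     import java.io.BufferedReader;
> >>>>     import java.io.InputStreamReader;
> >>>>     import java.net.Socket;
> >>>>
> >>>>     public class TxnRateProbe {
> >>>>         // Sends the 'srvr' four-letter command and returns the zxid it reports.
> >>>>         static long readZxid(String host, int port) throws Exception {
> >>>>             try (Socket s = new Socket(host, port)) {
> >>>>                 s.getOutputStream().write("srvr".getBytes("US-ASCII"));
> >>>>                 s.getOutputStream().flush();
> >>>>                 BufferedReader in = new BufferedReader(
> >>>>                         new InputStreamReader(s.getInputStream(), "US-ASCII"));
> >>>>                 String line;
> >>>>                 while ((line = in.readLine()) != null) {
> >>>>                     if (line.startsWith("Zxid:")) {
> >>>>                         return Long.decode(line.split("\\s+")[1]);
> >>>>                     }
> >>>>                 }
> >>>>             }
> >>>>             throw new IllegalStateException("no Zxid line in srvr output");
> >>>>         }
> >>>>
> >>>>         public static void main(String[] args) throws Exception {
> >>>>             long a = readZxid("localhost", 2181);
> >>>>             Thread.sleep(60000);                  // one-minute sample window
> >>>>             long b = readZxid("localhost", 2181);
> >>>>             // low 32 bits of the zxid are the per-epoch txn counter
> >>>>             System.out.println("approx txns/min: "
> >>>>                     + ((b & 0xffffffffL) - (a & 0xffffffffL)));
> >>>>         }
> >>>>     }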
> >>>>
> >>>>
> >>>>>
> >>>>> I can alter the settings as has been discussed, but I worry that I'm just
> >>>>> delaying the same thing from happening again, if I deploy another storm
> >>>>> topology or something.  How can I get the cluster in a state where I can
> >>>>> be confident that it won't crash in a similar way as load increases, or
> >>>>> at least set up some kind of monitoring that will let me know something
> >>>>> is unhealthy?
> >>>>>
> >>>>
> >>>> I think it depends on what your txn rate is; let's measure that first, I
> >>>> guess.
> >>>>
> >>>>
> >>>> -rgs
> >>>>
> >>
> >>
>
>
