From: Raúl Gutiérrez Segalés <...@itevenworks.net>
Subject: Re: entire cluster dies with EOFException
Date: Mon, 07 Jul 2014 16:50:59 GMT
On 7 July 2014 09:39, Aaron Zimmerman <azimmerman@sproutsocial.com> wrote:

> What I don't understand is how the entire cluster could die in such a
> situation.  I was able to load zookeeper locally using the snapshot and 10g
> log file without apparent issue.


Sure, but it's syncing up with the other learners that becomes challenging when
you have either big snapshots or too many txnlogs, right?
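For what it's worth, these are the zoo.cfg knobs that usually matter here; the
values below are only illustrative, not recommendations for your cluster:

    # give learners more time to pull a large snapshot / txn backlog from the leader
    # (initLimit/syncLimit are in ticks of tickTime; values are examples)
    tickTime=2000
    initLimit=30
    syncLimit=10
    # roll snapshots more often so any single txn log stays smaller
    snapCount=50000
    # let ZooKeeper clean up old snapshots/logs instead of letting them pile up
    autopurge.snapRetainCount=5
    autopurge.purgeInterval=6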


>  I can see how large amounts of data could
> cause latency issues in syncing, causing a single worker to die, but how
> would that explain the node's inability to restart?  When the server
> replays the log file, does it have to sync the transactions to other nodes
> while it does so?
>

Given that your txn churn is so big, by the time a restarting server finishes
reading from disk it'll still need to catch up with the quorum; how many txns
have happened by that point? By the way, we use this patch:

https://issues.apache.org/jira/browse/ZOOKEEPER-1804

to measure transaction rate. Do you have any approximation of what your
transaction rate might be?
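If you can't apply that patch right away, you can get a rough number by dumping
one of the txn logs with the LogFormatter class that ships with 3.4.x and
dividing the entry count by the time span it covers; the jar versions and paths
below are just examples for illustration:

    # dump a txn log to text (classpath/paths are examples, adjust to your install)
    java -cp zookeeper-3.4.6.jar:lib/slf4j-api-1.6.1.jar:lib/log4j-1.2.16.jar \
        org.apache.zookeeper.server.LogFormatter \
        /var/lib/zookeeper/version-2/log.100000001 > txns.txt
    # one line per txn, each with a timestamp; txns/sec is roughly
    # (number of lines) / (seconds between first and last timestamp)
    wc -l txns.txt
    head -2 txns.txt
    tail -1 txns.txt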


>
> I can alter the settings as has been discussed, but I worry that I'm just
> delaying the same thing from happening again, if I deploy another storm
> topology or something.  How can I get the cluster in a state where I can be
> confident that it won't crash in a similar way as load increases, or at
> least set up some kind of monitoring that will let me know something is
> unhealthy?
>

I think it depends on what your txn rate is; let's measure that first, I guess.
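As for monitoring, the mntr four-letter word (available since 3.4.0) is a cheap
thing to poll from cron or your monitoring system in the meantime; the hostname
and port here are placeholders:

    # run against every server in the ensemble
    echo mntr | nc zk1.example.com 2181 | \
        egrep 'zk_server_state|zk_synced_followers|zk_outstanding_requests|zk_avg_latency|zk_approximate_data_size'
    # a leader whose zk_synced_followers drops below a quorum, or steadily
    # growing zk_outstanding_requests / zk_avg_latency, is an early warning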


-rgs
