zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Fast leader election initial delay, is that possible?
Date Thu, 18 Aug 2011 17:13:37 GMT
The thought is that a server would not complain about connection refused or
inability to form a quorum during the first (say) twenty seconds of
operation.

The thesis is that warnings from these causes during that time are spurious.
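
A minimal sketch of that heuristic, for illustration only. It assumes the server can measure its own uptime and that expected failures are demoted to a quieter log level rather than dropped. The class name, method names, and the 20-second default are hypothetical and are not part of ZooKeeper.

```python
import time


class StartupWarningGate:
    """Chooses a log level for expected connection failures.

    Hypothetical helper, not part of ZooKeeper: failures seen during the
    first `grace_seconds` after this process started are treated as
    probably spurious and demoted from WARN to DEBUG.
    """

    def __init__(self, grace_seconds=20.0, clock=time.monotonic):
        self._clock = clock              # injectable so the gate can be tested
        self._started_at = clock()       # moment this server began operating
        self._grace_seconds = grace_seconds

    def in_grace_period(self):
        # True while local uptime is still below the grace window.
        return (self._clock() - self._started_at) < self._grace_seconds

    def level_for_connect_failure(self):
        # "Connection refused" and "cannot form a quorum" are expected while
        # peers are still booting, so they are demoted rather than dropped.
        return "DEBUG" if self.in_grace_period() else "WARN"
```

The gate only looks at the local server's uptime, so a peer that really did crash during that window would be quieted as well.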

As I mentioned, I don't see this as urgent or even necessarily a good idea.
 I completely reboot a ZK cluster once every year or three.  When I am doing
a rolling upgrade, I *want* to see alerts when I bounce a machine.  If I
don't want to see those alerts, my monitoring system allows me to put a
machine into maintenance mode for a short period of time to temporarily
suppress the warnings.

All I was doing was translating and elaborating the original poster's
suggestion, not so much endorsing it.

On Thu, Aug 18, 2011 at 8:54 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:

> Hi Ted, I don't see how one can automate the distinction between a machine
> that is down because it crashed and a machine that is down because it hasn't
> started yet. Assuming that we are logging the machine unavailability as we
> are doing currently, one can always look at the timestamp of the warning and
> remember that this is the time the machines were bootstrapping.
> Consequently, I don't really see the point of reducing the number of
> warnings, unless the warnings are really polluting the logs. I typically
> don't see so many that they prevent me from reading the rest, but you may have a
> different perception. Also, recall that we back off, so the warnings become
> less frequent over time.
>
> I'm open to ideas, though. If you see anything wrong in my rationale or if
> you have an idea of how to do it differently, then I'd be happy to hear.
> However, if the idea is simply to add a parameter that configures the time
> for leader election to start, then I'm currently not in favor.
>
> -Flavio
>
> On Aug 18, 2011, at 5:39 PM, Ted Dunning wrote:
>
> Flavio,
>
> What you say is correct, but the original poster does have a point that many
> of these warnings are to be expected and there is a heuristic that might
> assist in distinguishing some of these cases so that false alarms in the
> logs could be decreased.
>
> That doesn't seem like a big deal to me, but different people have different
> itches.  In my experience, restarting a ZK cluster from zero almost never
> happens.
>
> On Thu, Aug 18, 2011 at 8:36 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> On Thu, Aug 18, 2011 at 12:15 AM, Sampath Perera <sampath@adroitlogic.com> wrote:
>
> Hhmmm, I think this is a bit different isn't it? Here we know that the first
> server to come will be failing to connect to the other as they are not yet
> up. Anyway our real issue is the warning.
>
> We know that.
>
> But how does the server know that it is the first server?  That is the
> whole point of the leader election.  You might just have a server rejoining
> a cluster.  Or you might have a cluster that has been turned off.  Or a
> cluster with 2 out of 5 machines off and we tried to touch the other down
> machine before the others.
>
> Would you like to suggest a patch?
>
> Of course I do.. will prepare a patch and attach.
>
> Great!
>
>   *flavio*
> *junqueira*
>
> research scientist
>
> fpj@yahoo-inc.com
> direct +34 93-183-8828
>
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300    fax (408) 349 3301
>
>
>
