zookeeper-user mailing list archives

From Sampath Perera <samp...@adroitlogic.com>
Subject Re: Fast leader election initial delay, is that possible?
Date Thu, 18 Aug 2011 16:54:05 GMT
Hi Flavio,

On Thu, Aug 18, 2011 at 9:24 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:

> Hi Ted, I don't see how one can automate the distinction between a machine
> that is down because it crashed and a machine that is down because it hasn't
> started yet. Assuming that we are logging the machine unavailability as we
> are doing currently, one can always look at the timestamp of the warning and
> remember that this is the time the machines were bootstrapping.
> Consequently, I don't really see the point of reducing the number of
> warnings, unless the warnings are really polluting the logs. I typically
> don't see so many that prevents me from reading the rest, but you may have a
> different perception. Also, recall that we back off, so the warnings become
> less frequent over time.
>

True, but one customer deployment has a log-analysis tool that sends
notifications for errors in the log. As you said previously, we cannot
get an optimal value for this timeout, but we can come up with a suboptimal
value that gets rid of this warning.
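
For reference, the back-off Flavio mentions can be pictured as a retry
interval that doubles after each failed attempt until it reaches a cap, so
each later failure (and its warning) arrives after a longer wait. This is
only a minimal sketch: the class name RetryBackoff and the millisecond values
are hypothetical and are not taken from the ZooKeeper source.

```java
/** Hypothetical illustration of a doubling back-off; not ZooKeeper code. */
final class RetryBackoff {
    private final long capMillis;
    private long currentMillis;

    RetryBackoff(long initialMillis, long capMillis) {
        if (initialMillis <= 0 || capMillis < initialMillis) {
            throw new IllegalArgumentException("need 0 < initial <= cap");
        }
        this.currentMillis = initialMillis;
        this.capMillis = capMillis;
    }

    /** Returns the wait before the next attempt, then doubles it up to the cap. */
    long nextDelayMillis() {
        long delay = currentMillis;
        // Stop doubling once the cap is reached (also avoids overflow).
        currentMillis = (currentMillis > capMillis / 2) ? capMillis : currentMillis * 2;
        return delay;
    }
}
```

With an initial delay of 200 ms and a cap of 1000 ms (both made-up numbers),
successive calls return 200, 400, 800, 1000, 1000.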


>
> I'm open to ideas, though. If you see anything wrong in my rationale or if
> you have an idea of how to do it differently, then I'd be happy to hear.
> However, if the idea is simply to add a parameter that configures the time
> for leader election to start, then I'm currently not in favor.
>

Well, what I was originally looking for was a way to delay the leader
election, but as Ted pointed out, I was going to provide a patch for how this
warning is printed. (If you look carefully at Ted's comment and my response, I
was thinking of a timeout that must elapse before the failure is treated as a
warning and printed to the log... at least that is what I understood from
Ted's first comment.) What do you think about that?
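
To make the idea concrete, here is a rough sketch of what I mean by a timeout
on the warning: connection failures seen within some grace period after the
server starts are logged at a lower level, and only later failures are logged
as warnings. Everything here is an assumption for illustration (the class
StartupGraceLogger, the Level enum, and the 30-second value in the usage
below); none of it exists in ZooKeeper today, and leader election itself is
not delayed.

```java
import java.time.Duration;

/** Hypothetical helper, not part of ZooKeeper: picks a log level for connection failures. */
final class StartupGraceLogger {
    enum Level { DEBUG, WARN }

    private final long startNanos;
    private final long graceNanos;

    /**
     * @param startNanos monotonic timestamp (e.g. System.nanoTime()) taken at server start
     * @param grace      how long after start a failed connection is considered expected
     */
    StartupGraceLogger(long startNanos, Duration grace) {
        if (grace.isNegative()) {
            throw new IllegalArgumentException("grace period must not be negative");
        }
        this.startNanos = startNanos;
        this.graceNanos = grace.toNanos();
    }

    /** Failures inside the grace period are DEBUG; anything later is WARN. */
    Level levelFor(long nowNanos) {
        // Subtracting first keeps the comparison correct even if nanoTime wraps.
        return (nowNanos - startNanos) < graceNanos ? Level.DEBUG : Level.WARN;
    }
}
```

A caller would create one instance at startup, for example
new StartupGraceLogger(System.nanoTime(), Duration.ofSeconds(30)), and ask
levelFor(System.nanoTime()) each time a connection to a peer fails. As Flavio
and Ted noted, the server still cannot tell a crashed peer from one that has
not started yet, so any grace value is a heuristic.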


>
> -Flavio
>
> On Aug 18, 2011, at 5:39 PM, Ted Dunning wrote:
>
> Flavio,
>
> What you say is correct, but the original poster does have a point that
> many of these warnings are to be expected and there is a heuristic that
> might assist in distinguishing some of these cases so that false alarms in
> the logs could be decreased.
>
> That doesn't seem like a big deal to me, but different people have
> different itches.  In my experience, restarting a ZK cluster from zero
> almost never happens.
>
> On Thu, Aug 18, 2011 at 8:36 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
>
>
> On Thu, Aug 18, 2011 at 12:15 AM, Sampath Perera <sampath@adroitlogic.com> wrote:
>
>
>
> Hhmmm, I think this is a bit different isn't it? Here we know that the
> first server to come will be failing to connect to the other as they are
> not yet up. Anyway our real issue is the warning.
>
> We know that.
>
> But how does the server know that it is the first server?  That is the
> whole point of the leader election.  You might just have a server rejoining
> a cluster.  Or you might have a cluster that has been turned off.  Or a
> cluster with 2 out of 5 machines off and we tried to touch the other down
> machine before the others.
>
> Would you like to suggest a patch?
>
> Of course I do.. will prepare a patch and attach.
>
> Great!
>
> *flavio*
> *junqueira*
>
> research scientist
>
> fpj@yahoo-inc.com
> direct +34 93-183-8828
>
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300    fax (408) 349 3301
>
>
>


-- 
Thanks,
Sampath
http://adroitlogic.org
