mesos-dev mailing list archives

From "Dominic Hamon (JIRA)" <>
Subject [jira] [Commented] (MESOS-1523) ZooKeeper timeout should be longer
Date Fri, 20 Jun 2014 21:44:24 GMT


Dominic Hamon commented on MESOS-1523:

> ZooKeeper timeout should be longer
> ----------------------------------
>                 Key: MESOS-1523
>                 URL:
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>            Reporter: Dominic Hamon
>            Assignee: Dominic Hamon
> {{zookeeper_init}} relies on name resolution, which can temporarily fail. When {{getaddrinfo}}
> returns {{EAI_AGAIN}}, which normally suggests a retry, ZooKeeper instead returns {{EINVAL}}
> to the calling code. We currently use this as a signal that we should retry.
> However, our timeout is set to 10 seconds. If there are, say, three nameservers and each
> takes fifteen seconds to time out, a single call to {{zookeeper_init}} takes 45 seconds.
> That already exceeds the timeout, so we try only once before aborting.
> To increase resilience in the case of nameserver failure, we should increase this timeout.
> Given that the slave is still able to respond to health checks and tasks are still running,
> this can be quite long. However, we don't want to stay in this state too long, as we want to
> readily observe a more persistent name resolution error.
> As such, ten minutes seems reasonable.
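
To make the arithmetic in the description concrete, here is a rough sketch of how many init attempts fit inside the retry window. It is only a model, not the Mesos retry code: the function name and parameters are hypothetical, and it assumes every attempt fails after waiting on each nameserver in turn and that a new attempt starts whenever the deadline has not yet passed.

```python
def attempts_before_timeout(timeout_secs, nameservers, per_server_secs):
    """Count the init attempts that start before the retry deadline passes.

    Hypothetical model: each attempt blocks for the full resolver timeout
    on every nameserver, then fails and is retried immediately.
    """
    attempt_secs = nameservers * per_server_secs
    elapsed = 0
    attempts = 0
    # A new attempt begins only while we are still inside the retry window.
    while elapsed < timeout_secs:
        attempts += 1
        elapsed += attempt_secs
    return attempts


# Current 10 second timeout: the first 45 second attempt uses up the window.
print(attempts_before_timeout(10, 3, 15))   # 1
# Proposed ten minute timeout: attempts start at 0s, 45s, ..., 585s.
print(attempts_before_timeout(600, 3, 15))  # 14
```

Under these assumptions the ten minute window allows roughly a dozen retries, which gives a transient nameserver outage time to clear while still surfacing a persistent failure.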

This message was sent by Atlassian JIRA
