5x seems like a lot, but what is the functional difference between 5 and 25 ms?
I suspect the underlying problem could be solved a different way, using the guarantees
that ZK already makes.
-m
On Sep 10, 2013, at 3:34 PM, Jeremy Stribling <strib@nicira.com> wrote:
> I mostly agree, but let's assume that a ~5x speedup in detecting those types of failures
> is considered significant for some people. Are there technical reasons that would prevent
> this idea from working?
>
> On 09/10/2013 01:31 PM, Ted Dunning wrote:
>> I don't see the strong value here. A few failures would be detected more
>> quickly, but I am not convinced that this would actually improve
>> functionality significantly.
>>
>>
>> On Tue, Sep 10, 2013 at 1:01 PM, Jeremy Stribling <strib@nicira.com> wrote:
>>
>>> Hi all,
>>>
>>> Let's assume that you wanted to deploy ZK in a virtualized environment,
>>> despite all of the known drawbacks. Assume we could deploy it such that
>>> the ZK servers were all using independent CPUs and storage (though not
>>> dedicated disks). Obviously, the shared disks (shared with other, non-ZK
>>> VMs on the same hypervisor) will cause ZK to hit the default session
>>> timeout occasionally, so you would need to raise the existing session
>>> timeout to something like 30 seconds.
>>>
>>> I'm curious if there would be any technical drawbacks to adding an
>>> additional heartbeat mechanism between the clients and the servers, which
>>> would have the goal of detecting network-only failures faster than the
>>> existing heartbeat mechanism. The idea is that there would be a new thread
>>> dedicated to processing these heartbeats, which would not get blocked on
>>> I/O. Then the clients could configure a second, smaller timeout value, and
>>> it would be assumed that any such timeout indicated a real problem. The
>>> existing mechanism would still be in place to catch I/O-related errors.
>>>
>>> I understand the philosophy that there should be some heartbeat mechanism
>>> that takes the disk into account, but I'm having trouble coming up with
>>> technical reasons not to add a second mechanism. Obviously, the advantage
>>> would be that the clients could detect network failures and system crashes
>>> more quickly in an environment with slow disks, and fail over to other
>>> servers more quickly. The only disadvantages I can come up with are:
>>>
>>> 1) More code complexity, and slightly more heartbeat traffic on the wire
>>> 2) I think the servers have to log session expirations to disk, so if the
>>> sessions expire at a faster rate than the disk can handle, it might lead to
>>> a large backlog.
>>>
>>> Are there other drawbacks I am missing? Would a patch that added
>>> something like this be considered, or is it dead from the start? Thanks,
>>>
>>> Jeremy
>
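
For reference, the two-timeout idea proposed in the original message could be sketched roughly as below. This is only an illustration of the proposal, not ZooKeeper code: the class `DualTimeoutDetector`, its method names, and the status strings are all hypothetical, and the 6-second fast timeout is an assumed value chosen to match the "~5x" figure against the 30-second session timeout mentioned in the thread.

```python
import time


class DualTimeoutDetector:
    """Tracks two independent liveness signals for one server connection.

    Hypothetical illustration of the proposal; not part of ZooKeeper.
    """

    def __init__(self, fast_timeout, session_timeout, clock=time.monotonic):
        if fast_timeout >= session_timeout:
            raise ValueError("fast_timeout must be smaller than session_timeout")
        self.fast_timeout = fast_timeout
        self.session_timeout = session_timeout
        self._clock = clock
        now = clock()
        self._last_fast_ack = now      # last reply on the network-only path
        self._last_session_ack = now   # last reply on the existing, disk-aware path

    def on_fast_ack(self):
        """Record a reply from the heartbeat thread that never blocks on I/O."""
        self._last_fast_ack = self._clock()

    def on_session_ack(self):
        """Record a reply from the existing heartbeat, which may wait on disk."""
        self._last_session_ack = self._clock()

    def status(self):
        """Classify the connection based on which timeout, if any, has elapsed."""
        now = self._clock()
        if now - self._last_fast_ack > self.fast_timeout:
            # The fast path is silent: treated as a network failure or crash.
            return "network_failure"
        if now - self._last_session_ack > self.session_timeout:
            # The fast path is alive but the slow path is not: likely I/O trouble.
            return "session_timeout"
        return "ok"


# Assumed values: 6 s fast timeout, 30 s session timeout.
detector = DualTimeoutDetector(fast_timeout=6.0, session_timeout=30.0)
print(detector.status())
```

A client would call `status()` periodically and fail over to another server on `"network_failure"`, while still relying on the longer timeout to catch I/O-related stalls. The sketch says nothing about the server side, where the concern about logging session expirations to disk would still apply.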