cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift
Date Thu, 05 Feb 2015 00:29:35 GMT


Benedict commented on CASSANDRA-8732:

[~aweisberg] basically, yes. Since we're mostly dealing with application-induced delay (the receiving
or sending server being overloaded, in GC, etc.), this seems a pretty reasonable tradeoff.
Its only goal is to avoid wasting work, and in the event of a major network blip we're probably
not starved for local resources. Of course a GC pause slowing down receipt would be perceived
the same way, and is likely exactly the kind of scenario we want to shed timed-out messages for.
I'm sure there's a simple further tweak to help guard against this.

Let's assume on the recipient we have:

source node wallclock: S
message timeout delta at send time: T
recipient node wallclock at receipt: R
recipient node default timeout: D

Then let's say we calculate S+T and min(R+T, S+2T), and take whichever is closest to R+(D/2).

This helps guard against significant network delay or GC pauses being undercounted, especially
on queries that were close to timeout anyway (e.g. due to slow processing on the source node),
by capping our forgiveness of clock skew to twice the message's remaining timeout when sent.

This is just a quick, hand-wavy suggestion; it's quite possible there's another, better approach
along the same lines. It retains the simplicity, which is the important thing. We could perhaps
make the cap S+xT, and have x be a configurable parameter for power users.
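As a rough illustration only, the heuristic above (with the configurable cap x) might be sketched as follows. All names and the choice of millisecond wallclock inputs are assumptions for the sake of the example, not anything from a patch:

```java
// Hypothetical sketch of the expiration heuristic discussed above.
//   S: source node wallclock at send time
//   T: message timeout delta at send time
//   R: recipient node wallclock at receipt
//   D: recipient node default timeout
//   x: configurable skew-forgiveness multiplier (2 in the comment above)
public final class ExpirationHeuristic
{
    public static long expiresAt(long S, long T, long R, long D, long x)
    {
        long senderView = S + T;                  // expiry as the sender's clock sees it
        long capped = Math.min(R + T, S + x * T); // cap forgiveness of clock skew
        long target = R + D / 2;                  // recipient's "sane" midpoint
        // take whichever candidate is closest to the recipient's expectation
        return Math.abs(senderView - target) <= Math.abs(capped - target)
             ? senderView : capped;
    }
}
```

When the clocks agree, both candidates coincide at S+T; when the source clock is far ahead, the R+T term keeps the deadline anchored to the recipient's own clock rather than trusting the skewed sender timestamp.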

> Make inter-node timeouts tolerate clock skew and drift
> ------------------------------------------------------
>                 Key: CASSANDRA-8732
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
> Right now internode timeouts rely on currentTimeMillis() (and NTP) to make sure that
> tasks don't expire before they arrive.
> Every receiver needs to deduce the offset between its nanoTime and the remote nanoTime.
> I don't think currentTimeMillis is a good choice because it is designed to be manipulated
> by operators and NTP. I would probably be comfortable assuming that nanoTime isn't going to
> move in significant ways without something that could be classified as operator error happening.
> I suspect the one timing method you can rely on being accurate is nanoTime within a node
> (on average), and that a node can report on its own scheduling jitter (on average).
> Finding the offset requires knowing what the network latency is in one direction.
> One way to do this would be to periodically send a ping request which generates a series
> of ping responses at fixed intervals (maybe by UDP?). The responses should be corrected for
> scheduling jitter, since the fixed intervals may not be exactly achieved by the sender. By
> measuring the time deviation between ping responses and their expected arrival time (based
> on the interval) and correcting for the remotely reported scheduling jitter, you should be
> able to measure latency in one direction.
> A weighted moving average (only correct for drift, not readjustment) of these measurements
> would eventually converge on a close answer and would not be impacted by outlier measurements.
> It may also make sense to drop the largest N samples to improve accuracy.
> Once you know the network latency you can add it to the timestamp of each ping, compare
> that to the local clock, and know what the offset is.
> These measurements won't calculate the offset to be too small (timeouts fire early),
> but could calculate the offset to be too large (timeouts fire late). The conditions where
> the offset won't be accurate are the conditions where you also want timeouts firing reliably.
> This and bootstrapping in bad conditions is what I am most uncertain of.

This message was sent by Atlassian JIRA
