cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11547) Add background thread to check for clock drift
Date Fri, 22 Apr 2016 14:03:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253982#comment-15253982 ]

Robert Stupp commented on CASSANDRA-11547:
------------------------------------------

bq. strong warning or even freeze

I'm not excited about freezing a node when some {{if (clockDrift > X)}} check triggers. This can
(and in most installations will) lead to a complete outage of the cluster.

bq. warning ... out of sync with the majority of the cluster

Is it the majority (quorum?) of all nodes, of all live nodes, of all reachable nodes? I think
that is way too complicated.

Issuing a warning as in this patch is absolutely fine IMO. If someone wants to freeze a node
when such a warning is issued, that is still possible by monitoring the log file. It is also
possible to send an alert the same way (many people already monitor the log file for errors
and warnings).

> Add background thread to check for clock drift
> ----------------------------------------------
>
>                 Key: CASSANDRA-11547
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11547
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: clocks, time
>
> The system clock has the potential to drift while a system is running. As a simple way
> to check if this occurs, we can run a background thread that wakes up every n seconds, reads
> the system clock, and checks to see if, indeed, n seconds have passed.
> * If the clock's current time is less than the last recorded time (captured n seconds
> in the past), we know the clock has jumped backward.
> * If fewer than n seconds have elapsed, we know the system clock is running slow or has
> moved backward (by a value less than n).
> * If between n and (n + a small offset) seconds have elapsed, we can assume we are within
> an acceptable window of clock movement. The offset is there because the clock-checking
> thread might not have been scheduled on time, garbage collection may have paused it, and so on.
> * If more than (n + a small offset) seconds have elapsed, we can assume the clock
> jumped forward.
> In the unhappy cases, we can write a message to the log and increment some metric that
> the user's monitoring systems can trigger/alert on.
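
The four cases in the description could be sketched roughly as follows. This is a minimal, hypothetical illustration and not the code from the patch: the class name ClockDriftCheck, the Result enum, the start() helper and the interval/offset parameters are all assumptions made for the example, and it only prints a warning instead of incrementing a metric.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and thresholds are illustrative, not taken from the patch.
final class ClockDriftCheck {
    enum Result { JUMPED_BACKWARD, SLOW_OR_SMALL_BACKWARD, OK, JUMPED_FORWARD }

    // Compares two wall-clock readings taken (nominally) intervalMillis apart.
    static Result classify(long lastMillis, long nowMillis, long intervalMillis, long offsetMillis) {
        if (nowMillis < lastMillis)
            return Result.JUMPED_BACKWARD;          // clock is behind the previous reading
        long elapsed = nowMillis - lastMillis;
        if (elapsed < intervalMillis)
            return Result.SLOW_OR_SMALL_BACKWARD;   // fewer than n seconds appear to have passed
        if (elapsed <= intervalMillis + offsetMillis)
            return Result.OK;                       // within n .. n + offset
        return Result.JUMPED_FORWARD;               // more than n + offset
    }

    // Starts a daemon thread that wakes up every intervalMillis and checks the wall clock.
    static ScheduledExecutorService start(long intervalMillis, long offsetMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread t = new Thread(runnable, "clock-drift-check");
            t.setDaemon(true);
            return t;
        });
        long[] last = { System.currentTimeMillis() };
        // Fixed delay (not fixed rate) so that a late run is not followed by an early one.
        executor.scheduleWithFixedDelay(() -> {
            long now = System.currentTimeMillis();
            Result result = classify(last[0], now, intervalMillis, offsetMillis);
            if (result != Result.OK)
                System.err.println("WARN clock drift check: " + result
                                   + " (observed " + (now - last[0]) + " ms, expected ~" + intervalMillis + " ms)");
            last[0] = now;
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

Something like ClockDriftCheck.start(1000, 100) would then check once per second with a 100 ms tolerance. The sketch relies on the scheduler's delay being driven by a monotonic timer while System.currentTimeMillis() follows the wall clock, so a difference between the two shows up as one of the non-OK results.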



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
