asterixdb-dev mailing list archives

From Murtadha Hubail <hubail...@gmail.com>
Subject Re: Stability Q
Date Wed, 02 May 2018 14:34:40 GMT
I checked the logs, and this was basically caused by the sleep/wake-up. On the CC, we keep
a timestamp of the last heartbeat received from each NC and compare it against the current
system time to decide whether the NC has missed enough heartbeats to be considered dead. We
perform this check every 10 seconds on the CC. Whenever you wake your Mac after a sleep
period longer than the maximum allowed heartbeat miss, and the CC monitoring task runs
before the NC processes resume sending heartbeats, this can happen, because the timestamp
in the CC's memory is still the time of the last heartbeat received before you put your
Mac to sleep.
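To make the failure mode concrete, here is a minimal sketch of the kind of staleness check described above. All names (HeartbeatMonitor, recordHeartbeat, isConsideredDead) are illustrative, not the actual AsterixDB/Hyracks classes:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a CC-side heartbeat monitor; names are illustrative.
public class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long maxHeartbeatMissMillis;

    public HeartbeatMonitor(long maxHeartbeatMissMillis) {
        this.maxHeartbeatMissMillis = maxHeartbeatMissMillis;
    }

    // Called whenever a heartbeat arrives from an NC.
    public void recordHeartbeat(String nodeId, long nowMillis) {
        lastHeartbeat.put(nodeId, nowMillis);
    }

    // Periodic check (e.g. every 10 seconds): an NC whose last recorded
    // heartbeat is older than the allowed window is considered dead.
    // After a laptop sleeps, "now" jumps forward while the stored timestamp
    // does not, producing the false positive described in this message.
    public boolean isConsideredDead(String nodeId, long nowMillis) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || nowMillis - last > maxHeartbeatMissMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor m = new HeartbeatMonitor(50_000); // 5 misses at 10s
        m.recordHeartbeat("nc1", 0);
        System.out.println(m.isConsideredDead("nc1", 10_000)); // within window
        System.out.println(m.isConsideredDead("nc1", 60_001)); // past window
    }
}
```

A sleep of an hour means the next check sees a gap of ~3,600,000 ms, far past any reasonable window, even though the NC process is alive and merely suspended.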
We have implemented a mechanism on the current master to reduce the possibility of such
false-positive heartbeat misses. The CC now attempts to contact the NC and ask it to shut
down, so if the NC is actually still alive, the NC Service process is supposed to restart
the NC process, which will cause it to rejoin the cluster, and the cluster will become
active again. However, our NC Service currently doesn't restart the NC process, though I
think we should change that.
Another option that reduces, but does not eliminate, this issue is to increase the maximum
heartbeat miss to something very large (e.g., 24 hours). That might be suitable for a
playground environment such as a Mac or PC, but it is not an ideal out-of-the-box
configuration for a cluster deployment.
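For anyone who wants to try that workaround, the heartbeat settings live in the [cc] section of the configuration file. The option names below are my recollection of the current config options and should be verified against your build before use:

```ini
; Hypothetical example; verify option names against your AsterixDB version.
[cc]
; interval between expected NC heartbeats, in milliseconds
heartbeat.period = 10000
; consecutive missed heartbeats before an NC is declared dead
; (8640 misses at a 10s period is roughly 24 hours)
heartbeat.max.misses = 8640
```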

On 05/02/2018, 10:21 AM, "Mike Carey" <dtabass@gmail.com> wrote:

    Let me know what it turns out to be!
    
    
    On 5/1/18 12:31 AM, Murtadha Hubail wrote:
    > This is most likely caused by a missed heartbeat from the NC to the CC. Some macOS
    > versions had issues with reestablishing connected sockets after waking up from sleep.
    > But it could also be some unexpected exception that caused the NC to shut down. If
    > you could share the logs with me, I can tell you for sure.
    >
    > Cheers,
    > Murtadha
    >
    > On 05/01/2018, 9:06 AM, "Michael Carey" <mjcarey@ics.uci.edu> wrote:
    >
    >      Q:  Do we maybe have a stability regression in recent versions (e.g.,
    >      the one leading to the UW snapshot)?  They have occasionally seen things
    >      like this, and I just did too.  (The system had been running for a while
    >      in the background on my Mac - e.g., for a day or so.)
    >      
    >      Error: Cluster is in UNUSABLE state.
    >        One or more Node Controllers have left or haven't joined yet.
    >      
    >      
    >
    >
    
    


