asterixdb-notifications mailing list archives

From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (ASTERIXDB-1076) False failures cause denying new queries
Date Tue, 24 Oct 2017 22:04:00 GMT


ASF subversion and git services commented on ASTERIXDB-1076:

Commit 8734988fae1d32dce74ec9fff5cee0dd5c2e5d15 in asterixdb's branch refs/heads/master

[ASTERIXDB-1076][HYR] Prevent node death false positives

- Measure actual time since last heartbeat touched, not based on number
  of dead cycle detections since last heartbeat received
- Update heartbeat touch on job result received, in addition to when
  heartbeat data is received
- Minor refactoring in NC/CC config

Change-Id: Idb1abcc2b783b192b88ed988d398fcfe763531e9
Sonar-Qube: Jenkins <>
Tested-by: Jenkins <>
Contrib: Jenkins <>
Integration-Tests: Jenkins <>
Reviewed-by: Ian Maxon <>
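
The first two points of the commit message (time-based detection, and refreshing the touch time on job results as well as heartbeats) can be illustrated with a minimal sketch. This is not the Hyracks code: the class HeartbeatTracker, its methods, and the millisecond threshold parameter are hypothetical names chosen for illustration. It assumes a monotonic nanosecond clock, which is injectable so the behavior can be checked without sleeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch of time-based node liveness; not the actual Hyracks implementation. */
class HeartbeatTracker {
    private final long maxSilenceNanos;   // configurable threshold (assumed parameter)
    private final LongSupplier nanoClock; // monotonic clock, injectable for testing
    private final Map<String, Long> lastTouch = new ConcurrentHashMap<>();

    HeartbeatTracker(long maxSilenceMillis, LongSupplier nanoClock) {
        this.maxSilenceNanos = TimeUnit.MILLISECONDS.toNanos(maxSilenceMillis);
        this.nanoClock = nanoClock;
    }

    HeartbeatTracker(long maxSilenceMillis) {
        this(maxSilenceMillis, System::nanoTime);
    }

    /** Record that the node was heard from just now. */
    private void touch(String nodeId) {
        lastTouch.put(nodeId, nanoClock.getAsLong());
    }

    /** Heartbeat data arrived from the node. */
    void onHeartbeat(String nodeId) {
        touch(nodeId);
    }

    /** A job result arrived from the node; this is also evidence that it is alive. */
    void onJobResult(String nodeId) {
        touch(nodeId);
    }

    /**
     * A node counts as dead only if the elapsed time since its last touch exceeds
     * the threshold, regardless of how many detection cycles ran in between.
     * Nodes that were never touched are not tracked and are not reported as dead.
     */
    boolean isDead(String nodeId) {
        Long last = lastTouch.get(nodeId);
        if (last == null) {
            return false;
        }
        return nanoClock.getAsLong() - last > maxSilenceNanos;
    }
}
```

Comparing elapsed time rather than counting detection cycles means that a detector thread running late or in bursts on a saturated machine does not by itself push a node over the threshold.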

> False failures cause denying new queries
> ----------------------------------------
>                 Key: ASTERIXDB-1076
>                 URL:
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: HYR - Hyracks
>            Reporter: Yingyi Bu
>            Assignee: Michael Blow
>              Labels: soon
> When CPUs in the cluster are saturated with computation, heartbeats from slave nodes to the
> master node might be delayed. In this case, the master node concludes that a node has failed
> and can no longer add the node back. Hence, the entire cluster becomes unusable and an
> instance restart is needed.
> Two things need to be fixed:
> 1. (at least) expose AsterixDB configuration parameters to allow users to set a large
> heartbeat threshold;
> 2. allow a node to leave and re-join a Hyracks cluster.
> In the long term, we might need to investigate better liveness check strategies.
> To reproduce the issue, overload the slave nodes' CPUs.
> The exception "Asterix Cluster Global recovery is not yet complete and The system is
> in ACTIVE state" will be thrown for subsequent queries.

This message was sent by Atlassian JIRA
