asterixdb-dev mailing list archives

From "Yingyi Bu (JIRA)" <>
Subject [jira] [Commented] (ASTERIXDB-1076) False failures cause denying new queries
Date Fri, 11 Sep 2015 22:28:45 GMT


Yingyi Bu commented on ASTERIXDB-1076:

WorkQueue maintains all of the threads that process cluster management
events, but it does not handle heartbeat processing. Those management events
may deserve a high priority, though NORM_PRIORITY is probably OK.
Real data processing operators are run in,  where we already set their priority to
be Thread.MIN_PRIORITY  (line 270).

Heartbeat processing is separated in (line 294):
timer.schedule(heartbeatTask, 0, nodeParameters.getHeartbeatPeriod());
I guess we can define our own timer thread, set its priority to
Thread.MAX_PRIORITY, and see if that works.
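
A minimal sketch of the dedicated timer thread idea, not the actual Hyracks code. It assumes the java.util.Timer can be swapped for a ScheduledExecutorService whose single worker thread is created at Thread.MAX_PRIORITY. The class name HeartbeatScheduler, the thread name, the 100 ms period, and the printing task are all hypothetical stand-ins; in the real code the period would come from nodeParameters.getHeartbeatPeriod().

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class HeartbeatScheduler {

    // Hypothetical helper: a single-threaded scheduler whose worker
    // thread is created with the highest Java thread priority.
    public static ScheduledExecutorService newHighPriorityScheduler() {
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, "heartbeat-timer");
            t.setDaemon(true);
            t.setPriority(Thread.MAX_PRIORITY);
            return t;
        };
        return Executors.newSingleThreadScheduledExecutor(factory);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = newHighPriorityScheduler();

        // Stand-in for nodeParameters.getHeartbeatPeriod() (milliseconds).
        long heartbeatPeriodMs = 100;

        // Stand-in for the real heartbeat task.
        Runnable heartbeatTask = () -> System.out.println(
                "heartbeat from thread with priority "
                        + Thread.currentThread().getPriority());

        // Fixed-delay scheduling, like Timer.schedule(task, 0, period).
        scheduler.scheduleWithFixedDelay(
                heartbeatTask, 0, heartbeatPeriodMs, TimeUnit.MILLISECONDS);

        Thread.sleep(350);
        scheduler.shutdownNow();
    }
}
```

Java thread priorities are only hints to the operating system scheduler, so whether this actually keeps heartbeats on time under CPU saturation would still need to be measured.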


On Fri, Sep 11, 2015 at 3:00 PM, Till Westmann (JIRA) <>

> False failures cause denying new queries
> ----------------------------------------
>                 Key: ASTERIXDB-1076
>                 URL:
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB
>            Reporter: Yingyi Bu
>            Priority: Critical
> When the CPUs in the cluster are saturated by computation, heartbeats from slave nodes
> to the master node might get delayed. In this case, the master node thinks a node has failed
> and can no longer add the node back. Hence, the entire cluster becomes unusable and an instance
> restart is needed.
> Two things need to be fixed:
> 1.  (at least) expose AsterixDB configuration parameters to allow users to set a large
> heartbeat threshold;
> 2.  allow a node to leave and re-join a hyracks cluster.
> In the long term, we might need to investigate better liveness check strategies.
> To reproduce the issue, just overload the slave nodes' CPUs.
> The exception "Asterix Cluster Global recovery is not yet complete and The system is
> in ACTIVE state" will be thrown for upcoming queries.
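
The two fixes listed in the issue can be illustrated with a small sketch. It is not the Hyracks implementation: the class LivenessTracker, its method names, and the idea of expressing the threshold as a number of missed heartbeat periods are all assumptions made for illustration. A node counts as alive while its last heartbeat is within the configured threshold, and a late heartbeat simply makes it alive again, so it is not shut out permanently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LivenessTracker {

    // Longest allowed gap between heartbeats, in milliseconds.
    private final long maxSilenceMs;

    // Timestamp of the most recent heartbeat seen from each node.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Both parameters are hypothetical, user-configurable settings.
    public LivenessTracker(long heartbeatPeriodMs, int maxMissedHeartbeats) {
        this.maxSilenceMs = heartbeatPeriodMs * maxMissedHeartbeats;
    }

    // Record a heartbeat; a node that was considered dead re-joins here.
    public void onHeartbeat(String nodeId, long nowMs) {
        lastHeartbeat.put(nodeId, nowMs);
    }

    // A node is alive if it has reported within the allowed silence window.
    public boolean isAlive(String nodeId, long nowMs) {
        Long last = lastHeartbeat.get(nodeId);
        return last != null && nowMs - last <= maxSilenceMs;
    }
}
```

With a 1000 ms period and a threshold of 5 missed heartbeats, a node that last reported at time 0 is still alive at 4000 ms, is considered failed at 6000 ms, and becomes alive again as soon as a delayed heartbeat arrives.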

