asterixdb-dev mailing list archives

From "Yingyi Bu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1076) False failures cause denying new queries
Date Fri, 11 Sep 2015 22:28:45 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741637#comment-14741637 ]

Yingyi Bu commented on ASTERIXDB-1076:
--------------------------------------

WorkQueue maintains all the threads that process cluster management events,
but that doesn't include heartbeat processing. Those management events may
deserve a high priority, though NORM_PRIORITY is probably OK.
The actual data processing operators run in
org.apache.hyracks.control.nc.Task, where we already set their priority to
Thread.MIN_PRIORITY (line 270).

Heartbeat processing is handled separately in
org.apache.hyracks.control.nc.NodeControllerService (line 294):
timer.schedule(heartbeatTask, 0, nodeParameters.getHeartbeatPeriod());
I guess we can define our own timer thread, set it to MAX_PRIORITY,
and see if it works.
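
A minimal sketch of that idea, assuming a ScheduledExecutorService with a custom ThreadFactory stands in for the java.util.Timer, which does not expose its thread's priority. The class and method names (HeartbeatScheduler, highPriorityFactory, newHeartbeatScheduler) and the 100 ms period are hypothetical and not part of Hyracks; thread priority is also only a hint to the OS scheduler, so this may not help on every platform.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class HeartbeatScheduler {

    // Hypothetical helper (not Hyracks code): creates daemon threads that
    // request Thread.MAX_PRIORITY. The priority is only a hint to the OS.
    public static ThreadFactory highPriorityFactory(String threadName) {
        return runnable -> {
            Thread t = new Thread(runnable, threadName);
            t.setDaemon(true);
            t.setPriority(Thread.MAX_PRIORITY);
            return t;
        };
    }

    // Single-threaded scheduler dedicated to sending heartbeats.
    public static ScheduledExecutorService newHeartbeatScheduler() {
        return Executors.newSingleThreadScheduledExecutor(highPriorityFactory("heartbeat"));
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = newHeartbeatScheduler();
        long periodMillis = 100; // stands in for nodeParameters.getHeartbeatPeriod()
        Runnable heartbeatTask = () -> System.out.println(
                "heartbeat sent at priority " + Thread.currentThread().getPriority());
        // Same shape as timer.schedule(heartbeatTask, 0, period), on our own thread.
        scheduler.scheduleAtFixedRate(heartbeatTask, 0, periodMillis, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        scheduler.shutdownNow();
    }
}
```

Whether this actually keeps heartbeats on time under CPU saturation would still need to be measured on a loaded cluster.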

Best,
Yingyi

On Fri, Sep 11, 2015 at 3:00 PM, Till Westmann (JIRA) <jira@apache.org> wrote:

> False failures cause denying new queries
> ----------------------------------------
>
>                 Key: ASTERIXDB-1076
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB
>            Reporter: Yingyi Bu
>            Priority: Critical
>
> When the CPUs in the cluster are saturated with computation, heartbeats from slave nodes
> to the master node might get delayed. In this case, the master node thinks the node has failed
> and can no longer add the node back. Hence, the entire cluster becomes unusable and an instance
> restart is needed.
> Two things need to be fixed:
> 1. (at least) expose AsterixDB configuration parameters to allow users to set a large
>    heartbeat threshold;
> 2. allow a node to leave and re-join a Hyracks cluster.
> In the long term, we might need to investigate better liveness check strategies.
> To reproduce the issue, overload the slave nodes' CPUs.
> The exception "Asterix Cluster Global recovery is not yet complete and The system is
> in ACTIVE state" will be thrown for subsequent queries.
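
The two fixes listed in the issue can be illustrated with a small sketch: a liveness check with a configurable threshold, where a node that resumes sending heartbeats is simply treated as alive again. This is an illustration under stated assumptions, not the AsterixDB or Hyracks implementation; HeartbeatMonitor, its constructor parameters, and the idea of expressing the threshold as a number of missed heartbeat periods are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class HeartbeatMonitor {
    private final long heartbeatPeriodMillis;
    private final int maxMissedHeartbeats; // hypothetical user-configurable threshold
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();

    public HeartbeatMonitor(long heartbeatPeriodMillis, int maxMissedHeartbeats) {
        this.heartbeatPeriodMillis = heartbeatPeriodMillis;
        this.maxMissedHeartbeats = maxMissedHeartbeats;
    }

    // Called whenever a heartbeat arrives. A node that was considered dead
    // becomes alive again as soon as it reports in (leave and re-join).
    public synchronized void recordHeartbeat(String nodeId, long nowMillis) {
        lastHeartbeatMillis.put(nodeId, nowMillis);
    }

    // A node is alive if its last heartbeat is no older than
    // heartbeatPeriodMillis * maxMissedHeartbeats. Unknown nodes are not alive.
    public synchronized boolean isAlive(String nodeId, long nowMillis) {
        Long last = lastHeartbeatMillis.get(nodeId);
        if (last == null) {
            return false;
        }
        return nowMillis - last <= heartbeatPeriodMillis * maxMissedHeartbeats;
    }
}
```

With a larger maxMissedHeartbeats, a heartbeat delayed by CPU saturation is less likely to be mistaken for a node failure, at the cost of detecting real failures later.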



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
