asterixdb-dev mailing list archives

From "Yingyi Bu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ASTERIXDB-1076) False failures cause denying new queries
Date Tue, 18 Aug 2015 21:06:45 GMT

     [ https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Yingyi Bu updated ASTERIXDB-1076:
---------------------------------
    Description: 
When the CPUs in the cluster are saturated with computation, heartbeats from slave nodes to
the master node might be delayed. In this case, the master node concludes that a node has
failed and can no longer add the node back. Hence, the entire cluster becomes unusable and
an instance restart is needed.

Two things need to be fixed:
1.  (at least) expose AsterixDB configuration parameters to allow users to set a large heartbeat
threshold;
2.  allow a node to leave and re-join a hyracks cluster.

In the long term, we might need to investigate better liveness check strategies.
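
A minimal sketch of the two fixes listed above, assuming a simple timeout-based liveness
check on the master. Everything here is hypothetical and for illustration only: the class
HeartbeatMonitor, the parameters max_missed_heartbeats and heartbeat_period_s, and the node
ids are not actual AsterixDB or Hyracks names.

```python
import time


class HeartbeatMonitor:
    """Toy master-side liveness tracker (hypothetical; not Hyracks code)."""

    def __init__(self, max_missed_heartbeats=5, heartbeat_period_s=10.0,
                 clock=time.monotonic):
        # Fix 1: the failure threshold is configurable, so users can raise it
        # on clusters where CPU saturation delays heartbeats.
        self.timeout_s = max_missed_heartbeats * heartbeat_period_s
        self.clock = clock
        self.last_seen = {}   # node id -> time of the last heartbeat
        self.dead = set()     # nodes currently considered failed

    def heartbeat(self, node_id):
        """Record a heartbeat; return True if the node was re-admitted."""
        # Fix 2: a heartbeat from a node marked as failed lets it re-join
        # instead of leaving the cluster unusable until a restart.
        rejoined = node_id in self.dead
        self.dead.discard(node_id)
        self.last_seen[node_id] = self.clock()
        return rejoined

    def sweep(self):
        """Mark nodes whose last heartbeat is older than the threshold."""
        now = self.clock()
        for node_id, seen in self.last_seen.items():
            if node_id not in self.dead and now - seen > self.timeout_s:
                self.dead.add(node_id)
        return set(self.dead)

    def cluster_usable(self):
        """The cluster accepts queries only while no known node is failed."""
        return bool(self.last_seen) and not self.dead
```

With a larger threshold, a heartbeat that is merely delayed no longer counts as a failure,
and a node that was wrongly marked as failed is re-admitted by its next heartbeat.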


To reproduce the issue, overload the slave nodes' CPUs.
The exception " Asterix Cluster Global recovery is not yet complete and The system is in ACTIVE
state" will then be thrown for subsequent queries.

  was:
When the CPUs in the cluster are saturated with computation, heartbeats from slave nodes to
the master node might be delayed. In this case, the master node concludes that a node has
failed and can no longer add the node back. Hence, the entire cluster becomes unusable and
an instance restart is needed.

Two things need to be fixed:
1.  (at least) expose AsterixDB configuration parameters to allow users to set a large heartbeat
threshold;
2.  allow a node to leave and re-join a hyracks cluster.

In the long term, we might need to investigate better liveness check strategies.


To reproduce the issue, overload the slave nodes' CPUs.

        Summary: False failures cause denying new queries  (was: False failures triggers denying
new queries)

> False failures cause denying new queries
> ----------------------------------------
>
>                 Key: ASTERIXDB-1076
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB
>            Reporter: Yingyi Bu
>            Priority: Critical
>
> When the CPUs in the cluster are saturated with computation, heartbeats from slave nodes
> to the master node might be delayed. In this case, the master node concludes that a node
> has failed and can no longer add the node back. Hence, the entire cluster becomes unusable
> and an instance restart is needed.
> Two things need to be fixed:
> 1.  (at least) expose AsterixDB configuration parameters to allow users to set a large
> heartbeat threshold;
> 2.  allow a node to leave and re-join a hyracks cluster.
> In the long term, we might need to investigate better liveness check strategies.
> To reproduce the issue, overload the slave nodes' CPUs.
> The exception " Asterix Cluster Global recovery is not yet complete and The system is
> in ACTIVE state" will then be thrown for subsequent queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
