asterixdb-dev mailing list archives

From "Ian Maxon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1076) False failures cause denying new queries
Date Sat, 12 Sep 2015 00:03:46 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741761#comment-14741761 ]

Ian Maxon commented on ASTERIXDB-1076:
--------------------------------------

Oh, it's good that the heartbeats are at least not stuck in the big ol' WorkQueue. I was under
the impression that they were.

For addressing 1), the parameters for controlling the heartbeat interval exist in Hyracks, but
they're command-line args to the CC. So it actually is possible to change them: you just put
them in the normal place where -Xmx and so on belong in asterix-configuration.xml (I think,
haven't tried... :) ).
It'd probably be easier/clearer to migrate them to be their own attributes in that file; otherwise
it's kind of impossible to tell that the option exists in the first place.

> False failures cause denying new queries
> ----------------------------------------
>
>                 Key: ASTERIXDB-1076
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB
>            Reporter: Yingyi Bu
>            Assignee: Yingyi Bu
>            Priority: Critical
>
> When the CPUs in the cluster are saturated with computation, the heartbeats from slave nodes
to the master node might get delayed. In this case, the master node thinks a node has failed
and can no longer add the node back. Hence, the entire cluster is unusable and an instance
restart is needed.
> Two things need to be fixed:
> 1.  (at least) expose AsterixDB configuration parameters to allow users to set a larger
heartbeat threshold;
> 2.  allow a node to leave and re-join a Hyracks cluster.
> In the long term, we might need to investigate better liveness check strategies.
> To reproduce the issue, just overload the slave nodes' CPUs.
> The exception "Asterix Cluster Global recovery is not yet complete and The system is
in ACTIVE state" will then be thrown for incoming queries.
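
To illustrate the failure mode described above, here is a minimal sketch of a threshold-based heartbeat check. It is not the Hyracks implementation; the class HeartbeatMonitor, its fields (heartbeat_period_ms, max_missed, allow_rejoin), and the simulate helper are all hypothetical names made up for this example. It only shows why a late heartbeat can leave the cluster permanently unusable, and how a larger threshold or a re-join path would each avoid that.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy liveness tracker (hypothetical; not the Hyracks code)."""
    heartbeat_period_ms: int      # assumed: expected interval between heartbeats
    max_missed: int               # assumed: missed periods tolerated before failure
    allow_rejoin: bool = False    # assumed: whether a "dead" node may come back
    last_seen_ms: dict = field(default_factory=dict)
    dead: set = field(default_factory=set)

    def on_heartbeat(self, node: str, now_ms: int) -> bool:
        """Record a heartbeat; return False if the master rejects it."""
        if node in self.dead:
            if not self.allow_rejoin:
                return False          # node was declared failed and stays out
            self.dead.discard(node)   # re-join: forget the earlier failure
        self.last_seen_ms[node] = now_ms
        return True

    def sweep(self, now_ms: int) -> None:
        """Declare dead every node whose last heartbeat is older than the threshold."""
        threshold_ms = self.heartbeat_period_ms * self.max_missed
        for node, seen in self.last_seen_ms.items():
            if node not in self.dead and now_ms - seen > threshold_ms:
                self.dead.add(node)

    def usable(self) -> bool:
        """The cluster accepts queries only while no node is considered dead."""
        return not self.dead


def simulate(monitor: HeartbeatMonitor, delayed_gap_ms: int):
    """One node reports at t=0, then its next heartbeat is delayed by delayed_gap_ms."""
    monitor.on_heartbeat("nc1", 0)
    monitor.sweep(delayed_gap_ms)                           # master checks liveness first
    accepted = monitor.on_heartbeat("nc1", delayed_gap_ms)  # late heartbeat arrives
    return accepted, monitor.usable()


if __name__ == "__main__":
    # Small threshold, no re-join: the false failure sticks.
    print(simulate(HeartbeatMonitor(1000, 5), 8000))
    # Fix 1: a larger threshold tolerates the delay.
    print(simulate(HeartbeatMonitor(1000, 20), 8000))
    # Fix 2: the node is declared dead but is allowed to re-join.
    print(simulate(HeartbeatMonitor(1000, 5, allow_rejoin=True), 8000))
```

In this sketch the first configuration reproduces the reported symptom (the node is alive but rejected, so the cluster stays unusable), while the other two correspond to fixes 1 and 2 from the list above.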



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
