Date: Fri, 11 Sep 2015 17:05:31 -0700
Subject: Re: [jira] [Commented] (ASTERIXDB-1076) False failures cause denying new queries
From: Yingyi Bu
To: dev@asterixdb.incubator.apache.org

Right, exposing the configuration parameters is a separate issue.

Best,
Yingyi

On Fri, Sep 11, 2015 at 5:03 PM, Ian Maxon (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741761#comment-14741761 ]
>
> Ian Maxon commented on ASTERIXDB-1076:
> --------------------------------------
>
> Oh, it's good that the heartbeats are at least not stuck in the big ol'
> WorkQueue. I was under the impression that was how it was.
>
> For addressing 1), the parameters for controlling heartbeat interval exist
> in Hyracks but they're command line args to the CC.
> So it is actually
> possible to change them: you just put them in the normal place where -Xmx
> and so on belong in the asterix-configuration.xml (I think, haven't
> tried... :) )
> It'd probably be easier/clearer to migrate them to be their own attributes
> in that file; otherwise it's kind of impossible to tell that the option
> exists in the first place.
>
> > False failures cause denying new queries
> > ----------------------------------------
> >
> >                 Key: ASTERIXDB-1076
> >                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
> >             Project: Apache AsterixDB
> >          Issue Type: Bug
> >          Components: AsterixDB
> >            Reporter: Yingyi Bu
> >            Assignee: Yingyi Bu
> >            Priority: Critical
> >
> > When the CPUs in the cluster are saturated with computation, the
> > heartbeats from slave nodes to the master node might get delayed. In this
> > case, the master node thinks a node has failed and can no longer add the
> > node back. Hence, the entire cluster is not usable and an instance
> > restart is needed.
> > Two things need to be fixed:
> > 1. (at least) expose AsterixDB configuration parameters to allow users
> >    to set a large heartbeat threshold;
> > 2. allow a node to leave and re-join a Hyracks cluster.
> > In the long term, we might need to investigate better liveness check
> > strategies.
> > To reproduce the issue, just overload the slave nodes' CPUs and you
> > will see it.
> > The exception "Asterix Cluster Global recovery is not yet complete and
> > The system is in ACTIVE state" will be thrown for upcoming queries.
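
For reference, the two fixes listed in the quoted issue description (a
configurable heartbeat threshold, and letting a node re-join after it was
declared failed) can be sketched roughly as below. This is only an
illustration and not the Hyracks code: the class HeartbeatTracker, its
methods, and the two parameters heartbeatPeriodMs and maxLapsePeriods are
hypothetical names, and the numbers in the usage note are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a master-side liveness tracker.
// Not the Hyracks implementation; all names here are assumptions.
class HeartbeatTracker {
    private final long heartbeatPeriodMs; // expected interval between heartbeats
    private final int maxLapsePeriods;    // missed periods tolerated before "failed"
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Boolean> alive = new HashMap<>();

    HeartbeatTracker(long heartbeatPeriodMs, int maxLapsePeriods) {
        this.heartbeatPeriodMs = heartbeatPeriodMs;
        this.maxLapsePeriods = maxLapsePeriods;
    }

    // Record a heartbeat. A node that was marked failed becomes alive
    // again, which is the "leave and re-join" behavior from the issue.
    synchronized void onHeartbeat(String nodeId, long nowMs) {
        lastHeartbeat.put(nodeId, nowMs);
        alive.put(nodeId, true);
    }

    // Periodic check: mark a node failed only when its last heartbeat is
    // older than heartbeatPeriodMs * maxLapsePeriods. A larger threshold
    // tolerates heartbeats delayed by saturated CPUs.
    synchronized void sweep(long nowMs) {
        long thresholdMs = heartbeatPeriodMs * maxLapsePeriods;
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() > thresholdMs) {
                alive.put(e.getKey(), false);
            }
        }
    }

    // Unknown nodes are reported as not alive.
    synchronized boolean isAlive(String nodeId) {
        return alive.getOrDefault(nodeId, false);
    }
}
```

With new HeartbeatTracker(10000, 5), a node whose last heartbeat was at
t=0 is still alive at a sweep at t=40000, is marked failed at a sweep at
t=60000, and is alive again as soon as its next heartbeat arrives.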