From: David Harvey
Date: Sun, 9 Sep 2018 17:07:22 -0400
Subject: Re: Critical worker threads liveness checking drawbacks
To: dev@ignite.apache.org

It would be safer to restart the entire cluster
than to remove the last node for a cache that should be redundant.

On Sun, Sep 9, 2018, 4:00 PM Andrey Gura wrote:

> Hi,
>
> I agree with Yakov that we can provide an option that manages the worker
> liveness checker's behavior when it observes that some worker has been
> blocked for too long.
> At least it will be a workaround for cases where node failure is too
> annoying.
>
> A backups count threshold sounds good, but I don't understand how it will
> help in case of the cluster hanging.
>
> The simplest solution here is to alert when some critical worker is
> blocked (we can improve WorkersRegistry for this purpose and
> expose the list of blocked workers) and optionally call the
> system-configured failure processor. BTW, the failure processor can be
> extended in order to perform any checks (e.g. backup count) and decide
> whether it should stop the node or not.
> On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov wrote:
> >
> > David, Yakov, I understand your fears. But liveness checks deal with
> > _critical_ conditions, i.e. when such a condition is met we consider the
> > node totally broken, and there is no sense in keeping it alive regardless
> > of the data it contains. If we want to give it a chance, then the
> > condition (long fsync etc.) should not be considered critical at all.
> >
> > On Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov wrote:
> >
> > > Agree with David. We need to have an opportunity to set a backups count
> > > threshold (at runtime also!) that will not allow any automatic stop if
> > > there will be data loss. Andrey, what do you think?
> > >
> > > --Yakov
> >
> > --
> > Best regards,
> > Andrey Kuznetsov.
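
The idea quoted above, a failure processor that checks backup counts before deciding whether to stop a node, could be sketched roughly as follows. This is only an illustration of the decision logic under stated assumptions, not Ignite code: `CacheState`, `on_worker_blocked`, and `min_remaining_copies` are hypothetical names, and the sketch assumes the node can find out, for each cache it hosts, the smallest number of live copies any of its partitions currently has.

```python
from dataclasses import dataclass


@dataclass
class CacheState:
    """Hypothetical view of one cache hosted on the failing node."""
    name: str
    # Fewest live copies (primary + backups, including this node's copy)
    # across the partitions this node holds for the cache.
    min_live_copies: int


def on_worker_blocked(worker, caches, min_remaining_copies=1):
    """Decide what to do when a critical worker is blocked for too long.

    Returns ("STOP_NODE", []) if every cache would still have at least
    `min_remaining_copies` live copies after this node stops; otherwise
    returns ("ALERT_ONLY", [names of caches that would lose data]).
    """
    # Stopping this node removes exactly one copy of each partition it holds.
    at_risk = [c.name for c in caches
               if c.min_live_copies - 1 < min_remaining_copies]
    if at_risk:
        # Stopping would remove the last copy of some data: only raise an alert.
        return ("ALERT_ONLY", at_risk)
    return ("STOP_NODE", [])


if __name__ == "__main__":
    caches = [CacheState("orders", 3), CacheState("users", 1)]
    print(on_worker_blocked("checkpoint-thread", caches))
```

In this sketch the alert is always the baseline action and stopping the node is the optional escalation, which matches the "alert, then optionally call the failure processor" ordering in the quoted proposal. The threshold is a plain argument, so it could be changed at runtime as Yakov asks.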