From: Yakov Zhdanov
Date: Fri, 7 Sep 2018 19:10:06 +0300
Subject: Re: Critical worker threads liveness checking drawbacks
To: dev@ignite.apache.org

Yes, and you should suggest a solution, e.g. throttle rebalancing threads
more to produce less load. What you are suggesting kills the idea of this
enhancement.

--Yakov

2018-09-07 19:03 GMT+03:00 Andrey Kuznetsov:

> Yakov,
>
> Thanks for the reply. Indeed, the initial design assumed node termination
> when a hanging critical thread has been detected. But sometimes this looks
> inappropriate. Suppose, for example, that fsync in the WAL writer thread
> takes too long, and we terminate the node. Upon rebalancing, this may lead
> to long fsyncs on other nodes due to increased per-node load, hence we can
> terminate the next node as well. Eventually we can collapse the entire
> cluster. Is this a possible scenario?
>
> Fri, 7 Sep 2018 at 18:44, Yakov Zhdanov:
>
> > Andrey,
> >
> > I don't understand your point. In my opinion, the idea of these changes
> > is to make the cluster more stable and responsive by eliminating hung
> > nodes. I would not make too much of a distinction between threads
> > trapped in a deadlock and threads hanging on fsync calls for too long.
> > Both situations lead to increasing latency in the cluster until it
> > becomes fully unavailable.
> >
> > So, killing a node hanging on fsync may be reasonable. Agree?
> >
> > You may implement the approach where you have warning messages in the
> > logs by default, but a termination option should also be available.
> >
> > Thanks!
> >
> > --Yakov
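
For illustration only, the "warn by default, terminate as an option" approach discussed in this thread could be sketched roughly as below. This is not Ignite's actual implementation or API: the class LivenessWatchdog, the Policy enum, and the heartbeat/check methods are hypothetical names, and the sketch assumes each critical worker reports progress explicitly while the caller supplies the clock value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a liveness check; not Ignite's real API. */
final class LivenessWatchdog {
    /** Reaction to a stalled critical worker: log only, or also terminate. */
    public enum Policy { WARN, TERMINATE }

    /** Last reported progress time per worker name, in milliseconds. */
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long timeoutMillis;
    private final Policy policy;
    /** Assumed hook that stops the node; supplied by the caller. */
    private final Runnable terminator;

    LivenessWatchdog(long timeoutMillis, Policy policy, Runnable terminator) {
        this.timeoutMillis = timeoutMillis;
        this.policy = policy;
        this.terminator = terminator;
    }

    /** Called by a critical worker whenever it makes progress. */
    void heartbeat(String worker, long nowMillis) {
        lastHeartbeat.put(worker, nowMillis);
    }

    /** Returns the workers silent for longer than the timeout and applies the policy. */
    List<String> check(long nowMillis) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                stalled.add(e.getKey());
            }
        }
        // A warning is always logged, whatever the policy.
        for (String worker : stalled) {
            System.err.println("WARN: critical worker looks blocked: " + worker);
        }
        // Termination happens only when explicitly configured.
        if (!stalled.isEmpty() && policy == Policy.TERMINATE) {
            terminator.run();
        }
        return stalled;
    }
}
```

With Policy.WARN a slow fsync in a WAL writer only produces a log line, which avoids the cascading-termination scenario Andrey describes; with Policy.TERMINATE the node is stopped, which is the behaviour Yakov argues for. A real check would run periodically and would have to decide what counts as progress for each worker.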