From dev-return-38803-archive-asf-public=cust-asf.ponee.io@ignite.apache.org  Fri Sep  7 18:10:10 2018
Mailing-List: contact dev-help@ignite.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@ignite.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@ignite.apache.org>
List-Post: <mailto:dev@ignite.apache.org>
List-Id: <dev.ignite.apache.org>
Reply-To: dev@ignite.apache.org
Delivered-To: mailing list dev@ignite.apache.org
MIME-Version: 1.0
In-Reply-To: <CA+Bp9OCwoKLYWdVWEGphdxrUosLsBgt7gmhNrs9U0+VjjkS2gA@mail.gmail.com>
References: <CA+Bp9OBgFExcLdv0D6EmD5=Mj34gct7H=sWyjNQrGWKuo5xcTA@mail.gmail.com>
 <CAGcMBHho-0rLB7h=FFq430WFgETv6c4TN5er2T5i0QPQ_gNyCg@mail.gmail.com> <CA+Bp9OCwoKLYWdVWEGphdxrUosLsBgt7gmhNrs9U0+VjjkS2gA@mail.gmail.com>
From: Yakov Zhdanov <yzhdanov@apache.org>
Date: Fri, 7 Sep 2018 19:10:06 +0300
X-Gmail-Original-Message-ID: <CAGcMBHjoFv1KGkvfVsSmzm5LHkDHsRHNkS_uVagf=42Z96u19A@mail.gmail.com>
Message-ID: <CAGcMBHjoFv1KGkvfVsSmzm5LHkDHsRHNkS_uVagf=42Z96u19A@mail.gmail.com>
Subject: Re: Critical worker threads liveness checking drawbacks
To: dev@ignite.apache.org
Content-Type: multipart/alternative; boundary="000000000000ba9eae05754a3d83"

--000000000000ba9eae05754a3d83
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Yes, and you should suggest a solution, e.g. throttle rebalancing threads
more so that they produce less load.

What you are suggesting kills the idea of this enhancement.
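
For reference, the configurable behaviour discussed earlier in this thread
(warnings in the logs by default, with termination available as an option)
could be sketched roughly as below. This is only an illustration under
stated assumptions: LivenessWatchdog, Policy, heartbeat and check are
hypothetical names, not Ignite classes or APIs, and time is passed in
explicitly so that the sketch stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; none of these names come from Apache Ignite. */
final class LivenessWatchdog {
    /** What to do when a critical worker stops reporting progress. */
    enum Policy { WARN_ONLY, TERMINATE }

    private final long timeoutMillis;
    private final Policy policy;
    private final Runnable terminator; // assumed hook that would stop the node
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    LivenessWatchdog(long timeoutMillis, Policy policy, Runnable terminator) {
        this.timeoutMillis = timeoutMillis;
        this.policy = policy;
        this.terminator = terminator;
    }

    /** Called by a critical worker each time it makes progress. */
    void heartbeat(String worker, long nowMillis) {
        lastHeartbeat.put(worker, nowMillis);
    }

    /** Returns the workers silent for longer than the timeout and applies the policy. */
    List<String> check(long nowMillis) {
        List<String> hung = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                hung.add(e.getKey());
            }
        }
        // A warning is always logged, whatever the policy.
        for (String worker : hung) {
            System.err.println("WARN: critical worker blocked: " + worker);
        }
        // Termination happens only when it has been explicitly enabled.
        if (!hung.isEmpty() && policy == Policy.TERMINATE) {
            terminator.run();
        }
        return hung;
    }
}
```

With Policy.WARN_ONLY a long fsync in a worker only produces a log line;
with Policy.TERMINATE the same check also invokes the terminator hook.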

--Yakov

2018-09-07 19:03 GMT+03:00 Andrey Kuznetsov <stkuzma@gmail.com>:

> Yakov,
>
> Thanks for the reply. Indeed, the initial design assumed node
> termination when a hanging critical thread is detected. But sometimes
> this looks inappropriate. Suppose, for example, that an fsync in the WAL
> writer thread takes too long and we terminate the node. Upon
> rebalancing, this may lead to long fsyncs on other nodes due to the
> increased per-node load, so we may terminate the next node as well.
> Eventually the entire cluster can collapse. Is this a possible scenario?
>
> Fri, 7 Sep 2018 at 18:44, Yakov Zhdanov <yzhdanov@apache.org>:
>
> > Andrey,
> >
> > I don't understand your point. In my opinion, the idea of these
> > changes is to make the cluster more stable and responsive by
> > eliminating hung nodes. I would not draw much of a distinction between
> > threads trapped in a deadlock and threads hanging on fsync calls for
> > too long. Both situations lead to increasing latency in the cluster,
> > up to its full unavailability.
> >
> > So, killing a node that hangs on fsync may be reasonable. Agree?
> >
> > You may implement an approach where warning messages are logged by
> > default, but a termination option should also be available.
> >
> > Thanks!
> >
> > --Yakov
> >
> >
>

--000000000000ba9eae05754a3d83--