From: Maxim Muzafarov
Date: Thu, 27 Sep 2018 16:00:34 +0300
Subject: Re: Critical worker threads liveness checking drawbacks
To: dev@ignite.apache.org

Folks,
I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
an exchange future wrapped with the `blockingSectionEnd` method twice. Is
this correct? I just want to understand this change and how I should use it
in the future. Should I file a new issue to fix this? I think the
`blockingSectionBegin` method should be used here.

-------------
blockingSectionEnd();

try {
    resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
}
finally {
    blockingSectionEnd();
}
-------------

[1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684

On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur wrote:
> Andrey Gura, thank you for the answer!
>
> I agree that wrapping the 'init' method reduces the profit of the
> watchdog service in the case of the PME worker, but in other cases we
> should wrap all possibly long sections in
> GridDhtPartitionsExchangeFuture. For example, the 'onCacheChangeRequest'
> method or 'cctx.affinity().onCacheChangeRequest' inside it, because it
> may take significant time (reproducer attached).
>
> I only want to point out a possible issue which may allow an end user to
> halt the Ignite cluster accidentally.
>
> I'm sure that PME experts know how to fix this issue properly.
>
> On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura wrote:
> >
> > Vyacheslav,
> >
> > The exchange worker is strongly tied to
> > GridDhtPartitionsExchangeFuture#init, and that is ok. The exchange
> > worker also shouldn't be blocked for a long time, but in reality it
> > happens. It also means that your change doesn't make sense.
> >
> > What actually makes sense is identification of the places which block
> > intentionally. Maybe some places/actions should be braced by blocking
> > guards.
> >
> > If you have failing tests, please make sure that your failureHandler
> > is NoOpFailureHandler or any other handler with ignoredFailureTypes =
> > [CRITICAL_WORKER_BLOCKED].
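To make the pairing concrete, here is a minimal self-contained sketch of the begin/end contract. The names are hypothetical, loosely modeled on Ignite's internal GridWorker blocking-section API; this is not the actual implementation, only an illustration of why an `end()` where a `begin()` belongs leaves liveness monitoring armed during a long wait.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model: begin() suspends liveness monitoring for the
// worker, end() re-arms it and refreshes the heartbeat timestamp.
class WorkerLiveness {
    private final AtomicLong lastHeartbeat = new AtomicLong(System.currentTimeMillis());
    private volatile boolean inBlockingSection;

    void blockingSectionBegin() {
        inBlockingSection = true;            // watchdog skips this worker
    }

    void blockingSectionEnd() {
        inBlockingSection = false;           // monitoring resumes...
        lastHeartbeat.set(System.currentTimeMillis()); // ...from "now"
    }

    boolean consideredBlocked(long timeoutMs) {
        return !inBlockingSection
            && System.currentTimeMillis() - lastHeartbeat.get() > timeoutMs;
    }
}

public class BlockingSectionDemo {
    public static void main(String[] args) throws InterruptedException {
        WorkerLiveness w = new WorkerLiveness();

        // Correct pairing: a long wait inside begin()/end() is ignored.
        w.blockingSectionBegin();
        Thread.sleep(50);                    // stands in for exchFut.get(...)
        System.out.println("blocked during wait: " + w.consideredBlocked(10));
        w.blockingSectionEnd();

        // The slip in the quoted snippet: calling end() where begin()
        // belongs leaves monitoring armed during the long wait, so the
        // worker can be reported as blocked and the failure handler fired.
        w.blockingSectionEnd();              // should have been begin()
        Thread.sleep(50);
        System.out.println("false positive: " + w.consideredBlocked(10));
    }
}
```

Under this model the first check prints `false` and the second prints `true`, which is exactly the false-positive risk of the doubled `blockingSectionEnd`.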
> >
> > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur wrote:
> > >
> > > Hi Igniters!
> > >
> > > Thank you for this important improvement!
> > >
> > > I've looked through the implementation and noticed that
> > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > blocked section. This means it is easy to halt the node in case of
> > > long-running actions during PME, for example when we create a cache
> > > with a StoreFactory which connects to a 3rd-party DB.
> > >
> > > I'm not sure that this is the right behavior.
> > >
> > > I filed the issue [1] and prepared the PR [2] with a reproducer and
> > > a possible fix.
> > >
> > > Andrey, could you please take a look and confirm that it makes
> > > sense?
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > [2] https://github.com/apache/ignite/pull/4845
> > >
> > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov wrote:
> > > >
> > > > Denis,
> > > >
> > > > I've created the ticket [1] with a short description of the
> > > > functionality.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > >
> > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda wrote:
> > > > >
> > > > > Andrey K. and G.,
> > > > >
> > > > > Thanks, do we have a documentation ticket created? Prachi
> > > > > (copied) can help with the documentation.
> > > > >
> > > > > --
> > > > > Denis
> > > > >
> > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura wrote:
> > > > > >
> > > > > > Andrey,
> > > > > >
> > > > > > finally your change is merged to the master branch.
> > > > > > Congratulations and thank you very much! :)
> > > > > >
> > > > > > I think the next step is a feature that will allow signalling
> > > > > > about blocked threads to monitoring tools via an MXBean.
> > > > > >
> > > > > > I hope you will continue development of this feature and
> > > > > > provide your vision in a new JIRA issue.
> > > > > >
> > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <stkuzma@gmail.com> wrote:
> > > > > > >
> > > > > > > David, Maxim!
> > > > > > >
> > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt
> > > > > > > all of them right now: their scope is much broader than the
> > > > > > > scope of the change I implement. I have had a talk with a
> > > > > > > group of Ignite committers, and we agreed to complete the
> > > > > > > change as follows.
> > > > > > > - Blocking instructions in system-critical threads which may
> > > > > > > reasonably last long should be explicitly excluded from the
> > > > > > > monitoring.
> > > > > > > - Failure handlers should have a setting to suppress some
> > > > > > > failures on a per-failure-type basis.
> > > > > > > According to this, I have updated the implementation: [1]
> > > > > > >
> > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > >
> > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey <syssoftsol@gmail.com> wrote:
> > > > > > > >
> > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > thread, and kill the node running that. From a language
> > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > distributed system is which thread started, earliest, the
> > > > > > > > work it has not completed, and when that thread last made
> > > > > > > > forward progress. You don't want to kill a node because a
> > > > > > > > thread is waiting on a lock held by a thread that went
> > > > > > > > off-node and has not gotten a response. If you don't
> > > > > > > > understand the dependency relationships, you will make
> > > > > > > > incorrect recovery decisions.
> > > > > > > >
> > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <maxmuzaf@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > >
> > > > > > > > > First,
> > > > > > > > > - Ignore uninterruptible actions (e.g. worker/service
> > > > > > > > > shutdown)
> > > > > > > > > - Long I/O operations (there should be a configurable
> > > > > > > > > timeout for each type of usage)
> > > > > > > > > - Infinite loops
> > > > > > > > > - Stalled/deadlocked threads (and/or too many parked
> > > > > > > > > threads, excluding I/O)
> > > > > > > > >
> > > > > > > > > Second,
> > > > > > > > > - The working queue shows no progress (e.g. disco,
> > > > > > > > > exchange queues)
> > > > > > > > > - Work hasn't been completed since the last heartbeat
> > > > > > > > > (checking milestones)
> > > > > > > > > - Too many system resources used by a thread for a long
> > > > > > > > > period of time (allocated memory, CPU)
> > > > > > > > > - Timing fields associated with each thread status have
> > > > > > > > > exceeded a maximum time limit.
> > > > > > > > >
> > > > > > > > > Third (not too many options here),
> > > > > > > > > - `log everything` should be the default behaviour in
> > > > > > > > > all these cases, since it may be difficult to find the
> > > > > > > > > cause after a restart.
> > > > > > > > > - Wait some interval of time and kill the hanging node
> > > > > > > > > (the cluster should be configured to be stable enough)
> > > > > > > > >
> > > > > > > > > Questions,
> > > > > > > > > - Not sure, but can workers miss their heartbeat
> > > > > > > > > deadlines if the CPU loads up to 80%-90%?
> > > > > > > > > Bursts of momentary overloads can be expected behaviour
> > > > > > > > > as a normal part of system operations.
> > > > > > > > > - Why did we decide that critical threads should monitor
> > > > > > > > > each other? For instance, if all the tasks were blocked
> > > > > > > > > and unable to run, a node reset would never occur. As
> > > > > > > > > for me, a better solution is to use a separate monitor
> > > > > > > > > thread or pool (maybe both with software and hardware
> > > > > > > > > checks) that not only checks heartbeats but monitors the
> > > > > > > > > rest of the system as well.
> > > > > > > > >
> > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoftsol@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > It would be safer to restart the entire cluster than
> > > > > > > > > > to remove the last node for a cache that should be
> > > > > > > > > > redundant.
> > > > > > > > > >
> > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <agura@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I agree with Yakov that we can provide some option
> > > > > > > > > > > that manages the worker liveness checker behavior in
> > > > > > > > > > > case it observes that some worker has been blocked
> > > > > > > > > > > too long. At least it will be some workaround for
> > > > > > > > > > > cases when node failure is too annoying.
> > > > > > > > > > >
> > > > > > > > > > > The backups count threshold sounds good, but I don't
> > > > > > > > > > > understand how it will help in case of cluster
> > > > > > > > > > > hanging.
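The "backups count threshold" guard debated above can be reduced to a one-line decision. A sketch under assumed names (this is not Ignite's actual FailureHandler API, just the shape of the check being discussed):

```java
// Hypothetical sketch: a failure handler that refuses to stop a node
// when doing so could drop the last remaining copy of a partition.
public class BackupAwareHandlerDemo {
    static boolean shouldStopNode(int aliveBackups, int backupsThreshold,
                                  boolean criticalWorkerBlocked) {
        // Stop only if the failure is critical AND enough backups remain
        // that stopping this node cannot cause data loss.
        return criticalWorkerBlocked && aliveBackups >= backupsThreshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldStopNode(2, 1, true));  // enough backups: stop
        System.out.println(shouldStopNode(0, 1, true));  // last copy: keep alive
        System.out.println(shouldStopNode(2, 1, false)); // no critical failure
    }
}
```

Note that this also illustrates Andrey's concern: if the whole cluster hangs, every node keeps enough "alive" backups on paper, so the threshold alone does not recover a hung cluster.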
> > > > > > > > > > >
> > > > > > > > > > > The simplest solution here is an alert in case of
> > > > > > > > > > > blocking of some critical worker (we can improve
> > > > > > > > > > > WorkersRegistry for this purpose and expose the
> > > > > > > > > > > list of blocked workers), and optionally calling
> > > > > > > > > > > the configured failure processor. BTW, the failure
> > > > > > > > > > > processor can be extended in order to perform any
> > > > > > > > > > > checks (e.g. backup count) and decide whether it
> > > > > > > > > > > should stop the node or not.
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stkuzma@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > David, Yakov, I understand your fears. But
> > > > > > > > > > > > liveness checks deal with _critical_ conditions,
> > > > > > > > > > > > i.e. when such a condition is met we conclude
> > > > > > > > > > > > that the node is totally broken, and there is no
> > > > > > > > > > > > sense in keeping it alive regardless of the data
> > > > > > > > > > > > it contains. If we want to give it a chance, then
> > > > > > > > > > > > the condition (long fsync etc.) should not be
> > > > > > > > > > > > considered critical at all.
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <yzhdanov@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Agree with David. We need to have an
> > > > > > > > > > > > > opportunity to set a backups count threshold
> > > > > > > > > > > > > (at runtime also!) that will not allow any
> > > > > > > > > > > > > automatic stop if there would be a data loss.
> > > > > > > > > > > > > Andrey, what do you think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > --Yakov
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Maxim Muzafarov
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Andrey Kuznetsov.
> > > >
> > > > --
> > > > Best regards,
> > > > Andrey Kuznetsov.
> > >
> > > --
> > > Best Regards, Vyacheslav D.
>
> --
> Best Regards, Vyacheslav D.

-- 
Maxim Muzafarov
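As a closing illustration of the "missed heartbeat deadline" check Maxim enumerates upthread: one watchdog pass over a registry of workers can be sketched with JDK classes only. Worker names and the registry shape are hypothetical; this is not Ignite's WorkersRegistry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a single watchdog pass: any worker that has
// been silent longer than its allowed heartbeat interval is reported
// as making no progress.
public class HeartbeatCheckDemo {
    static final class Worker {
        final String name;
        final long lastHeartbeatMs; // last observed heartbeat timestamp
        final long maxSilenceMs;    // allowed silence before the alarm

        Worker(String name, long lastHeartbeatMs, long maxSilenceMs) {
            this.name = name;
            this.lastHeartbeatMs = lastHeartbeatMs;
            this.maxSilenceMs = maxSilenceMs;
        }
    }

    static List<String> blockedWorkers(List<Worker> workers, long nowMs) {
        List<String> blocked = new ArrayList<>();
        for (Worker w : workers)
            if (nowMs - w.lastHeartbeatMs > w.maxSilenceMs)
                blocked.add(w.name);
        return blocked;
    }

    public static void main(String[] args) {
        long now = 100_000;
        List<Worker> workers = List.of(
            new Worker("disco-event-worker", 99_500, 2_000),  // fresh
            new Worker("exchange-worker",    90_000, 2_000),  // silent 10s
            new Worker("wal-flusher",        97_000, 5_000)); // within limit
        System.out.println(blockedWorkers(workers, now));
    }
}
```

A per-worker `maxSilenceMs` reflects the thread's point that long I/O needs a configurable timeout per type of usage, rather than one global deadline.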