From: Maxim Muzafarov
Date: Thu, 27 Sep 2018 16:00:34 +0300
Subject: Re: Critical worker threads liveness checking drawbacks
To: dev@ignite.apache.org

Folks,
I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
an exchange future wrapped with the `blockingSectionEnd` method twice. Is
this correct? I just want to understand this change and how I should use it
in the future. Should I file a new issue to fix this? I think the
`blockingSectionBegin` method should be used here.

-------------
blockingSectionEnd();

try {
    resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
}
finally {
    blockingSectionEnd();
}
-------------

[1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684

On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur wrote:
> Andrey Gura, thank you for the answer!
>
> I agree that wrapping the 'init' method reduces the profit of the
> watchdog service in the case of the PME worker, but in other cases we
> should wrap all possibly long sections in
> GridDhtPartitionsExchangeFuture. For example, the 'onCacheChangeRequest'
> method or 'cctx.affinity().onCacheChangeRequest' inside it, because it
> may take significant time (reproducer attached).
>
> I only want to point out a possible issue which may allow an end user to
> halt the Ignite cluster accidentally.
>
> I'm sure that PME experts know how to fix this issue properly.
>
> On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura wrote:
> >
> > Vyacheslav,
> >
> > The exchange worker is strongly tied to
> > GridDhtPartitionsExchangeFuture#init, and that is ok. The exchange
> > worker also shouldn't be blocked for a long time, but in reality it
> > happens. It also means that your change doesn't make sense.
> >
> > What actually makes sense is identification of the places which block
> > intentionally. Maybe some places/actions should be braced by blocking
> > guards.
> >
> > If you have failing tests, please make sure that your failureHandler
> > is NoOpFailureHandler or any other handler with ignoredFailureTypes =
> > [CRITICAL_WORKER_BLOCKED].
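To make the pairing concrete, here is a minimal self-contained sketch of the begin/end contract. The names are hypothetical, loosely modeled on Ignite's internal GridWorker blocking-section API; this is not the actual implementation, only an illustration of why an `end()` where a `begin()` belongs leaves liveness monitoring armed during a long wait.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model: begin() suspends liveness monitoring for the
// worker, end() re-arms it and refreshes the heartbeat timestamp.
class WorkerLiveness {
    private final AtomicLong lastHeartbeat = new AtomicLong(System.currentTimeMillis());
    private volatile boolean inBlockingSection;

    void blockingSectionBegin() {
        inBlockingSection = true;            // watchdog skips this worker
    }

    void blockingSectionEnd() {
        inBlockingSection = false;           // monitoring resumes...
        lastHeartbeat.set(System.currentTimeMillis()); // ...from "now"
    }

    boolean consideredBlocked(long timeoutMs) {
        return !inBlockingSection
            && System.currentTimeMillis() - lastHeartbeat.get() > timeoutMs;
    }
}

public class BlockingSectionDemo {
    public static void main(String[] args) throws InterruptedException {
        WorkerLiveness w = new WorkerLiveness();

        // Correct pairing: a long wait inside begin()/end() is ignored.
        w.blockingSectionBegin();
        Thread.sleep(50);                    // stands in for exchFut.get(...)
        System.out.println("blocked during wait: " + w.consideredBlocked(10));
        w.blockingSectionEnd();

        // The slip in the quoted snippet: calling end() where begin()
        // belongs leaves monitoring armed during the long wait, so the
        // worker can be reported as blocked and the failure handler fired.
        w.blockingSectionEnd();              // should have been begin()
        Thread.sleep(50);
        System.out.println("false positive: " + w.consideredBlocked(10));
    }
}
```

Under this model the first check prints `false` and the second prints `true`, which is exactly the false-positive risk of the doubled `blockingSectionEnd`.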
> >
> > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur wrote:
> > >
> > > Hi Igniters!
> > >
> > > Thank you for this important improvement!
> > >
> > > I've looked through the implementation and noticed that
> > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > blocked section. This means it is easy to halt the node in case of
> > > long-running actions during PME, for example when we create a cache
> > > with a StoreFactory which connects to a 3rd-party DB.
> > >
> > > I'm not sure that this is the right behavior.
> > >
> > > I filed the issue [1] and prepared the PR [2] with a reproducer and
> > > a possible fix.
> > >
> > > Andrey, could you please take a look and confirm that it makes
> > > sense?
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > [2] https://github.com/apache/ignite/pull/4845
> > >
> > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov wrote:
> > > >
> > > > Denis,
> > > >
> > > > I've created the ticket [1] with a short description of the
> > > > functionality.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > >
> > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda wrote:
> > > > >
> > > > > Andrey K. and G.,
> > > > >
> > > > > Thanks, do we have a documentation ticket created? Prachi
> > > > > (copied) can help with the documentation.
> > > > >
> > > > > --
> > > > > Denis
> > > > >
> > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura wrote:
> > > > > >
> > > > > > Andrey,
> > > > > >
> > > > > > finally your change is merged to the master branch.
> > > > > > Congratulations and thank you very much! :)
> > > > > >
> > > > > > I think the next step is a feature that will allow signalling
> > > > > > about blocked threads to monitoring tools via an MXBean.
> > > > > >
> > > > > > I hope you will continue development of this feature and
> > > > > > provide your vision in a new JIRA issue.
> > > > > >
> > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <stkuzma@gmail.com> wrote:
> > > > > > >
> > > > > > > David, Maxim!
> > > > > > >
> > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt
> > > > > > > all of them right now: their scope is much broader than the
> > > > > > > scope of the change I implement. I have had a talk with a
> > > > > > > group of Ignite committers, and we agreed to complete the
> > > > > > > change as follows.
> > > > > > > - Blocking instructions in system-critical threads which may
> > > > > > > reasonably last long should be explicitly excluded from the
> > > > > > > monitoring.
> > > > > > > - Failure handlers should have a setting to suppress some
> > > > > > > failures on a per-failure-type basis.
> > > > > > > According to this, I have updated the implementation: [1]
> > > > > > >
> > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > >
> > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey <syssoftsol@gmail.com> wrote:
> > > > > > > >
> > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > thread, and kill the node running that. From a language
> > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > distributed system is which thread started, earliest, the
> > > > > > > > work it has not completed, and when that thread last made
> > > > > > > > forward progress. You don't want to kill a node because a
> > > > > > > > thread is waiting on a lock held by a thread that went
> > > > > > > > off-node and has not gotten a response. If you don't
> > > > > > > > understand the dependency relationships, you will make
> > > > > > > > incorrect recovery decisions.
> > > > > > > >
> > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <maxmuzaf@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > >
> > > > > > > > > First,
> > > > > > > > > - Ignore uninterruptible actions (e.g. worker/service
> > > > > > > > > shutdown)
> > > > > > > > > - Long I/O operations (there should be a configurable
> > > > > > > > > timeout for each type of usage)
> > > > > > > > > - Infinite loops
> > > > > > > > > - Stalled/deadlocked threads (and/or too many parked
> > > > > > > > > threads, excluding I/O)
> > > > > > > > >
> > > > > > > > > Second,
> > > > > > > > > - The working queue shows no progress (e.g. disco,
> > > > > > > > > exchange queues)
> > > > > > > > > - Work hasn't been completed since the last heartbeat
> > > > > > > > > (checking milestones)
> > > > > > > > > - Too many system resources used by a thread for a long
> > > > > > > > > period of time (allocated memory, CPU)
> > > > > > > > > - Timing fields associated with each thread status have
> > > > > > > > > exceeded a maximum time limit.
> > > > > > > > >
> > > > > > > > > Third (not too many options here),
> > > > > > > > > - `log everything` should be the default behaviour in
> > > > > > > > > all these cases, since it may be difficult to find the
> > > > > > > > > cause after a restart.
> > > > > > > > > - Wait some interval of time and kill the hanging node
> > > > > > > > > (the cluster should be configured to be stable enough)
> > > > > > > > >
> > > > > > > > > Questions,
> > > > > > > > > - Not sure, but can workers miss their heartbeat
> > > > > > > > > deadlines if the CPU loads up to 80%-90%?
> > > > > > > > > Bursts of momentary overloads can be expected behaviour
> > > > > > > > > as a normal part of system operations.
> > > > > > > > > - Why did we decide that critical threads should monitor
> > > > > > > > > each other? For instance, if all the tasks were blocked
> > > > > > > > > and unable to run, a node reset would never occur. As
> > > > > > > > > for me, a better solution is to use a separate monitor
> > > > > > > > > thread or pool (maybe both with software and hardware
> > > > > > > > > checks) that not only checks heartbeats but monitors the
> > > > > > > > > rest of the system as well.
> > > > > > > > >
> > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoftsol@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > It would be safer to restart the entire cluster than
> > > > > > > > > > to remove the last node for a cache that should be
> > > > > > > > > > redundant.
> > > > > > > > > >
> > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <agura@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I agree with Yakov that we can provide some option
> > > > > > > > > > > that manages the worker liveness checker behavior in
> > > > > > > > > > > case it observes that some worker has been blocked
> > > > > > > > > > > too long. At least it will be some workaround for
> > > > > > > > > > > cases when node failure is too annoying.
> > > > > > > > > > >
> > > > > > > > > > > The backups count threshold sounds good, but I don't
> > > > > > > > > > > understand how it will help in case of cluster
> > > > > > > > > > > hanging.
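The "backups count threshold" guard debated above can be reduced to a one-line decision. A sketch under assumed names (this is not Ignite's actual FailureHandler API, just the shape of the check being discussed):

```java
// Hypothetical sketch: a failure handler that refuses to stop a node
// when doing so could drop the last remaining copy of a partition.
public class BackupAwareHandlerDemo {
    static boolean shouldStopNode(int aliveBackups, int backupsThreshold,
                                  boolean criticalWorkerBlocked) {
        // Stop only if the failure is critical AND enough backups remain
        // that stopping this node cannot cause data loss.
        return criticalWorkerBlocked && aliveBackups >= backupsThreshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldStopNode(2, 1, true));  // enough backups: stop
        System.out.println(shouldStopNode(0, 1, true));  // last copy: keep alive
        System.out.println(shouldStopNode(2, 1, false)); // no critical failure
    }
}
```

Note that this also illustrates Andrey's concern: if the whole cluster hangs, every node keeps enough "alive" backups on paper, so the threshold alone does not recover a hung cluster.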
> > > > > > > > > > >
> > > > > > > > > > > The simplest solution here is an alert in case of
> > > > > > > > > > > blocking of some critical worker (we can improve
> > > > > > > > > > > WorkersRegistry for this purpose and expose the
> > > > > > > > > > > list of blocked workers), and optionally calling
> > > > > > > > > > > the configured failure processor. BTW, the failure
> > > > > > > > > > > processor can be extended in order to perform any
> > > > > > > > > > > checks (e.g. backup count) and decide whether it
> > > > > > > > > > > should stop the node or not.
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stkuzma@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > David, Yakov, I understand your fears. But
> > > > > > > > > > > > liveness checks deal with _critical_ conditions,
> > > > > > > > > > > > i.e. when such a condition is met we conclude
> > > > > > > > > > > > that the node is totally broken, and there is no
> > > > > > > > > > > > sense in keeping it alive regardless of the data
> > > > > > > > > > > > it contains. If we want to give it a chance, then
> > > > > > > > > > > > the condition (long fsync etc.) should not be
> > > > > > > > > > > > considered critical at all.
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <yzhdanov@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Agree with David. We need to have an
> > > > > > > > > > > > > opportunity to set a backups count threshold
> > > > > > > > > > > > > (at runtime also!) that will not allow any
> > > > > > > > > > > > > automatic stop if there would be a data loss.
> > > > > > > > > > > > > Andrey, what do you think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > --Yakov
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Maxim Muzafarov
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Andrey Kuznetsov.
> > > >
> > > > --
> > > > Best regards,
> > > > Andrey Kuznetsov.
> > >
> > > --
> > > Best Regards, Vyacheslav D.
>
> --
> Best Regards, Vyacheslav D.

-- 
Maxim Muzafarov
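As a closing illustration of the "missed heartbeat deadline" check Maxim enumerates upthread: one watchdog pass over a registry of workers can be sketched with JDK classes only. Worker names and the registry shape are hypothetical; this is not Ignite's WorkersRegistry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a single watchdog pass: any worker that has
// been silent longer than its allowed heartbeat interval is reported
// as making no progress.
public class HeartbeatCheckDemo {
    static final class Worker {
        final String name;
        final long lastHeartbeatMs; // last observed heartbeat timestamp
        final long maxSilenceMs;    // allowed silence before the alarm

        Worker(String name, long lastHeartbeatMs, long maxSilenceMs) {
            this.name = name;
            this.lastHeartbeatMs = lastHeartbeatMs;
            this.maxSilenceMs = maxSilenceMs;
        }
    }

    static List<String> blockedWorkers(List<Worker> workers, long nowMs) {
        List<String> blocked = new ArrayList<>();
        for (Worker w : workers)
            if (nowMs - w.lastHeartbeatMs > w.maxSilenceMs)
                blocked.add(w.name);
        return blocked;
    }

    public static void main(String[] args) {
        long now = 100_000;
        List<Worker> workers = List.of(
            new Worker("disco-event-worker", 99_500, 2_000),  // fresh
            new Worker("exchange-worker",    90_000, 2_000),  // silent 10s
            new Worker("wal-flusher",        97_000, 5_000)); // within limit
        System.out.println(blockedWorkers(workers, now));
    }
}
```

A per-worker `maxSilenceMs` reflects the thread's point that long I/O needs a configurable timeout per type of usage, rather than one global deadline.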