From: Denis Magda
Subject: Re: Ignite Enhancement Proposal #7 (Internal problems detection)
Date: Wed, 22 Nov 2017 13:43:10 -0800
To: dev@ignite.apache.org
Message-Id: <374A6B8D-541A-4C18-B166-E6B4BC59030C@apache.org>

Just provide FailureProcessingPolicy with possible reactions:
- NOOP - exceptions will be reported, metrics will be triggered but an affected Ignite process won’t be touched.
- HALT (or STOP or KILL) - all the actions of NOOP + Ignite process termination.
- RESTART - NOOP actions + process restart.
- EXEC - execute a custom script provided by the user.

If needed, the policy can be set per known failure, such as OOM or persistence errors, so that the user can act accordingly based on the context.

--
Denis

> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov wrote:
>
> In the first iteration I would focus only on reporting facilities, to let
> the administrator spot dangerous situations. In the second phase, when all
> reporting and metrics are ready, we can think about some automatic actions.
>
> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov
> wrote:
>
>> Hi Anton,
>>
>> I don't think that we should shut down a node in case of IgniteOOMException.
>> If one node has no space, then the others probably don't have it either, so
>> rebalancing will cause IgniteOOM on all other nodes and will kill the whole
>> cluster. I think for some configurations the cluster should survive and allow
>> the user to clean the cache and/or add more nodes.
>>
>> Thanks,
>> Mikhail.
>>
>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <
>> avinogradov@gridgain.com> wrote:
>>
>>> Igniters,
>>>
>>> Internal problems may, and unfortunately do, cause unexpected cluster
>>> behavior.
>>> We should determine the behavior in case any internal problem happens.
>>>
>>> Well-known internal problems can be split into:
>>> 1) OOM or any other cause of a node crash
>>>
>>> 2) Situations requiring graceful node shutdown with custom notification
>>> - IgniteOutOfMemoryException
>>> - Persistence errors
>>> - ExchangeWorker exits with error
>>>
>>> 3) Performance issues that should be covered by metrics
>>> - GC STW duration
>>> - Timed out tasks and jobs
>>> - TX deadlock
>>> - Hung TX (waits for some service)
>>> - Java deadlocks
>>>
>>> I created a special issue [1] to make sure all these metrics will be
>>> presented in WebConsole or VisorConsole (which is preferred?)
>>>
>>> 4) Situations requiring an external monitoring implementation
>>> - GC STW duration exceeds the maximum possible length (node should be stopped
>>> before STW finishes)
>>>
>>> All these problems were reported by different people at different times,
>>> so we should reanalyze each of them and, possibly, find better ways to
>>> solve them than those described in the issues.
>>>
>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention something
>>> else :)
>>>
>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
>>> [2]
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
>>>
>>
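
To make the FailureProcessingPolicy idea from the top of this message more concrete, here is a rough Java sketch of a per-failure policy lookup with a default reaction. Everything in it is hypothetical: the class FailurePolicies, the FailureType enum and the method names on and policyFor are made up for illustration and are not part of the Ignite API. It only models the configuration side and does not perform any of the reactions.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Ignite.
final class FailurePolicies {
    /** The four reactions proposed in this thread. */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART, EXEC }

    /** Illustrative failure kinds, taken from the problems listed in the thread. */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_EXIT }

    // Reaction used when no per-failure override is configured.
    private final FailureProcessingPolicy defaultPolicy;

    // Per-failure overrides, keyed by failure kind.
    private final Map<FailureType, FailureProcessingPolicy> overrides =
        new EnumMap<>(FailureType.class);

    FailurePolicies(FailureProcessingPolicy defaultPolicy) {
        this.defaultPolicy = defaultPolicy;
    }

    /** Sets the reaction for one known failure type; returns this for chaining. */
    FailurePolicies on(FailureType type, FailureProcessingPolicy policy) {
        overrides.put(type, policy);
        return this;
    }

    /** Returns the configured reaction, falling back to the default. */
    FailureProcessingPolicy policyFor(FailureType type) {
        return overrides.getOrDefault(type, defaultPolicy);
    }
}
```

With this shape, something like new FailurePolicies(NOOP).on(PERSISTENCE_ERROR, HALT) would stop a node on persistence errors while leaving OOM as report-only, which is the kind of per-failure choice discussed above.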