From: Vladimir Ozerov
Date: Fri, 1 Dec 2017 16:27:35 +0300
Subject: Re: Internal problems requiring graceful node shutdown, reboot, etc.
To: dev@ignite.apache.org

Hi Dmitry,

I do not think it is a good idea to mix failures of different threads into a
single event type.

- Practice shows that the most common source of problems is the exchange thread.
- If the disco worker has died, the node will be excluded from the topology safely.
- "nio-acceptor" can be spawned from multiple places where GridNioServer is
  started, and not all of them are critical.
- "grid-nio-worker-tcp-comm" is an internal thread which doesn't do any complex
  processing, so the risk of its crash is minimal.

We could track most of them, but the death of different threads may call for
different actions on the user's side. So I propose to start with the exchange
thread only for now.

Another important point is that FailureProcessingPolicy should get enough
information about what happened in order to decide how to react. E.g., as I
explained earlier, IgniteOutOfMemoryException *is not a critical error*. Nasty,
but not deadly. And the node should not be stopped blindly in response to this
event.

Vladimir.

On Fri, Dec 1, 2017 at 3:50 AM, Denis Magda wrote:

> Hi Dmitriy,
>
> I'm totally for the FailureProcessingPolicy addition to
> IgniteConfiguration.
>
> Apart from this, may I ask you to create corresponding documentation tickets
> for the 2.4 release and the "documentation" component? Only for the improvements
> that are getting into the next release. Basically you can aggregate them if
> it helps. Feel free to assign the ticket to me right away.
>
> --
> Denis
>
> > On Nov 30, 2017, at 10:31 AM, Дмитрий Сорокин wrote:
> >
> > Hi, Igniters!
> >
> > We have a set of internal problems which require graceful node shutdown
> > or another configured reaction (see discussion thread
> > http://apache-ignite-developers.2346864.n4.nabble.com/Ignite-Enhancement-Proposal-7-Internal-problems-detection-td24460.html
> > ):
> > - IgniteOutOfMemoryException -
> > https://issues.apache.org/jira/browse/IGNITE-6892
> > - Persistence errors - https://issues.apache.org/jira/browse/IGNITE-6891
> > - ExchangeWorker exits with error -
> > https://issues.apache.org/jira/browse/IGNITE-6890
> >
> > First, I propose to reconsider the 3rd problem as "System worker exit while
> > the node is still running (node stopping process has not been started)",
> > because we have at least 5 worker classes whose running is critical for the
> > node to work.
> >
> > These workers are:
> > - partition-exchanger (ExchangeWorker)
> > - disco-event-worker
> > - nio-acceptor
> > - grid-nio-worker-tcp-comm-*
> > - grid-timeout-worker
> >
> > Second, I propose to use FailureProcessingPolicy (already implemented in
> > scope of task IGNITE-6890) to define the reaction to the 1st and 2nd detected
> > problems too. This policy can be configured similarly to SegmentationPolicy
> > in IgniteConfiguration.
> >
> > Opinions?
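
To make the point about the policy receiving "enough information" concrete, here is a minimal sketch of what that could look like. Everything in it is hypothetical: FailureKind, Action, FailureContext, PerKindPolicy and the onFailure signature are illustration-only names, not the interface implemented under IGNITE-6890, and the mapping of failures to actions is just one assumed configuration.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustration only: none of these types are the real Ignite API.
public class FailurePolicySketch {
    /** Hypothetical failure categories, taken from the three problems listed in the thread. */
    enum FailureKind { OUT_OF_MEMORY, PERSISTENCE_ERROR, SYSTEM_WORKER_EXIT }

    /** Hypothetical reactions a policy may choose. */
    enum Action { NOOP, STOP_NODE }

    /** Hypothetical context: what failed, in which worker (may be null), and why. */
    static final class FailureContext {
        final FailureKind kind;
        final String workerName;
        final Throwable error;

        FailureContext(FailureKind kind, String workerName, Throwable error) {
            this.kind = kind;
            this.workerName = workerName;
            this.error = error;
        }
    }

    /** Hypothetical policy contract: decide on a reaction from the full context. */
    interface FailureProcessingPolicy {
        Action onFailure(FailureContext ctx);
    }

    /** Maps each failure kind to an action and uses a fallback for unmapped kinds. */
    static final class PerKindPolicy implements FailureProcessingPolicy {
        private final Map<FailureKind, Action> actions = new EnumMap<>(FailureKind.class);
        private final Action fallback;

        PerKindPolicy(Action fallback) {
            this.fallback = fallback;
        }

        PerKindPolicy on(FailureKind kind, Action action) {
            actions.put(kind, action);
            return this;
        }

        @Override
        public Action onFailure(FailureContext ctx) {
            return actions.getOrDefault(ctx.kind, fallback);
        }
    }

    /** Assumed example configuration: stop on worker exit or persistence error, otherwise do nothing. */
    static FailureProcessingPolicy examplePolicy() {
        return new PerKindPolicy(Action.NOOP)
            .on(FailureKind.SYSTEM_WORKER_EXIT, Action.STOP_NODE)
            .on(FailureKind.PERSISTENCE_ERROR, Action.STOP_NODE);
    }

    public static void main(String[] args) {
        FailureProcessingPolicy policy = examplePolicy();

        FailureContext oom = new FailureContext(
            FailureKind.OUT_OF_MEMORY, null, new IllegalStateException("simulated out of memory"));
        FailureContext exchange = new FailureContext(
            FailureKind.SYSTEM_WORKER_EXIT, "partition-exchanger", new RuntimeException("simulated worker exit"));

        System.out.println(oom.kind + " -> " + policy.onFailure(oom));
        System.out.println(exchange.workerName + " -> " + policy.onFailure(exchange));
    }
}
```

Because the context carries the failure kind, the policy can leave the node running on an out-of-memory condition while still stopping it when the exchange worker exits, which is the distinction argued for above.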