From: Anmol Rattan
Date: Fri, 16 Sep 2016 14:07:04 +0100
Subject: Re: One failing node stalling the whole cluster
To: user@ignite.apache.org

That is a known issue, at least in 1.6, and I am not sure a fix for it is even in 1.7. As for GC pauses, if you actually have any, it is worth considering JVM tuning and looking at the allocation and promotion rates.
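
One way to check whether GC pauses are actually occurring is to sample the JVM's own collector counters. The sketch below is a minimal, standalone illustration using the standard `java.lang.management` API. The class name `GcPauseCheck` and the five-second window are assumptions made for the example and have nothing to do with Ignite itself. It reports only cumulative collection counts and elapsed collection time (which, for concurrent collectors, is not all stop-the-world pause), so allocation and promotion rates would still have to come from GC logs or a profiler.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: samples the JVM's cumulative GC counters twice
// and reports how much collection activity happened in between.
class GcPauseCheck {

    // Total number of collections across all collectors.
    // A collector reports -1 when the value is undefined; those are skipped.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds across all collectors.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long windowMillis = 5_000; // assumed sampling window
        long countBefore = totalGcCount();
        long timeBefore = totalGcTimeMillis();

        Thread.sleep(windowMillis);

        long collections = totalGcCount() - countBefore;
        long gcMillis = totalGcTimeMillis() - timeBefore;
        System.out.printf("%d collections, %d ms in GC over a %d ms window%n",
                collections, gcMillis, windowMillis);
    }
}
```

Run on its own, `main` only measures its own JVM. To say anything about a grid node, the two static methods would have to be called from inside that node's process, or the same beans read remotely over JMX.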

In our case, we had to increase the young generation to 8 GB to cope.
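
After resizing the young generation, it can help to confirm what the JVM actually ended up with. The sketch below lists the heap memory pools and their maximum sizes through the standard `java.lang.management` API. The class name `HeapPoolSizes` is made up for the example. The `-Xmn8g` option in the comment is the HotSpot flag for setting the young generation size. Which flag was actually used here is not stated, so treat it as an assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: describes the heap memory pools of the current JVM.
// Assumed JVM option for an 8 GB young generation on HotSpot: -Xmn8g
class HeapPoolSizes {

    // Returns one line per heap pool, e.g. "<pool name>: max=<n> MB".
    // Pool names depend on the collector in use, so none are hard-coded here.
    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            long max = usage.getMax(); // -1 when the maximum is undefined
            String size = max < 0 ? "undefined" : (max / (1024 * 1024)) + " MB";
            lines.add(pool.getName() + ": max=" + size);
        }
        return lines;
    }

    public static void main(String[] args) {
        describeHeapPools().forEach(System.out::println);
    }
}
```

Some collectors report an undefined maximum for individual pools, in which case GC logs are the more reliable place to read the young generation size.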

However, a slow client will definitely hang the whole grid, even when there is no GC pause. A chicken-and-egg problem results. If you increase the timeout, the grid hangs for longer.

If you reduce the timeout, clients and nodes will leave the grid early and may even become segmented. Segmentation policy handling via starting the Ignite bean only works if you start the process with the Ignite script; if the process was started some other way, for example from a custom script, it is not supported.
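
The timeout trade-off described above can be made concrete with a toy model. This is not Ignite code and uses no Ignite API. It simply assumes that the grid waits on an unresponsive node until the node either answers or is declared failed when the timeout fires. The class `TimeoutTradeoff`, its method names, and the 20-second pause are all hypothetical.

```java
// Toy model (not Ignite code): how long the grid stalls and whether a node
// is dropped, given a detection timeout and the length of the node's pause.
class TimeoutTradeoff {

    // The grid is stalled until the node answers or the timeout fires,
    // whichever comes first.
    static long stallMillis(long timeoutMillis, long pauseMillis) {
        return Math.min(timeoutMillis, pauseMillis);
    }

    // The node is dropped (and may end up segmented) when its pause
    // outlasts the timeout.
    static boolean nodeDropped(long timeoutMillis, long pauseMillis) {
        return pauseMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long pauseMillis = 20_000; // assumed pause of a slow client or node
        for (long timeoutMillis : new long[] {5_000, 10_000, 30_000}) {
            System.out.printf("timeout=%d ms -> stall=%d ms, dropped=%b%n",
                    timeoutMillis,
                    stallMillis(timeoutMillis, pauseMillis),
                    nodeDropped(timeoutMillis, pauseMillis));
        }
        // prints:
        // timeout=5000 ms -> stall=5000 ms, dropped=true
        // timeout=10000 ms -> stall=10000 ms, dropped=true
        // timeout=30000 ms -> stall=20000 ms, dropped=false
    }
}
```

In this model a short timeout keeps the stall short but drops the node, while a long timeout keeps the node in the grid at the cost of a stall as long as the pause itself.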

Thanks & Regards
Anmol Rattan
+91 9538901262


On Fri, Sep 16, 2016 at 10:44 AM, yfernando <yohan.fernando@tudor.com> wrote:
> Hi Denis,
>
> We have been able to reproduce this situation where a node failure freezes
> the entire grid.
>
> Please find the full thread dumps of the 5 nodes that are locked up.
>
> The memoryMode of the caches are configured to be OFFHEAP_TIERED
> The cacheMode is PARTITIONED
> The atomicityMode is TRANSACTIONAL
>
> We have also seen ALL the clients freeze during a FULL GC occurring on ANY
> single node.
>
> Please let us know if you require any more information.
>
> grid-tp1-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11220-201609141523318.txt>
> grid-tp1-dev-11223-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp1-dev-11223-201609141523318.txt>
> grid-tp3-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11220-201609141523318.txt>
> grid-tp3-dev-11221-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp3-dev-11221-201609141523318.txt>
> grid-tp4-dev-11220-201609141523318.txt
> <http://apache-ignite-users.70518.x6.nabble.com/file/n7791/grid-tp4-dev-11220-201609141523318.txt>
>
> --
> View this message in context: http://apache-ignite-users.70518.x6.nabble.com/One-failing-node-stalling-the-whole-cluster-tp5372p7791.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.
