Date: Sat, 28 Nov 2015 15:37:54 +0300
From: Yakov Zhdanov
To: dev@ignite.apache.org
Subject: Communication exception handling

Guys,

I see the following code
(org/apache/ignite/internal/processors/cache/distributed/dht/GridDhtTxPrepareFuture.java:1129):

    try {
        cctx.io().send(n, req, tx.ioPolicy());
    } catch (ClusterTopologyCheckedException e) {
        fut.onNodeLeft(e);
    } catch (IgniteCheckedException e) {
        if (!cctx.kernalContext().isStopping())
            fut.onResult(e);
    }

This means that if a node has just started its stop procedure, all cache
operations may potentially hang. If cache.put() is called from a job and the
node is stopping gracefully, the stop process hangs with 100% probability.

This issue does not affect failure detection or node crash cases, since those
are handled by separate logic.

I fixed the Communication SPI to use its internal stopping flag instead of the
system-wide one, and this seems to fix the issue with graceful stop.

Semyon, can you please check whether this may cause any other issues of this
kind?

My changes are here - https://github.com/apache/ignite/pull/278

--Yakov
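P.S. To make the hang easier to see, here is a minimal standalone sketch of the pattern in the snippet quoted above. It is not Ignite code: StoppingFlagSketch, SendFailedException, send() and sendSwallowing() are hypothetical names, and a plain CompletableFuture stands in for the prepare future. It assumes only what the snippet shows: when the stopping flag is already set, the send failure is dropped and nothing completes the future, so anyone waiting on it would block.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

public class StoppingFlagSketch {
    /** Hypothetical stand-in for a failed send; not an Ignite class. */
    static class SendFailedException extends Exception {
        SendFailedException(String msg) {
            super(msg);
        }
    }

    /** Hypothetical sender that always fails, as a send might on a stopping node. */
    static void send(String msg) throws SendFailedException {
        throw new SendFailedException("cannot send: " + msg);
    }

    /**
     * Mirrors the shape of the quoted catch block: the failure is reported
     * to the future only when the stopping flag is NOT set.
     */
    static CompletableFuture<Void> sendSwallowing(AtomicBoolean stopping) {
        CompletableFuture<Void> fut = new CompletableFuture<>();

        try {
            send("prepare");
        } catch (SendFailedException e) {
            if (!stopping.get())
                fut.completeExceptionally(e);
            // Otherwise the exception is dropped and fut stays pending.
        }

        return fut;
    }

    public static void main(String[] args) {
        AtomicBoolean stopping = new AtomicBoolean(true);

        // Flag set: the failure is swallowed, so the future is never completed.
        System.out.println(sendSwallowing(stopping).isDone()); // prints "false"

        stopping.set(false);

        // Flag clear: the failure reaches the future and a waiter is released.
        System.out.println(sendSwallowing(stopping).isCompletedExceptionally()); // prints "true"
    }
}
```

The sketch only shows why it matters which flag is consulted and when it is set. It does not reproduce the change in the pull request.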