Mailing-List: contact dev-help@ignite.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ignite.apache.org
MIME-Version: 1.0
Date: Sat, 28 Nov 2015 15:37:54 +0300
Message-ID: 
 <CAGcMBHjbS+kr8g_EA+TO_dDMBCvztCSTw4=Pn+NPcZd4ocLj7w@mail.gmail.com>
Subject: Communication exception handling
From: Yakov Zhdanov <yzhdanov@apache.org>
To: dev@ignite.apache.org
Content-Type: multipart/alternative; boundary=001a11401ecabd5ab90525991370

--001a11401ecabd5ab90525991370
Content-Type: text/plain; charset=UTF-8

Guys,

I see the following code
(org/apache/ignite/internal/processors/cache/distributed/dht/GridDhtTxPrepareFuture.java:1129):

    try {
        cctx.io().send(n, req, tx.ioPolicy());
    }
    catch (ClusterTopologyCheckedException e) {
        fut.onNodeLeft(e);
    }
    catch (IgniteCheckedException e) {
        if (!cctx.kernalContext().isStopping())
            fut.onResult(e);
    }

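This means that if a node has just started its stop procedure, all cache
operations may potentially hang: when isStopping() returns true the send
failure is swallowed, fut.onResult(e) is never called, and the future is
left incomplete. If cache.put() is called from a job while the node is
stopping gracefully, the stop process hangs with 100% probability.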
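
To illustrate the hang, here is a minimal sketch that mirrors only the shape
of the catch block above. It uses java.util.concurrent.CompletableFuture in
place of the Ignite futures; the class SwallowedSendFailure and the methods
send() and prepare() are hypothetical names made up for this example, not
Ignite code.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the prepare logic; these are not the Ignite classes. */
final class SwallowedSendFailure {
    /** Hypothetical stand-in for cctx.io().send(...): always fails in this sketch. */
    static void send() throws Exception {
        throw new Exception("Failed to send message (node is stopping)");
    }

    /**
     * Mirrors the quoted catch block: the failure is passed to the future
     * only when the stopping flag is NOT set.
     */
    static CompletableFuture<Void> prepare(boolean stopping) {
        CompletableFuture<Void> fut = new CompletableFuture<>();

        try {
            send();
        }
        catch (Exception e) {
            if (!stopping)
                fut.completeExceptionally(e);
            // Otherwise the exception is swallowed and nothing completes the future.
        }

        return fut;
    }

    public static void main(String[] args) {
        // Failure is reported, so a caller waiting on the future gets an error.
        System.out.println("not stopping, done = " + prepare(false).isDone());

        // Failure is swallowed, so a caller waiting on the future blocks forever.
        System.out.println("stopping, done = " + prepare(true).isDone());
    }
}
```

In the second case anything that waits on the returned future without a
timeout never returns, which is the behavior a job calling cache.put() would
see during graceful stop.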

This issue does not affect failure detection or node crash cases, since
those are handled by separate logic.

I changed the Communication SPI to use its internal stopping flag instead of
the system-wide one, and this seems to fix the issue with graceful stop.
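
As a rough sketch of the idea only (one possible reading, not the actual
change in the pull request): a component that checks its own flag keeps
accepting sends while the node as a whole is already marked as stopping, and
refuses them only after the component itself has been stopped. All names
below (LocalStoppingFlag, nodeStopping, Comm, trySend) are hypothetical and
are not the Communication SPI API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; names are illustrative, not the Communication SPI API. */
final class LocalStoppingFlag {
    /** Stand-in for the system-wide "node is stopping" flag. */
    static final AtomicBoolean nodeStopping = new AtomicBoolean();

    /** Stand-in for a communication component that owns its stopping flag. */
    static final class Comm {
        /** Set only when this component itself is stopped. */
        private final AtomicBoolean stopping = new AtomicBoolean();

        /** Returns true if the send is accepted, false once this component has stopped. */
        boolean trySend() {
            return !stopping.get();
        }

        /** Stops this component; later sends are refused. */
        void stop() {
            stopping.set(true);
        }
    }

    public static void main(String[] args) {
        Comm comm = new Comm();

        // The node has begun its stop procedure, but the component is still running.
        nodeStopping.set(true);
        System.out.println("send accepted while node is stopping: " + comm.trySend());

        // Only after the component itself is stopped are sends refused.
        comm.stop();
        System.out.println("send accepted after component stop: " + comm.trySend());
    }
}
```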

Semyon, can you please check whether this may cause any other issues of this kind?

My changes are here: https://github.com/apache/ignite/pull/278

--Yakov

--001a11401ecabd5ab90525991370--