ignite-dev mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Hopeless looping in TcpCommunicationSpi
Date Thu, 31 Aug 2017 16:04:33 GMT
Hello Igniters,

In the past two weeks I have stumbled three times on looping behavior
in TcpCommunicationSpi.reserveClient(): while (true) {}

One of them, for example, involved differing SSL certificates on two nodes,
which led to successful discovery followed by ever-failing communication
(which I fixed). The general problem is that a malfunctioning node will never
abandon its attempts to connect, and the rest of the cluster will wait forever
for partition map exchange.

Any persistent exception in TcpCommunicationSpi.createTcpClient() will
cause the whole cluster to hang. In degenerate cases the log fills with
megabytes of:

[2017-08-31 18:28:20,787][INFO ][grid-nio-worker-tcp-comm-0-#26%server1%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:45010, rmtAddr=/127.0.0.1:33002]
[2017-08-31 18:28:20,988][INFO ][grid-nio-worker-tcp-comm-1-#27%server1%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:45010, rmtAddr=/127.0.0.1:33004]
[2017-08-31 18:28:21,188][INFO ][grid-nio-worker-tcp-comm-0-#26%server1%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:45010, rmtAddr=/127.0.0.1:33006]

This causes a lot of trouble, so I propose limiting reserveClient() to
several attempts, after which the last exception should be thrown and the
node should leave the cluster for good.
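
To make the proposal concrete, here is a rough sketch of the bounded-retry
idea in isolation. It is not the actual TcpCommunicationSpi code: the class
BoundedReserve, the method reserveWithLimit() and the maxAttempts parameter
are hypothetical names, and the Callable stands in for whatever
createTcpClient() does on each attempt.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Illustration only: hypothetical helper, not part of Ignite. */
final class BoundedReserve {
    /**
     * Calls {@code attempt} up to {@code maxAttempts} times (assumed parameter).
     * Returns the first successful result; if every attempt fails,
     * rethrows the exception from the last attempt instead of looping forever.
     */
    static <T> T reserveWithLimit(Callable<T> attempt, int maxAttempts) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");

        Exception last = null;

        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.call();
            }
            catch (Exception e) {
                last = e; // Remember the failure and try again.
            }
        }

        // All attempts failed: surface the last exception to the caller,
        // which can then decide to stop the node.
        throw last;
    }

    public static void main(String[] args) {
        try {
            // Stand-in for a connection attempt that always fails.
            reserveWithLimit(() -> { throw new IOException("handshake failed"); }, 3);
        }
        catch (Exception e) {
            System.out.println("Giving up: " + e.getMessage());
        }
    }
}
```

In the real SPI the caller would presumably react to the final exception by
stopping the local node, but how exactly that should be wired in is the open
question here.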

What do you think?

-- 
Ilya Kasnacheev
