ignite-issues mailing list archives

From "Alexey Goncharuk (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-11394) Infinite No next node in topology messages during node restart scenario
Date Fri, 22 Feb 2019 14:28:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-11394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Goncharuk updated IGNITE-11394:
--------------------------------------
    Description: 
I observe the following sequence of events during a cyclic restart of nodes:
 - A node that is joining the cluster sends a join request, receives NodeAddedMessage, and
awaits NodeAddFinishedMessage
 - The node receives a metrics update message, which stays in the queue
 - The whole cluster is restarted and a new ring is formed
 - The node re-sends the join request, which is successfully processed by the ring
 - The joining node receives the node added message
 - The node detects that it cannot send messages (the set of failed nodes contains all remote
nodes of the ring)
 - Since there was already a metrics update message in the queue, the node attempts to re-add
the message to the queue. Because the metrics update message is a high-priority message, it
is added to the head of the queue, so the node picks it up again immediately and gets stuck
in an infinite loop

I suggest dropping the metrics update message in {{sendMessageAcrossRing}} when we hit the
{{No next node in topology}} situation.
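
The loop and the suggested fix can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ignite's actual code: the class {{MetricsRequeueSketch}}, the {{Kind}} enum and the {{run}} method are hypothetical names, and the real worker queue and {{sendMessageAcrossRing}} logic are considerably more involved. The model assumes the node never finds a next node, that metrics updates are re-added to the head of the queue, and that other messages are simply consumed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the message worker queue. None of these names are Ignite APIs. */
class MetricsRequeueSketch {
    /** Hypothetical message kinds; only the metrics update is treated as high priority. */
    enum Kind { METRICS_UPDATE, OTHER }

    /**
     * Simulates the worker loop while there is no next node in the topology.
     *
     * @param dropMetricsOnNoNextNode if true, apply the suggested fix and drop the metrics update.
     * @param maxIters safety cap so the simulation terminates even when the loop is infinite.
     * @return number of iterations performed; equal to maxIters if the queue never drained.
     */
    static int run(boolean dropMetricsOnNoNextNode, int maxIters) {
        Deque<Kind> queue = new ArrayDeque<>();
        queue.addLast(Kind.METRICS_UPDATE); // already queued before the cluster restart
        queue.addLast(Kind.OTHER);          // any lower-priority message behind it

        int iters = 0;
        while (!queue.isEmpty() && iters < maxIters) {
            iters++;
            Kind msg = queue.pollFirst();

            // Assumption: every send attempt ends in "No next node in topology".
            if (msg == Kind.METRICS_UPDATE) {
                if (dropMetricsOnNoNextNode)
                    continue; // suggested behavior: drop the message

                // Observed behavior: a high-priority message goes back to the head,
                // so the very next poll returns the same message again.
                queue.addFirst(msg);
            }
            // OTHER messages are simply consumed in this toy model.
        }
        return iters;
    }

    public static void main(String[] args) {
        System.out.println("re-add to head: " + run(false, 1000) + " iterations");
        System.out.println("drop on no next node: " + run(true, 1000) + " iterations");
    }
}
```

With the re-add behavior the simulation only stops at the iteration cap, and the message behind the metrics update is never reached. With the drop behavior the queue drains in two iterations.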

Another question is why we don't pass the collection of failed nodes to the {{ring.hasRemoteNodes()}}
method.
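
To make the question concrete, here is a hedged sketch of what a failed-nodes-aware check could look like. {{RingSketch}} is a hypothetical stand-in and does not reproduce the real ring class; the one-argument overload of {{hasRemoteNodes}} is an assumption made for illustration, not an existing method.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.UUID;

/** Hypothetical stand-in for the discovery ring; not the actual Ignite class. */
class RingSketch {
    private final UUID locNodeId;
    private final List<UUID> nodeIds;

    RingSketch(UUID locNodeId, Collection<UUID> nodeIds) {
        this.locNodeId = locNodeId;
        this.nodeIds = new ArrayList<>(nodeIds);
    }

    /** True if the ring contains any node other than the local one. */
    boolean hasRemoteNodes() {
        for (UUID id : nodeIds) {
            if (!id.equals(locNodeId))
                return true;
        }
        return false;
    }

    /**
     * Hypothetical overload: true only if at least one remote node is not in the failed set,
     * that is, if there is still a node that a message could be sent to.
     */
    boolean hasRemoteNodes(Collection<UUID> failedNodeIds) {
        for (UUID id : nodeIds) {
            if (!id.equals(locNodeId) && !failedNodeIds.contains(id))
                return true;
        }
        return false;
    }
}
```

In the scenario described above, the no-argument check would still report remote nodes, while the failed-nodes-aware check would report none, because every remote node in the ring is in the failed set.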

> Infinite No next node in topology messages during node restart scenario
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-11394
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11394
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Alexey Goncharuk
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
