ignite-issues mailing list archives

From "Alexey Goncharuk (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (IGNITE-11555) Unable to await partitions release latch caused by coordinator failover
Date Sun, 17 Mar 2019 20:18:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Goncharuk updated IGNITE-11555:
--------------------------------------
    Description: 
Currently, exchange latches (both server and client) are deleted as soon as the latch is completed.
This leads to a hang in the following scenario:
1) A grid with several nodes starts an exchange latch sync
2) All nodes send acks to the coordinator
3) The coordinator processes the acks and sends final acks to some of the nodes
4) These nodes process the final acks, then complete and delete their client latches
5) The coordinator fails
6) Nodes which did not receive final acks re-send their acks to the new coordinator
7) Since the new coordinator already completed and deleted its client latch, it does not process
the re-sent acks correctly and only adds them to the pending messages.

It looks like the root cause of this issue is latch deletion on final ack. We can safely delete
the latch only when all nodes are guaranteed to have processed the messages. Luckily, since the
latch is tied to the exchange process, we can safely delete the latch when the corresponding
exchange completes.
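
The scenario and the proposed fix can be illustrated with a small, self-contained model.
This is only a sketch under stated assumptions, not Ignite's actual code: the class names
(LatchNode, LatchFailoverDemo), the method names and the "exchange-1" / "node-C" ids are all
hypothetical, and the real latch manager is considerably more involved. The flag
deleteOnFinalAck switches between the behaviour described above and deletion deferred until
the exchange completes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of one node's latch bookkeeping (not an Ignite class). */
class LatchNode {
    /** If true, a latch is forgotten as soon as the final ack arrives (behaviour in the report). */
    private final boolean deleteOnFinalAck;

    /** Latch id -> completed flag, for every latch this node still remembers. */
    private final Map<String, Boolean> latches = new HashMap<>();

    /** Acks that could not be matched to a known latch: latch id -> sender ids. */
    private final Map<String, Set<String>> pendingAcks = new HashMap<>();

    LatchNode(boolean deleteOnFinalAck) {
        this.deleteOnFinalAck = deleteOnFinalAck;
    }

    /** Step 1: the node joins the latch sync for an exchange. */
    void startLatch(String id) {
        latches.put(id, false);
    }

    /** Step 4: the node receives the final ack from the coordinator. */
    void onFinalAck(String id) {
        if (deleteOnFinalAck)
            latches.remove(id);      // latch is gone, later acks cannot be matched
        else
            latches.put(id, true);   // keep the completed latch around
    }

    /**
     * Steps 6-7: this node is now the coordinator and receives a re-sent ack.
     * Returns true if it can answer with a final ack, false if the ack is only parked.
     */
    boolean onResentAck(String id, String fromNode) {
        Boolean completed = latches.get(id);
        if (completed == null) {
            pendingAcks.computeIfAbsent(id, k -> new HashSet<>()).add(fromNode);
            return false;            // sender is never released
        }
        return completed;
    }

    /** Proposed fix: forget the latch only when the corresponding exchange completes. */
    void onExchangeDone(String id) {
        latches.remove(id);
        pendingAcks.remove(id);
    }

    boolean knowsLatch(String id) {
        return latches.containsKey(id);
    }
}

class LatchFailoverDemo {
    /** Replays steps 1-7 and reports whether the lagging node is released. */
    static boolean laggingNodeReleased(boolean deleteOnFinalAck) {
        LatchNode newCoordinator = new LatchNode(deleteOnFinalAck);
        newCoordinator.startLatch("exchange-1");   // step 1
        newCoordinator.onFinalAck("exchange-1");   // steps 3-4: it got the final ack
        // Step 5: the old coordinator fails and this node takes over.
        return newCoordinator.onResentAck("exchange-1", "node-C"); // steps 6-7
    }

    public static void main(String[] args) {
        System.out.println("delete on final ack:     released=" + laggingNodeReleased(true));
        System.out.println("delete on exchange done: released=" + laggingNodeReleased(false));
    }
}
```

In this model the first run reports released=false (the re-sent ack is only added to the
pending set), while the second reports released=true because the completed latch is still
known when the re-sent ack arrives.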

> Unable to await partitions release latch caused by coordinator failover
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-11555
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11555
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexey Goncharuk
>            Priority: Critical
>             Fix For: 2.8



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)