ignite-issues mailing list archives

From "Andrew Mashenkov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-6256) When a node becomes segmented an AssertionError is thrown during GridDhtPartitionTopologyImpl.removeNode
Date Wed, 06 Sep 2017 09:46:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155099#comment-16155099 ]

Andrew Mashenkov commented on IGNITE-6256:
------------------------------------------

It seems this bug was introduced by IGNITE-4779.

> When a node becomes segmented an AssertionError is thrown during GridDhtPartitionTopologyImpl.removeNode
> --------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-6256
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6256
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.8
>            Reporter: Alexandr Fedotov
>            Assignee: Andrew Mashenkov
>             Fix For: 2.3
>
>
> The assert is as follows:
> exception="java.lang.AssertionError: null
>  at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.removeNode(GridDhtPartitionTopologyImpl.java:1422)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl.beforeExchange(GridDhtPartitionTopologyImpl.java:490)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.distributedExchange(GridDhtPartitionsExchangeFuture.java:769)
>  at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:504)
>  at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1689)
>  at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>  at java.lang.Thread.run(Thread.java:745)
> Below is the sequence of steps that leads to the assertion error:
> 1) A node becomes SEGMENTED when SegmentCheckWorker determines this, after an EVT_NODE_FAILED has been received.
> 2) It gets visibleRemoteNodes from its TcpDiscoveryNodesRing
> 3) It clears the TcpDiscoveryNodesRing, leaving only itself on the list. The node ring is used to determine if a node is alive during DiscoCache creation
> 4) After that, the node initiates removal of all the nodes read in step 2
> 5) For each node, it sends an EVT_NODE_FAILED to the corresponding DiscoverySpiListener,
> providing a topology containing all the nodes except those already processed
> 6) This event gets into GridDiscoveryManager 
> 7) The node gets removed from alive nodes for every DiscoCache in discoCacheHist
> 8) Topology change is detected
> 9) Creation of a new DiscoCache is attempted. At this moment no remote node is available because the TcpDiscoveryNodesRing has been cleared, which results in a DiscoCache with empty alives
> 10) The event with the created DiscoCache and the new topology version is passed to DiscoveryWorker
> 11) The event is eventually handled by DiscoveryWorker and is recorded by DiscoveryWorker#recordEvent
> 12) The recording is handled by GridEventStorageManager which notifies every listener for this event type (EVT_NODE_FAILED)
> 13) One of the listeners is GridCachePartitionExchangeManager#discoLsnr
> It creates a new GridDhtPartitionsExchangeFuture with the empty DiscoCache received with the event and enqueues it
> 14) The future gets eventually handled by GridDhtPartitionsExchangeFuture and initialized
> 15) updateTopologies is called, which for each GridCacheContext gets its topology (GridDhtPartitionTopology)
> and calls GridDhtPartitionTopology#updateTopologyVersion
> 16) The DiscoCache for GridDhtPartitionTopology is assigned from the one held by the GridDhtPartitionsExchangeFuture.
> The assigned DiscoCache has empty alives at the moment
> 17) A distributed exchange is handled (GridDhtPartitionsExchangeFuture#distributedExchange)
> 18) For each cache context GridCacheContext, for its topology (GridDhtPartitionTopologyImpl) GridDhtPartitionTopologyImpl#beforeExchange is called
> 19) The fact that the node has left is determined and GridDhtPartitionTopologyImpl#removeNode is called to handle it
> 20) An attempt is made to get the alive coordinator node by calling DiscoCache#oldestAliveServerNode
> 21) null is returned, which results in an AssertionError
> The fix should probably prevent initiating exchange futures if the node has been segmented.
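
The failure described in the steps above, together with the guard suggested as a fix, can be sketched in a simplified Java model. This is not Ignite code: SegmentationSketch, DiscoCacheModel and onNodeFailed are hypothetical names, and node orders stand in for real cluster nodes. The only behavior taken from the report is that looking up the oldest alive server node returns null when the alives are empty, and that removeNode fails on that null.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.TreeSet;

public class SegmentationSketch {
    /** Hypothetical stand-in for DiscoCache: tracks the orders of alive server nodes. */
    static final class DiscoCacheModel {
        private final TreeSet<Long> aliveOrders;

        DiscoCacheModel(Collection<Long> aliveOrders) {
            this.aliveOrders = new TreeSet<>(aliveOrders);
        }

        /** Returns the lowest order among alive servers, or null when none are alive. */
        Long oldestAliveServerNode() {
            return aliveOrders.isEmpty() ? null : aliveOrders.first();
        }
    }

    /** Models the failing check: removing a node needs an alive coordinator. */
    static long removeNode(DiscoCacheModel cache) {
        Long oldest = cache.oldestAliveServerNode();
        // Explicit throw instead of `assert`, so the failure shows without -ea.
        if (oldest == null)
            throw new AssertionError("no alive server node");
        return oldest;
    }

    /** Hypothetical guard: skip the exchange once the local node is segmented. */
    static boolean onNodeFailed(boolean localSegmented, DiscoCacheModel cache) {
        if (localSegmented)
            return false; // No exchange future is created.
        removeNode(cache);
        return true;
    }

    public static void main(String[] args) {
        // After the node ring is cleared, the cache is built with empty alives.
        DiscoCacheModel empty = new DiscoCacheModel(Collections.emptyList());
        try {
            removeNode(empty);
        } catch (AssertionError e) {
            System.out.println("unguarded: AssertionError");
        }
        System.out.println("guarded: exchange started = " + onNodeFailed(true, empty));
    }
}
```

In this model the guard is placed where the failure event is received, so the empty DiscoCache never reaches removeNode. Where the real check belongs in Ignite is left open by the report.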



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
