ignite-issues mailing list archives

From "Ilya Lantukh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-10058) resetLostPartitions() leaves an additional copy of a partition in the cluster
Date Thu, 06 Dec 2018 15:39:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711600#comment-16711600 ]

Ilya Lantukh commented on IGNITE-10058:


Thanks for your efforts and willingness to solve this problem!

Unfortunately, our current implementation of the partition loss mechanics has a number of complex
flaws that result in strange behavior. To solve them we must re-work, re-design and
improve this particular mechanism; adding hacks to other pieces of code will just make things
worse.

To solve this particular issue, I suggest the following:
1. Deprecate PartitionLossPolicy.READ_WRITE_ALL. If we assume that it's possible to modify
data in LOST partitions, we should prepare for very weird scenarios that are impossible to
solve with the current architecture.
2. Modify GridDhtPartitionTopologyImpl.resetLostPartitions(...) - it should reset update counters
to 0 only if, at the moment the method was called, there was at least one partition owner.
Also, add special logic for the case when all LOST partitions already have update counter
0: transfer state to OWNING only on affinity nodes.
3. Ensure that a resetLostPartitions(...) call always leads to rebalance, and that afterwards
all non-affinity nodes evict their partition instances.
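
The counter-reset rule proposed in step 2 can be sketched as a small, self-contained model. All names below (ResetLostSketch, Copy, etc.) are illustrative assumptions and do not reflect the actual GridDhtPartitionTopologyImpl internals:

```java
import java.util.*;

// Illustrative model only: one Copy is one node's instance of a partition.
public class ResetLostSketch {
    public enum State { OWNING, LOST, EVICTED }

    public static class Copy {
        public State state;
        public long updateCounter;
        public boolean affinityNode; // is this node in the partition's affinity?

        public Copy(State state, long updateCounter, boolean affinityNode) {
            this.state = state;
            this.updateCounter = updateCounter;
            this.affinityNode = affinityNode;
        }
    }

    // Proposed rule: reset update counters to 0 only when at least one OWNING
    // copy exists; when all LOST copies already have counter 0, transfer to
    // OWNING only on affinity nodes and mark the rest for eviction.
    public static void resetLostPartitions(List<Copy> copies) {
        boolean hasOwner = false, allLostZero = true;
        for (Copy c : copies) {
            if (c.state == State.OWNING) hasOwner = true;
            if (c.state == State.LOST && c.updateCounter != 0) allLostZero = false;
        }

        for (Copy c : copies) {
            if (c.state != State.LOST) continue;
            if (hasOwner) c.updateCounter = 0;
            if (hasOwner || allLostZero)
                c.state = c.affinityNode ? State.OWNING : State.EVICTED;
            else
                c.state = State.OWNING; // step 3: rebalance then reconciles copies
        }
    }

    public static void main(String[] args) {
        List<Copy> copies = new ArrayList<>(Arrays.asList(
            new Copy(State.LOST, 0, true),    // affinity node keeps the partition
            new Copy(State.LOST, 0, false))); // extra copy gets evicted
        resetLostPartitions(copies);
        for (Copy c : copies) System.out.println(c.state);
    }
}
```

With this rule, the extra non-affinity copy from the reported scenario ends up EVICTED instead of OWNING.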

> resetLostPartitions() leaves an additional copy of a partition in the cluster
> -----------------------------------------------------------------------------
>                 Key: IGNITE-10058
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10058
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Stanislav Lukyanov
>            Assignee: Pavel Pereslegin
>            Priority: Major
>             Fix For: 2.8
> If there are several copies of a LOST partition, resetLostPartitions() will leave all
> of them in the cluster as OWNING.
> Scenario:
> 1) Start 4 nodes, a cache with backups=0 and READ_WRITE_SAFE, fill the cache
> 2) Stop one node - some partitions are recreated on the remaining nodes as LOST
> 3) Start one node - the LOST partitions are rebalanced to the new node from the
> existing ones
> 4) Wait for rebalance to complete
> 5) Call resetLostPartitions()
> After that, the partitions that were LOST become OWNING on all nodes that had them. Eviction
> of these partitions doesn't start.
> Need to correctly evict additional copies of LOST partitions, either after rebalance on
> step 4 or after the resetLostPartitions() call on step 5.
> Current resetLostPartitions() implementation does call checkEvictions(), but the ready
> affinity assignment contains several nodes per partition for some reason.
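
The timeline in the quoted scenario can be condensed into a toy model. This is not Ignite code and the node names are invented; it only shows why, with backups=0, the current behavior leaves two OWNING copies of one partition:

```java
import java.util.*;

// Toy model of the reported scenario: track one partition's copies per node.
public class LostPartitionScenario {
    public enum State { OWNING, LOST }

    public static Map<String, State> runScenario() {
        Map<String, State> copies = new LinkedHashMap<>();
        copies.put("nodeA", State.OWNING);   // sole owner, since backups=0

        // Step 2: nodeA stops; the partition is recreated as LOST on a survivor.
        copies.remove("nodeA");
        copies.put("nodeB", State.LOST);

        // Steps 3-4: a new node joins and the LOST partition is rebalanced to it.
        copies.put("nodeC", State.LOST);

        // Step 5: current resetLostPartitions() behavior - every LOST copy
        // becomes OWNING, and no eviction is triggered afterwards.
        copies.replaceAll((node, s) -> s == State.LOST ? State.OWNING : s);

        return copies;
    }

    public static void main(String[] args) {
        // Two OWNING copies remain for a cache configured with backups=0.
        System.out.println(runScenario());
    }
}
```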

This message was sent by Atlassian JIRA
