ignite-issues mailing list archives

From "Alexei Scherbakov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.
Date Fri, 17 May 2019 10:58:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842091#comment-16842091 ]

Alexei Scherbakov commented on IGNITE-10078:
--------------------------------------------

Contribution seems to be ready for merging.

> Node failure during concurrent partition updates may cause partition desync between primary and backup.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-10078
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10078
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexei Scherbakov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>             Fix For: 2.8
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This is possible if some updates are not written to the WAL before a node fails. Rebalancing will not apply them, because the partition counters are the same, in the following scenario:
> 1. Start a grid with 3 nodes and 2 backups.
> 2. Preload some data into partition P.
> 3. Start two concurrent transactions, each writing a single key to the same partition P; the keys are different:
> {noformat}
> try (Transaction tx = client.transactions().txStart(PESSIMISTIC, REPEATABLE_READ, 0, 1)) {
>       client.cache(DEFAULT_CACHE_NAME).put(k, v);
>       tx.commit();
> }
> {noformat}
> 4. Order the updates on a backup so that the update with the max partition counter is written to the WAL, while the update with the lesser partition counter fails because the failure handler (FH) is triggered before it is added to the WAL.
> 5. Return the failed node to the grid and observe that no rebalancing happens, because the partition counters are the same.
> Possible solution: detect gaps in update counters on recovery and, if any are detected, force a rebalance from a node without gaps.
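
The gap-detection idea in the possible solution can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual fix and not Ignite API: the class UpdateCounterGaps and its methods are hypothetical, and it assumes that update counters for a partition start at 1 and that every update takes the next consecutive value. It shows why comparing only the highest counter misses the lost update, while scanning for gaps catches it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of the Ignite API. */
final class UpdateCounterGaps {
    /**
     * Returns the counters missing below the highest applied counter.
     * Assumption: counters start at 1 and are consecutive per partition.
     */
    static List<Long> findGaps(Iterable<Long> appliedCounters) {
        TreeSet<Long> applied = new TreeSet<>();
        for (Long c : appliedCounters)
            applied.add(c);

        List<Long> gaps = new ArrayList<>();
        long expected = 1;

        // Walk the counters in ascending order and record every skipped value.
        for (long c : applied) {
            for (; expected < c; expected++)
                gaps.add(expected);
            expected = c + 1;
        }
        return gaps;
    }

    /** Highest applied counter, or 0 if nothing was applied. */
    static long highWaterMark(Iterable<Long> appliedCounters) {
        long max = 0;
        for (Long c : appliedCounters)
            max = Math.max(max, c);
        return max;
    }

    /**
     * Rebalance is needed if the backup lags behind the primary
     * or if it has a gap below its own highest counter.
     */
    static boolean needsRebalance(long primaryCounter, Iterable<Long> backupApplied) {
        return highWaterMark(backupApplied) < primaryCounter
            || !findGaps(backupApplied).isEmpty();
    }

    public static void main(String[] args) {
        // Steps 4 and 5: update 5 reached the WAL, update 4 was lost.
        List<Long> backup = List.of(1L, 2L, 3L, 5L);

        System.out.println("highest counter: " + highWaterMark(backup));
        System.out.println("gaps: " + findGaps(backup));
        System.out.println("needs rebalance: " + needsRebalance(5L, backup));
    }
}
```

In this sketch the backup's highest counter equals the primary's, so a comparison of highest counters alone would skip rebalancing; the gap check is what flags the partition.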



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
