ignite-dev mailing list archives

From "Alexei Scherbakov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.
Date Wed, 31 Oct 2018 07:26:00 GMT
Alexei Scherbakov created IGNITE-10078:
------------------------------------------

             Summary: Node failure during concurrent partition updates may cause partition desync between primary and backup.
                 Key: IGNITE-10078
                 URL: https://issues.apache.org/jira/browse/IGNITE-10078
             Project: Ignite
          Issue Type: Bug
            Reporter: Alexei Scherbakov
            Assignee: Alexei Scherbakov
             Fix For: 2.8


This can happen if some updates with lower partition counters are not written to the WAL
before the node fails.

Scenario:

1. Start a grid with 3 nodes and 2 backups.
2. Preload some data into partition P.
3. Start two concurrent transactions, each writing a single, different key to the same partition P:
{noformat}
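// Timeout 0 means no timeout; the last argument is the expected transaction size (1 entry).
try (Transaction tx = client.transactions().txStart(PESSIMISTIC, REPEATABLE_READ, 0, 1)) {
    // Each of the two transactions uses its own key k, both mapped to partition P.
    client.cache(DEFAULT_CACHE_NAME).put(k, v);

    tx.commit();
}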
{noformat}
4. Order the updates on a backup so that the update with the higher partition counter is
written to the WAL, while the update with the lower partition counter fails because the
failure handler (FH) is triggered before that update is added to the WAL.

5. Return the failed node to the grid and observe that no rebalancing occurs, because the
partition counters on the primary and the backup are equal.
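
The effect of steps 4 and 5 can be illustrated with a toy model. This is a plain Java sketch, not Ignite internals: the class and method names are hypothetical, and it assumes that the rejoin check compares only the highest update counter of each partition copy.

```java
import java.util.Collections;
import java.util.List;

final class CounterDesyncDemo {
    /**
     * Highest update counter among the applied updates. Stand-in for the single
     * per-partition counter compared on rejoin (an assumption of this sketch).
     * Throws NoSuchElementException for an empty list.
     */
    static long highestCounter(List<Long> appliedCounters) {
        return Collections.max(appliedCounters);
    }

    /** Naive check: the copies look in sync if their highest counters match. */
    static boolean looksInSync(List<Long> primary, List<Long> backup) {
        return highestCounter(primary) == highestCounter(backup);
    }
}
```

With a primary that applied counters 10, 11, 12 and a backup that recovered only 10 and 12, `looksInSync` returns true even though the update with counter 11 is missing on the backup.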

Possible solution: on recovery, detect gaps in the update counters and, if any are found,
force a rebalance from a node without gaps.
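
A minimal sketch of the gap detection idea, in plain Java. It is not the actual fix: the names `UpdateCounterGaps`, `findGaps` and `needsRebalance` are hypothetical, and it assumes the update counters recovered from the WAL for one partition are available as a collection, together with the counter value known to be fully applied before them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class UpdateCounterGaps {
    /**
     * Returns the counters missing between initialCounter (exclusive) and the
     * highest recovered counter (inclusive), in ascending order.
     */
    static List<Long> findGaps(long initialCounter, Iterable<Long> recoveredCounters) {
        // Sort and deduplicate; ignore counters at or below the known-applied point.
        TreeSet<Long> seen = new TreeSet<>();
        for (Long c : recoveredCounters) {
            if (c > initialCounter)
                seen.add(c);
        }

        List<Long> gaps = new ArrayList<>();
        long expected = initialCounter + 1;
        for (long c : seen) {
            // Every value between the expected counter and c was never recovered.
            for (long missing = expected; missing < c; missing++)
                gaps.add(missing);
            expected = c + 1;
        }
        return gaps;
    }

    /** A partition with at least one gap should be rebalanced from a node without gaps. */
    static boolean needsRebalance(long initialCounter, Iterable<Long> recoveredCounters) {
        return !findGaps(initialCounter, recoveredCounters).isEmpty();
    }
}
```

For the scenario above, if counter 10 was applied before the two transactions and only counter 12 reached the WAL, `findGaps(10, [12])` reports 11 as missing, so the partition would be scheduled for rebalancing even though its highest counter matches the primary.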




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
