ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anton Vinogradov ...@apache.org>
Subject Re: Read Repair (ex. Consistency Check) - review request #2
Date Thu, 04 Jul 2019 09:18:26 GMT
Folks,

Just a minor update.

RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests
left, started with 6k :)).
During the analisys, I've found some tests with
- unexpected repairs at tx caches
- inconsistent state after the test finished (different entries across the
topology)
For example,
- testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
versions on backups in case of retry, fixed [2]
- initial cache load generates not equal versions on backups, fixed [3]
- testAccountTxNodeRestart causes unexpected repairs (entries have
different versions), to be investigated.

What's next?

1) Going to merge the solution once "RunAll with ReadRepair enabled"
becomes fully green.
2) Going to add special check after each test which will ensure caches
content after the test is consistent.
2.1) The Same check can (should?) be injected to
awaitPartitionMapExchange() and similar methods.
3) Update Jepsen tests with RR checks.
4) Introduce per partition RR.

So, the final goal is to be sure that Ignite produces only consistent data
and to have a feature to solve consistency in case we gain inconsistent
state somehow.

Limitations?

Currently, RR has some limitations, but they are not related to real
production cases.
In case someone interested to support, for example, MVCC or near caches,
please, feel free to contribute.

[1]
https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
[2]
https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
[3]
https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13

On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av@apache.org> wrote:

> Slava,
>
> >> I will take a look at your pull request if you don't mind.
> Great news!
>
> >> In any way, could you please update the IEP page with the list of
> >> constraints/limitations of the proposed approach, TODOs, etc?
> Not sure we should keep this at IEP until list (#4 from original letter)
> is not confirmed.
>
> Currently, I'm checking RunAll with RR enabled to almost each get request.
> "Almost" means: readRepair = !ctx.readThrough() &&
> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
> For now I have 60 failed tests and amount decreasing.
>
> >> For instance, I would like to see all these limitations on the IEP page
> as
> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> ticket
> >> in order to track all activities related to `Read Repair` feature.
> Let's do this at merge day to be sure useless issues will not be created.
>
> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> slava.koptilin@gmail.com> wrote:
>
>> Hi Anton,
>>
>> I will take a look at your pull request if you don't mind.
>>
>> In any way, could you please update the IEP page with the list of
>> constraints/limitations of the proposed approach, TODOs, etc?
>> For instance, I would like to see all these limitations on the IEP page as
>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella ticket
>> in order to track all activities related to `Read Repair` feature.
>>
>> Thanks,
>> S.
>>
>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av@apache.org>:
>>
>> > Igniters,
>> > I'm glad to introduce Read Repair feature [0] provides additional
>> > consistency guarantee for Ignite.
>> >
>> > 1) Why we need it?
>> > The detailed explanation can be found at IEP-31 [1].
>> > In short, because of bugs, it's possible to gain an inconsistent state.
>> > We need additional features to handle this case.
>> >
>> > Currently we able to check cluster using Idle_verify [2] feature, but it
>> > will not fix the data, will not even tell which entries are broken.
>> > Read Repair is a feature to understand which entries are broken and to
>> fix
>> > them.
>> >
>> > 1) How it works?
>> > IgniteCache now able to provide special proxy [3] withReadRepair().
>> > This proxy guarantee that data will be gained from all owners and
>> compared.
>> > In the case of consistency violation situation, data will be recovered
>> and
>> > a special event recorded.
>> >
>> > 3) Naming?
>> > Feature name based on Cassandra's Read Repair feature [4], which is
>> pretty
>> > similar.
>> >
>> > 4) Limitations which can be fixed in the future?
>> >   * MVCC and Near caches are not supported.
>> >   * Atomic caches can be checked (false positive case is possible on
>> this
>> > check), but can't be recovered.
>> >   * Partial entry removal can't be recovered.
>> >   * Entries streamed using data streamer (using not a "cache.put" based
>> > updater) and loaded by cache.load
>> >   are perceived as inconsistent since they may have different versions
>> for
>> > same keys.
>> >   * Only explicit get operations are supported (getAndReplace,
>> getAndPut,
>> > etc can be supported in future).
>> >
>> > 5) What's left?
>> >   * SQL/ThinClient/etc support.
>> >   * Metrics (found/repaired).
>> >   * Simple per-partition recovery feature able to work in the
>> background in
>> > addition to per-entry recovery feature.
>> >
>> > 6) Is code checked?
>> >   * Pull Request #5656 [5] (feature) - has green TC.
>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
>> get()
>> > request) - has a limited amount of failures (because of data streamer,
>> > cache.load, etc).
>> >
>> > Thoughts?
>> >
>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
>> > [1]
>> >
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
>> > [2]
>> >
>> >
>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>> > [3]
>> >
>> >
>> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
>> > [4]
>> >
>> >
>> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
>> > [5] https://github.com/apache/ignite/pull/5656
>> > [6] https://github.com/apache/ignite/pull/6575
>> >
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message