ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Вячеслав Коптилин <slava.kopti...@gmail.com>
Subject Re: Read Repair (ex. Consistency Check) - review request #2
Date Wed, 10 Jul 2019 12:43:22 GMT
Perhaps, it would be a good idea to think about the recovery tool/
control-utility command that will allow achieving the same goal.
If I am not mistaken it was already proposed in the email thread.

Thanks,
S.

ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <slava.koptilin@gmail.com>:

> Hi Anton,
>
> Well, the ReadRepair feature is finally merged and that is good news :)
>
> Unfortunately, I cannot find a consensus about the whole functionality in
> any of these topics:
>  -
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
>  -
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> Also, there are no comments/discussion in JIRA. That makes me sad :(
> especially when a feature is huge, not obvious and involves changing public
> API (and that is the case, I think).
>
> Anyway, I left some comments in your pull-request at github. Please take a
> look. The most annoying thing is poorly documented code :(
> By the way, is it required to add test related to fail-over scenarios?
>
> I just checked, the IEP page and I still cannot find Jira tickets that
> should cover existing limitations/improvements.
> I would suggest creating the following tasks, at least:
>  - MVCC support
>  - Near caches
>  - Additional metrics (number of violations, number of repaired entries
> etc)
>  - Ignite C++ (It looks like, .Net is already aware of that feature)
>  - Thin clients support
>  - Perhaps, it would be useful to support different strategies to resolve
> inconsistencies
>
> What do you think?
>
> Thanks,
> S.
>
>
> ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org>:
>
>> Folks,
>>
>> Thanks to everyone for tips and reviews.
>> Yardstick checked, no performance drop found.
>> Additional measurement: RR get() is just up to 7% slower than regular get
>> on real benchmarks (8 clients, 4 servers, 3 backups)
>> Code merged to the master.
>> "Must have" tasks created and attached to the IEP.
>>
>> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av@apache.org> wrote:
>>
>> > Folks,
>> >
>> > Just a minor update.
>> >
>> > RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests
>> > left, started with 6k :)).
>> > During the analisys, I've found some tests with
>> > - unexpected repairs at tx caches
>> > - inconsistent state after the test finished (different entries across
>> the
>> > topology)
>> > For example,
>> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
>> > versions on backups in case of retry, fixed [2]
>> > - initial cache load generates not equal versions on backups, fixed [3]
>> > - testAccountTxNodeRestart causes unexpected repairs (entries have
>> > different versions), to be investigated.
>> >
>> > What's next?
>> >
>> > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
>> > becomes fully green.
>> > 2) Going to add special check after each test which will ensure caches
>> > content after the test is consistent.
>> > 2.1) The Same check can (should?) be injected to
>> > awaitPartitionMapExchange() and similar methods.
>> > 3) Update Jepsen tests with RR checks.
>> > 4) Introduce per partition RR.
>> >
>> > So, the final goal is to be sure that Ignite produces only consistent
>> data
>> > and to have a feature to solve consistency in case we gain inconsistent
>> > state somehow.
>> >
>> > Limitations?
>> >
>> > Currently, RR has some limitations, but they are not related to real
>> > production cases.
>> > In case someone interested to support, for example, MVCC or near caches,
>> > please, feel free to contribute.
>> >
>> > [1]
>> >
>> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
>> > [2]
>> >
>> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
>> > [3]
>> >
>> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
>> >
>> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av@apache.org> wrote:
>> >
>> >> Slava,
>> >>
>> >> >> I will take a look at your pull request if you don't mind.
>> >> Great news!
>> >>
>> >> >> In any way, could you please update the IEP page with the list
of
>> >> >> constraints/limitations of the proposed approach, TODOs, etc?
>> >> Not sure we should keep this at IEP until list (#4 from original
>> letter)
>> >> is not confirmed.
>> >>
>> >> Currently, I'm checking RunAll with RR enabled to almost each get
>> request.
>> >> "Almost" means: readRepair = !ctx.readThrough() &&
>> >> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
>> >> For now I have 60 failed tests and amount decreasing.
>> >>
>> >> >> For instance, I would like to see all these limitations on the
IEP
>> >> page as
>> >> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
>> >> ticket
>> >> >> in order to track all activities related to `Read Repair` feature.
>> >> Let's do this at merge day to be sure useless issues will not be
>> created.
>> >>
>> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
>> >> slava.koptilin@gmail.com> wrote:
>> >>
>> >>> Hi Anton,
>> >>>
>> >>> I will take a look at your pull request if you don't mind.
>> >>>
>> >>> In any way, could you please update the IEP page with the list of
>> >>> constraints/limitations of the proposed approach, TODOs, etc?
>> >>> For instance, I would like to see all these limitations on the IEP
>> page
>> >>> as
>> >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
>> ticket
>> >>> in order to track all activities related to `Read Repair` feature.
>> >>>
>> >>> Thanks,
>> >>> S.
>> >>>
>> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av@apache.org>:
>> >>>
>> >>> > Igniters,
>> >>> > I'm glad to introduce Read Repair feature [0] provides additional
>> >>> > consistency guarantee for Ignite.
>> >>> >
>> >>> > 1) Why we need it?
>> >>> > The detailed explanation can be found at IEP-31 [1].
>> >>> > In short, because of bugs, it's possible to gain an inconsistent
>> state.
>> >>> > We need additional features to handle this case.
>> >>> >
>> >>> > Currently we able to check cluster using Idle_verify [2] feature,
>> but
>> >>> it
>> >>> > will not fix the data, will not even tell which entries are broken.
>> >>> > Read Repair is a feature to understand which entries are broken
and
>> to
>> >>> fix
>> >>> > them.
>> >>> >
>> >>> > 1) How it works?
>> >>> > IgniteCache now able to provide special proxy [3] withReadRepair().
>> >>> > This proxy guarantee that data will be gained from all owners and
>> >>> compared.
>> >>> > In the case of consistency violation situation, data will be
>> recovered
>> >>> and
>> >>> > a special event recorded.
>> >>> >
>> >>> > 3) Naming?
>> >>> > Feature name based on Cassandra's Read Repair feature [4], which
is
>> >>> pretty
>> >>> > similar.
>> >>> >
>> >>> > 4) Limitations which can be fixed in the future?
>> >>> >   * MVCC and Near caches are not supported.
>> >>> >   * Atomic caches can be checked (false positive case is possible
on
>> >>> this
>> >>> > check), but can't be recovered.
>> >>> >   * Partial entry removal can't be recovered.
>> >>> >   * Entries streamed using data streamer (using not a "cache.put"
>> based
>> >>> > updater) and loaded by cache.load
>> >>> >   are perceived as inconsistent since they may have different
>> versions
>> >>> for
>> >>> > same keys.
>> >>> >   * Only explicit get operations are supported (getAndReplace,
>> >>> getAndPut,
>> >>> > etc can be supported in future).
>> >>> >
>> >>> > 5) What's left?
>> >>> >   * SQL/ThinClient/etc support.
>> >>> >   * Metrics (found/repaired).
>> >>> >   * Simple per-partition recovery feature able to work in the
>> >>> background in
>> >>> > addition to per-entry recovery feature.
>> >>> >
>> >>> > 6) Is code checked?
>> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
>> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for
>> every
>> >>> get()
>> >>> > request) - has a limited amount of failures (because of data
>> streamer,
>> >>> > cache.load, etc).
>> >>> >
>> >>> > Thoughts?
>> >>> >
>> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
>> >>> > [1]
>> >>> >
>> >>> >
>> >>>
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
>> >>> > [2]
>> >>> >
>> >>> >
>> >>>
>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>> >>> > [3]
>> >>> >
>> >>> >
>> >>>
>> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
>> >>> > [4]
>> >>> >
>> >>> >
>> >>>
>> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
>> >>> > [5] https://github.com/apache/ignite/pull/5656
>> >>> > [6] https://github.com/apache/ignite/pull/6575
>> >>> >
>> >>>
>> >>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message