ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Вячеслав Коптилин <slava.kopti...@gmail.com>
Subject Re: Read Repair (ex. Consistency Check) - review request #2
Date Wed, 10 Jul 2019 12:33:33 GMT
Hi Anton,

Well, the ReadRepair feature is finally merged and that is good news :)

Unfortunately, I cannot find a consensus about the whole functionality in
any of these topics:
 -
http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
 -
http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
Also, there are no comments/discussion in JIRA. That makes me sad :(
especially when a feature is huge, not obvious and involves changing public
API (and that is the case, I think).

Anyway, I left some comments in your pull-request at github. Please take a
look. The most annoying thing is poorly documented code :(
By the way, is it required to add test related to fail-over scenarios?

I just checked, the IEP page and I still cannot find Jira tickets that
should cover existing limitations/improvements.
I would suggest creating the following tasks, at least:
 - MVCC support
 - Near caches
 - Additional metrics (number of violations, number of repaired entries etc)
 - Ignite C++ (It looks like, .Net is already aware of that feature)
 - Thin clients support
 - Perhaps, it would be useful to support different strategies to resolve
inconsistencies

What do you think?

Thanks,
S.


ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org>:

> Folks,
>
> Thanks to everyone for tips and reviews.
> Yardstick checked, no performance drop found.
> Additional measurement: RR get() is just up to 7% slower than regular get
> on real benchmarks (8 clients, 4 servers, 3 backups)
> Code merged to the master.
> "Must have" tasks created and attached to the IEP.
>
> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av@apache.org> wrote:
>
> > Folks,
> >
> > Just a minor update.
> >
> > RunAll [1] with enabled ReadRepair proxy is almost green now (~10 tests
> > left, started with 6k :)).
> > During the analisys, I've found some tests with
> > - unexpected repairs at tx caches
> > - inconsistent state after the test finished (different entries across
> the
> > topology)
> > For example,
> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
> > versions on backups in case of retry, fixed [2]
> > - initial cache load generates not equal versions on backups, fixed [3]
> > - testAccountTxNodeRestart causes unexpected repairs (entries have
> > different versions), to be investigated.
> >
> > What's next?
> >
> > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> > becomes fully green.
> > 2) Going to add special check after each test which will ensure caches
> > content after the test is consistent.
> > 2.1) The Same check can (should?) be injected to
> > awaitPartitionMapExchange() and similar methods.
> > 3) Update Jepsen tests with RR checks.
> > 4) Introduce per partition RR.
> >
> > So, the final goal is to be sure that Ignite produces only consistent
> data
> > and to have a feature to solve consistency in case we gain inconsistent
> > state somehow.
> >
> > Limitations?
> >
> > Currently, RR has some limitations, but they are not related to real
> > production cases.
> > In case someone interested to support, for example, MVCC or near caches,
> > please, feel free to contribute.
> >
> > [1]
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > [2]
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > [3]
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> >
> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av@apache.org> wrote:
> >
> >> Slava,
> >>
> >> >> I will take a look at your pull request if you don't mind.
> >> Great news!
> >>
> >> >> In any way, could you please update the IEP page with the list of
> >> >> constraints/limitations of the proposed approach, TODOs, etc?
> >> Not sure we should keep this at IEP until list (#4 from original letter)
> >> is not confirmed.
> >>
> >> Currently, I'm checking RunAll with RR enabled to almost each get
> request.
> >> "Almost" means: readRepair = !ctx.readThrough() &&
> >> ctx.config().getBackups() > 0 && !ctx.isNear() && !ctx.mvccEnabled()
> >> For now I have 60 failed tests and amount decreasing.
> >>
> >> >> For instance, I would like to see all these limitations on the IEP
> >> page as
> >> >> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> >> ticket
> >> >> in order to track all activities related to `Read Repair` feature.
> >> Let's do this at merge day to be sure useless issues will not be
> created.
> >>
> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин <
> >> slava.koptilin@gmail.com> wrote:
> >>
> >>> Hi Anton,
> >>>
> >>> I will take a look at your pull request if you don't mind.
> >>>
> >>> In any way, could you please update the IEP page with the list of
> >>> constraints/limitations of the proposed approach, TODOs, etc?
> >>> For instance, I would like to see all these limitations on the IEP page
> >>> as
> >>> JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> ticket
> >>> in order to track all activities related to `Read Repair` feature.
> >>>
> >>> Thanks,
> >>> S.
> >>>
> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av@apache.org>:
> >>>
> >>> > Igniters,
> >>> > I'm glad to introduce Read Repair feature [0] provides additional
> >>> > consistency guarantee for Ignite.
> >>> >
> >>> > 1) Why we need it?
> >>> > The detailed explanation can be found at IEP-31 [1].
> >>> > In short, because of bugs, it's possible to gain an inconsistent
> state.
> >>> > We need additional features to handle this case.
> >>> >
> >>> > Currently we able to check cluster using Idle_verify [2] feature, but
> >>> it
> >>> > will not fix the data, will not even tell which entries are broken.
> >>> > Read Repair is a feature to understand which entries are broken and
> to
> >>> fix
> >>> > them.
> >>> >
> >>> > 1) How it works?
> >>> > IgniteCache now able to provide special proxy [3] withReadRepair().
> >>> > This proxy guarantee that data will be gained from all owners and
> >>> compared.
> >>> > In the case of consistency violation situation, data will be
> recovered
> >>> and
> >>> > a special event recorded.
> >>> >
> >>> > 3) Naming?
> >>> > Feature name based on Cassandra's Read Repair feature [4], which is
> >>> pretty
> >>> > similar.
> >>> >
> >>> > 4) Limitations which can be fixed in the future?
> >>> >   * MVCC and Near caches are not supported.
> >>> >   * Atomic caches can be checked (false positive case is possible on
> >>> this
> >>> > check), but can't be recovered.
> >>> >   * Partial entry removal can't be recovered.
> >>> >   * Entries streamed using data streamer (using not a "cache.put"
> based
> >>> > updater) and loaded by cache.load
> >>> >   are perceived as inconsistent since they may have different
> versions
> >>> for
> >>> > same keys.
> >>> >   * Only explicit get operations are supported (getAndReplace,
> >>> getAndPut,
> >>> > etc can be supported in future).
> >>> >
> >>> > 5) What's left?
> >>> >   * SQL/ThinClient/etc support.
> >>> >   * Metrics (found/repaired).
> >>> >   * Simple per-partition recovery feature able to work in the
> >>> background in
> >>> > addition to per-entry recovery feature.
> >>> >
> >>> > 6) Is code checked?
> >>> >   * Pull Request #5656 [5] (feature) - has green TC.
> >>> >   * Pull Request #6575 [6] (RunAll with the feature enabled for every
> >>> get()
> >>> > request) - has a limited amount of failures (because of data
> streamer,
> >>> > cache.load, etc).
> >>> >
> >>> > Thoughts?
> >>> >
> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> >>> > [1]
> >>> >
> >>> >
> >>>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> >>> > [2]
> >>> >
> >>> >
> >>>
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> >>> > [3]
> >>> >
> >>> >
> >>>
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> >>> > [4]
> >>> >
> >>> >
> >>>
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> >>> > [5] https://github.com/apache/ignite/pull/5656
> >>> > [6] https://github.com/apache/ignite/pull/6575
> >>> >
> >>>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message