ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nikolay Izhikov <nizhi...@apache.org>
Subject Re: Read Repair (ex. Consistency Check) - review request #2
Date Mon, 15 Jul 2019 13:41:58 GMT
I did the review.

пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dpavlov@apache.org>:

> Igniters, who did a review of
> https://issues.apache.org/jira/browse/IGNITE-10663 before the merge? I've
> checked both PR   https://github.com/apache/ignite/pull/5656  and Issue,
> and dev.list thread and didn't find any LGTM.
>
> Anton, since you've rejected lazy consensus in our process, we have RTC in
> that (core) module. So I'd like to know if the fix was covered by the
> review.
>
> Because you're a committer, a reviewer can be non-committer. So, who was a
> reviewer? Or was process ignored?
>
> пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <slava.koptilin@gmail.com>:
>
> > Hello Anton,
> >
> > > I'd like to propose you to provide fixes as a PR since you have a
> vision
> > of how it should be made. I'll review them and merge shortly.
> > Could you please take a look at PR:
> > https://github.com/apache/ignite/pull/6689
> >
> > > Since your comments mostly about Javadoc (does this mean that my
> solution
> > is so great that you ask me only to fix Javadocs :) ?),
> > In my humble opinion, I would consider this feature as experimental one
> (It
> > does not seem production-ready).
> > Let me clarify this with the following simple example:
> >
> >     try {
> >         atomicCache.withReadRepair().getAll(keys);
> >     }
> >     catch (CacheException e) {
> >         // What should be done here from the end-user point of view?
> >     }
> >
> > 1. Should I consider that my cluster is broken? There is no answer! The
> > false-positive result is possible.
> > 2. What should be done here in order to check/resolve the issue?
> Perhaps, I
> > should restart a node (which one?), restart the whole cluster, put a new
> > value...
> > 3. IgniteConsistencyViolationException is absolutely useless. It does not
> > provide any information about the issue and possible way to fix it.
> >
> > It seems that transactional caches are covered much better.
> >
> > > Mostly agree with you, but
> > > - MVCC is not production ready,
> > > - not sure near support really required,
> > > - metrics are better for monitoring, but the Event is enough for my
> wish
> > to
> > > cover AI with consistency check,
> > > - do we really need Platforms and Thin Client support?
> > Well, near caches are widely used and fully transactional, so I think it
> > makes sense to support the feature for near caches too.
> > .Net is already aware of 'ReadRepair'. It seems to me, that it can be
> > easily supported for C++. I don't see a reason why it should not be done
> :)
> >
> > > Do you mean per partition check and recovery? That's a good idea, but I
> > found it's not easy to imagine API to for such tool.
> > Yep, perhaps it can be done on the idle cluster via `idle-verify` command
> > with additional flag. Agreed, that this approach is not the best one
> > definitely.
> >
> > Thanks,
> > S.
> >
> > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av@apache.org>:
> >
> > > Slava,
> > >
> > > Thanks for your review first!
> > >
> > > >> Anyway, I left some comments in your pull-request at github. Please
> > take
> > > a
> > > >> look. The most annoying thing is poorly documented code :(
> > > Since your comments mostly about Javadoc (does this mean that my
> solution
> > > is so great that you ask me only to fix Javadocs :) ?),
> > > I'd like to propose you to provide fixes as a PR since you have a
> vision
> > of
> > > how it should be made. I'll review them and merge shortly.
> > >
> > > >> By the way, is it required to add test related to fail-over
> scenarios?
> > > The best check is to use RR at real code.
> > > For example, I'm injecting RR now to the test with concurrent
> > modifications
> > > and restarts [1].
> > >
> > > >> I just checked, the IEP page and I still cannot find Jira tickets
> that
> > > >> should cover existing limitations/improvements.
> > > >> I would suggest creating the following tasks, at least:
> > > Mostly agree with you, but
> > > - MVCC is not production ready,
> > > - not sure near support really required,
> > > - metrics are better for monitoring, but the Event is enough for my
> wish
> > to
> > > cover AI with consistency check,
> > > - do we really need Platforms and Thin Client support?
> > > Also, we should not produce stillborn issue.
> > > All limitations listed at proxy creation method and they definitely are
> > not
> > > showstoppers and may be fixed later if someone interested.
> > > Сoming back to AI 3.0 discussion, we have A LOT of features and it's
> > almost
> > > impossible (require much more time that feature's cost) to support them
> > > all.
> > > I will be pretty happy in case someone will do this and provide help if
> > > necessary!
> > >
> > > >> Perhaps, it would be a good idea to think about the recovery too
> > > Do you mean per partition check and recovery?
> > > That's a good idea, but I found it's not easy to imagine API to for
> such
> > > tool.
> > > In case you ready to assist with proper API/design this will definitely
> > > help.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > >
> > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > slava.koptilin@gmail.com>
> > > wrote:
> > >
> > > > Perhaps, it would be a good idea to think about the recovery tool/
> > > > control-utility command that will allow achieving the same goal.
> > > > If I am not mistaken it was already proposed in the email thread.
> > > >
> > > > Thanks,
> > > > S.
> > > >
> > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин
<
> > slava.koptilin@gmail.com
> > > >:
> > > >
> > > > > Hi Anton,
> > > > >
> > > > > Well, the ReadRepair feature is finally merged and that is good
> news
> > :)
> > > > >
> > > > > Unfortunately, I cannot find a consensus about the whole
> > functionality
> > > in
> > > > > any of these topics:
> > > > >  -
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > >  -
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > Also, there are no comments/discussion in JIRA. That makes me sad
> :(
> > > > > especially when a feature is huge, not obvious and involves
> changing
> > > > public
> > > > > API (and that is the case, I think).
> > > > >
> > > > > Anyway, I left some comments in your pull-request at github. Please
> > > take
> > > > a
> > > > > look. The most annoying thing is poorly documented code :(
> > > > > By the way, is it required to add test related to fail-over
> > scenarios?
> > > > >
> > > > > I just checked, the IEP page and I still cannot find Jira tickets
> > that
> > > > > should cover existing limitations/improvements.
> > > > > I would suggest creating the following tasks, at least:
> > > > >  - MVCC support
> > > > >  - Near caches
> > > > >  - Additional metrics (number of violations, number of repaired
> > entries
> > > > > etc)
> > > > >  - Ignite C++ (It looks like, .Net is already aware of that
> feature)
> > > > >  - Thin clients support
> > > > >  - Perhaps, it would be useful to support different strategies to
> > > resolve
> > > > > inconsistencies
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Thanks,
> > > > > S.
> > > > >
> > > > >
> > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org>:
> > > > >
> > > > >> Folks,
> > > > >>
> > > > >> Thanks to everyone for tips and reviews.
> > > > >> Yardstick checked, no performance drop found.
> > > > >> Additional measurement: RR get() is just up to 7% slower than
> > regular
> > > > get
> > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > >> Code merged to the master.
> > > > >> "Must have" tasks created and attached to the IEP.
> > > > >>
> > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av@apache.org>
> > > wrote:
> > > > >>
> > > > >> > Folks,
> > > > >> >
> > > > >> > Just a minor update.
> > > > >> >
> > > > >> > RunAll [1] with enabled ReadRepair proxy is almost green
now
> (~10
> > > > tests
> > > > >> > left, started with 6k :)).
> > > > >> > During the analisys, I've found some tests with
> > > > >> > - unexpected repairs at tx caches
> > > > >> > - inconsistent state after the test finished (different
entries
> > > across
> > > > >> the
> > > > >> > topology)
> > > > >> > For example,
> > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> > > obsolete
> > > > >> > versions on backups in case of retry, fixed [2]
> > > > >> > - initial cache load generates not equal versions on backups,
> > fixed
> > > > [3]
> > > > >> > - testAccountTxNodeRestart causes unexpected repairs (entries
> have
> > > > >> > different versions), to be investigated.
> > > > >> >
> > > > >> > What's next?
> > > > >> >
> > > > >> > 1) Going to merge the solution once "RunAll with ReadRepair
> > enabled"
> > > > >> > becomes fully green.
> > > > >> > 2) Going to add special check after each test which will
ensure
> > > caches
> > > > >> > content after the test is consistent.
> > > > >> > 2.1) The Same check can (should?) be injected to
> > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > >> > 3) Update Jepsen tests with RR checks.
> > > > >> > 4) Introduce per partition RR.
> > > > >> >
> > > > >> > So, the final goal is to be sure that Ignite produces only
> > > consistent
> > > > >> data
> > > > >> > and to have a feature to solve consistency in case we gain
> > > > inconsistent
> > > > >> > state somehow.
> > > > >> >
> > > > >> > Limitations?
> > > > >> >
> > > > >> > Currently, RR has some limitations, but they are not related
to
> > real
> > > > >> > production cases.
> > > > >> > In case someone interested to support, for example, MVCC
or near
> > > > caches,
> > > > >> > please, feel free to contribute.
> > > > >> >
> > > > >> > [1]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > >> > [2]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > >> > [3]
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > >> >
> > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av@apache.org
> >
> > > > wrote:
> > > > >> >
> > > > >> >> Slava,
> > > > >> >>
> > > > >> >> >> I will take a look at your pull request if
you don't mind.
> > > > >> >> Great news!
> > > > >> >>
> > > > >> >> >> In any way, could you please update the IEP
page with the
> list
> > > of
> > > > >> >> >> constraints/limitations of the proposed approach,
TODOs,
> etc?
> > > > >> >> Not sure we should keep this at IEP until list (#4 from
> original
> > > > >> letter)
> > > > >> >> is not confirmed.
> > > > >> >>
> > > > >> >> Currently, I'm checking RunAll with RR enabled to almost
each
> get
> > > > >> request.
> > > > >> >> "Almost" means: readRepair = !ctx.readThrough() &&
> > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear()
&&
> > > !ctx.mvccEnabled()
> > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > >> >>
> > > > >> >> >> For instance, I would like to see all these
limitations on
> the
> > > IEP
> > > > >> >> page as
> > > > >> >> >> JIRA tickets. Perhaps, it would be good to
create an
> > > epic/umbrella
> > > > >> >> ticket
> > > > >> >> >> in order to track all activities related to
`Read Repair`
> > > feature.
> > > > >> >> Let's do this at merge day to be sure useless issues
will not
> be
> > > > >> created.
> > > > >> >>
> > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин
<
> > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > >> >>
> > > > >> >>> Hi Anton,
> > > > >> >>>
> > > > >> >>> I will take a look at your pull request if you don't
mind.
> > > > >> >>>
> > > > >> >>> In any way, could you please update the IEP page
with the list
> > of
> > > > >> >>> constraints/limitations of the proposed approach,
TODOs, etc?
> > > > >> >>> For instance, I would like to see all these limitations
on the
> > IEP
> > > > >> page
> > > > >> >>> as
> > > > >> >>> JIRA tickets. Perhaps, it would be good to create
an
> > epic/umbrella
> > > > >> ticket
> > > > >> >>> in order to track all activities related to `Read
Repair`
> > feature.
> > > > >> >>>
> > > > >> >>> Thanks,
> > > > >> >>> S.
> > > > >> >>>
> > > > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov
<av@apache.org
> >:
> > > > >> >>>
> > > > >> >>> > Igniters,
> > > > >> >>> > I'm glad to introduce Read Repair feature [0]
provides
> > > additional
> > > > >> >>> > consistency guarantee for Ignite.
> > > > >> >>> >
> > > > >> >>> > 1) Why we need it?
> > > > >> >>> > The detailed explanation can be found at IEP-31
[1].
> > > > >> >>> > In short, because of bugs, it's possible to
gain an
> > inconsistent
> > > > >> state.
> > > > >> >>> > We need additional features to handle this
case.
> > > > >> >>> >
> > > > >> >>> > Currently we able to check cluster using Idle_verify
[2]
> > > feature,
> > > > >> but
> > > > >> >>> it
> > > > >> >>> > will not fix the data, will not even tell which
entries are
> > > > broken.
> > > > >> >>> > Read Repair is a feature to understand which
entries are
> > broken
> > > > and
> > > > >> to
> > > > >> >>> fix
> > > > >> >>> > them.
> > > > >> >>> >
> > > > >> >>> > 1) How it works?
> > > > >> >>> > IgniteCache now able to provide special proxy
[3]
> > > > withReadRepair().
> > > > >> >>> > This proxy guarantee that data will be gained
from all
> owners
> > > and
> > > > >> >>> compared.
> > > > >> >>> > In the case of consistency violation situation,
data will be
> > > > >> recovered
> > > > >> >>> and
> > > > >> >>> > a special event recorded.
> > > > >> >>> >
> > > > >> >>> > 3) Naming?
> > > > >> >>> > Feature name based on Cassandra's Read Repair
feature [4],
> > which
> > > > is
> > > > >> >>> pretty
> > > > >> >>> > similar.
> > > > >> >>> >
> > > > >> >>> > 4) Limitations which can be fixed in the future?
> > > > >> >>> >   * MVCC and Near caches are not supported.
> > > > >> >>> >   * Atomic caches can be checked (false positive
case is
> > > possible
> > > > on
> > > > >> >>> this
> > > > >> >>> > check), but can't be recovered.
> > > > >> >>> >   * Partial entry removal can't be recovered.
> > > > >> >>> >   * Entries streamed using data streamer (using
not a
> > > "cache.put"
> > > > >> based
> > > > >> >>> > updater) and loaded by cache.load
> > > > >> >>> >   are perceived as inconsistent since they
may have
> different
> > > > >> versions
> > > > >> >>> for
> > > > >> >>> > same keys.
> > > > >> >>> >   * Only explicit get operations are supported
> (getAndReplace,
> > > > >> >>> getAndPut,
> > > > >> >>> > etc can be supported in future).
> > > > >> >>> >
> > > > >> >>> > 5) What's left?
> > > > >> >>> >   * SQL/ThinClient/etc support.
> > > > >> >>> >   * Metrics (found/repaired).
> > > > >> >>> >   * Simple per-partition recovery feature able
to work in
> the
> > > > >> >>> background in
> > > > >> >>> > addition to per-entry recovery feature.
> > > > >> >>> >
> > > > >> >>> > 6) Is code checked?
> > > > >> >>> >   * Pull Request #5656 [5] (feature) - has
green TC.
> > > > >> >>> >   * Pull Request #6575 [6] (RunAll with the
feature enabled
> > for
> > > > >> every
> > > > >> >>> get()
> > > > >> >>> > request) - has a limited amount of failures
(because of data
> > > > >> streamer,
> > > > >> >>> > cache.load, etc).
> > > > >> >>> >
> > > > >> >>> > Thoughts?
> > > > >> >>> >
> > > > >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > > >> >>> > [1]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > >> >>> > [2]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > >> >>> > [3]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > >> >>> > [4]
> > > > >> >>> >
> > > > >> >>> >
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > > > >> >>> > [6] https://github.com/apache/ignite/pull/6575
> > > > >> >>> >
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message