ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dmitriy Pavlov <dpav...@apache.org>
Subject Re: Read Repair (ex. Consistency Check) - review request #2
Date Mon, 15 Jul 2019 13:51:50 GMT
Ok,  thank you

пн, 15 июл. 2019 г., 16:46 Nikolay Izhikov <nizhikov@apache.org>:

> I did the review.
>
> пн, 15 июля 2019 г., 16:15 Dmitriy Pavlov <dpavlov@apache.org>:
>
> > Igniters, who did a review of
> > https://issues.apache.org/jira/browse/IGNITE-10663 before the merge?
> I've
> > checked both PR   https://github.com/apache/ignite/pull/5656  and Issue,
> > and dev.list thread and didn't find any LGTM.
> >
> > Anton, since you've rejected lazy consensus in our process, we have RTC
> in
> > that (core) module. So I'd like to know if the fix was covered by the
> > review.
> >
> > Because you're a committer, a reviewer can be non-committer. So, who was
> a
> > reviewer? Or was process ignored?
> >
> > пн, 15 июл. 2019 г. в 15:37, Вячеслав Коптилин <slava.koptilin@gmail.com
> >:
> >
> > > Hello Anton,
> > >
> > > > I'd like to propose you to provide fixes as a PR since you have a
> > vision
> > > of how it should be made. I'll review them and merge shortly.
> > > Could you please take a look at PR:
> > > https://github.com/apache/ignite/pull/6689
> > >
> > > > Since your comments mostly about Javadoc (does this mean that my
> > solution
> > > is so great that you ask me only to fix Javadocs :) ?),
> > > In my humble opinion, I would consider this feature as experimental one
> > (It
> > > does not seem production-ready).
> > > Let me clarify this with the following simple example:
> > >
> > >     try {
> > >         atomicCache.withReadRepair().getAll(keys);
> > >     }
> > >     catch (CacheException e) {
> > >         // What should be done here from the end-user point of view?
> > >     }
> > >
> > > 1. Should I consider that my cluster is broken? There is no answer! The
> > > false-positive result is possible.
> > > 2. What should be done here in order to check/resolve the issue?
> > Perhaps, I
> > > should restart a node (which one?), restart the whole cluster, put a
> new
> > > value...
> > > 3. IgniteConsistencyViolationException is absolutely useless. It does
> not
> > > provide any information about the issue and possible way to fix it.
> > >
> > > It seems that transactional caches are covered much better.
> > >
> > > > Mostly agree with you, but
> > > > - MVCC is not production ready,
> > > > - not sure near support really required,
> > > > - metrics are better for monitoring, but the Event is enough for my
> > wish
> > > to
> > > > cover AI with consistency check,
> > > > - do we really need Platforms and Thin Client support?
> > > Well, near caches are widely used and fully transactional, so I think
> it
> > > makes sense to support the feature for near caches too.
> > > .Net is already aware of 'ReadRepair'. It seems to me, that it can be
> > > easily supported for C++. I don't see a reason why it should not be
> done
> > :)
> > >
> > > > Do you mean per partition check and recovery? That's a good idea,
> but I
> > > found it's not easy to imagine API to for such tool.
> > > Yep, perhaps it can be done on the idle cluster via `idle-verify`
> command
> > > with additional flag. Agreed, that this approach is not the best one
> > > definitely.
> > >
> > > Thanks,
> > > S.
> > >
> > > чт, 11 июл. 2019 г. в 09:53, Anton Vinogradov <av@apache.org>:
> > >
> > > > Slava,
> > > >
> > > > Thanks for your review first!
> > > >
> > > > >> Anyway, I left some comments in your pull-request at github.
> Please
> > > take
> > > > a
> > > > >> look. The most annoying thing is poorly documented code :(
> > > > Since your comments mostly about Javadoc (does this mean that my
> > solution
> > > > is so great that you ask me only to fix Javadocs :) ?),
> > > > I'd like to propose you to provide fixes as a PR since you have a
> > vision
> > > of
> > > > how it should be made. I'll review them and merge shortly.
> > > >
> > > > >> By the way, is it required to add test related to fail-over
> > scenarios?
> > > > The best check is to use RR at real code.
> > > > For example, I'm injecting RR now to the test with concurrent
> > > modifications
> > > > and restarts [1].
> > > >
> > > > >> I just checked, the IEP page and I still cannot find Jira tickets
> > that
> > > > >> should cover existing limitations/improvements.
> > > > >> I would suggest creating the following tasks, at least:
> > > > Mostly agree with you, but
> > > > - MVCC is not production ready,
> > > > - not sure near support really required,
> > > > - metrics are better for monitoring, but the Event is enough for my
> > wish
> > > to
> > > > cover AI with consistency check,
> > > > - do we really need Platforms and Thin Client support?
> > > > Also, we should not produce stillborn issue.
> > > > All limitations listed at proxy creation method and they definitely
> are
> > > not
> > > > showstoppers and may be fixed later if someone interested.
> > > > Сoming back to AI 3.0 discussion, we have A LOT of features and it's
> > > almost
> > > > impossible (require much more time that feature's cost) to support
> them
> > > > all.
> > > > I will be pretty happy in case someone will do this and provide help
> if
> > > > necessary!
> > > >
> > > > >> Perhaps, it would be a good idea to think about the recovery
too
> > > > Do you mean per partition check and recovery?
> > > > That's a good idea, but I found it's not easy to imagine API to for
> > such
> > > > tool.
> > > > In case you ready to assist with proper API/design this will
> definitely
> > > > help.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> > > >
> > > > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> > > > slava.koptilin@gmail.com>
> > > > wrote:
> > > >
> > > > > Perhaps, it would be a good idea to think about the recovery tool/
> > > > > control-utility command that will allow achieving the same goal.
> > > > > If I am not mistaken it was already proposed in the email thread.
> > > > >
> > > > > Thanks,
> > > > > S.
> > > > >
> > > > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин
<
> > > slava.koptilin@gmail.com
> > > > >:
> > > > >
> > > > > > Hi Anton,
> > > > > >
> > > > > > Well, the ReadRepair feature is finally merged and that is good
> > news
> > > :)
> > > > > >
> > > > > > Unfortunately, I cannot find a consensus about the whole
> > > functionality
> > > > in
> > > > > > any of these topics:
> > > > > >  -
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > > > >  -
> > > > > >
> > > > >
> > > >
> > >
> >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > > > Also, there are no comments/discussion in JIRA. That makes me
sad
> > :(
> > > > > > especially when a feature is huge, not obvious and involves
> > changing
> > > > > public
> > > > > > API (and that is the case, I think).
> > > > > >
> > > > > > Anyway, I left some comments in your pull-request at github.
> Please
> > > > take
> > > > > a
> > > > > > look. The most annoying thing is poorly documented code :(
> > > > > > By the way, is it required to add test related to fail-over
> > > scenarios?
> > > > > >
> > > > > > I just checked, the IEP page and I still cannot find Jira tickets
> > > that
> > > > > > should cover existing limitations/improvements.
> > > > > > I would suggest creating the following tasks, at least:
> > > > > >  - MVCC support
> > > > > >  - Near caches
> > > > > >  - Additional metrics (number of violations, number of repaired
> > > entries
> > > > > > etc)
> > > > > >  - Ignite C++ (It looks like, .Net is already aware of that
> > feature)
> > > > > >  - Thin clients support
> > > > > >  - Perhaps, it would be useful to support different strategies
to
> > > > resolve
> > > > > > inconsistencies
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Thanks,
> > > > > > S.
> > > > > >
> > > > > >
> > > > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org>:
> > > > > >
> > > > > >> Folks,
> > > > > >>
> > > > > >> Thanks to everyone for tips and reviews.
> > > > > >> Yardstick checked, no performance drop found.
> > > > > >> Additional measurement: RR get() is just up to 7% slower
than
> > > regular
> > > > > get
> > > > > >> on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > >> Code merged to the master.
> > > > > >> "Must have" tasks created and attached to the IEP.
> > > > > >>
> > > > > >> On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av@apache.org
> >
> > > > wrote:
> > > > > >>
> > > > > >> > Folks,
> > > > > >> >
> > > > > >> > Just a minor update.
> > > > > >> >
> > > > > >> > RunAll [1] with enabled ReadRepair proxy is almost
green now
> > (~10
> > > > > tests
> > > > > >> > left, started with 6k :)).
> > > > > >> > During the analisys, I've found some tests with
> > > > > >> > - unexpected repairs at tx caches
> > > > > >> > - inconsistent state after the test finished (different
> entries
> > > > across
> > > > > >> the
> > > > > >> > topology)
> > > > > >> > For example,
> > > > > >> > - testInvokeAllAppliedOnceOnBinaryTypeRegistration
generates
> > > > obsolete
> > > > > >> > versions on backups in case of retry, fixed [2]
> > > > > >> > - initial cache load generates not equal versions on
backups,
> > > fixed
> > > > > [3]
> > > > > >> > - testAccountTxNodeRestart causes unexpected repairs
(entries
> > have
> > > > > >> > different versions), to be investigated.
> > > > > >> >
> > > > > >> > What's next?
> > > > > >> >
> > > > > >> > 1) Going to merge the solution once "RunAll with ReadRepair
> > > enabled"
> > > > > >> > becomes fully green.
> > > > > >> > 2) Going to add special check after each test which
will
> ensure
> > > > caches
> > > > > >> > content after the test is consistent.
> > > > > >> > 2.1) The Same check can (should?) be injected to
> > > > > >> > awaitPartitionMapExchange() and similar methods.
> > > > > >> > 3) Update Jepsen tests with RR checks.
> > > > > >> > 4) Introduce per partition RR.
> > > > > >> >
> > > > > >> > So, the final goal is to be sure that Ignite produces
only
> > > > consistent
> > > > > >> data
> > > > > >> > and to have a feature to solve consistency in case
we gain
> > > > > inconsistent
> > > > > >> > state somehow.
> > > > > >> >
> > > > > >> > Limitations?
> > > > > >> >
> > > > > >> > Currently, RR has some limitations, but they are not
related
> to
> > > real
> > > > > >> > production cases.
> > > > > >> > In case someone interested to support, for example,
MVCC or
> near
> > > > > caches,
> > > > > >> > please, feel free to contribute.
> > > > > >> >
> > > > > >> > [1]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > >> > [2]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > >> > [3]
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > >> >
> > > > > >> > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <
> av@apache.org
> > >
> > > > > wrote:
> > > > > >> >
> > > > > >> >> Slava,
> > > > > >> >>
> > > > > >> >> >> I will take a look at your pull request
if you don't mind.
> > > > > >> >> Great news!
> > > > > >> >>
> > > > > >> >> >> In any way, could you please update the
IEP page with the
> > list
> > > > of
> > > > > >> >> >> constraints/limitations of the proposed
approach, TODOs,
> > etc?
> > > > > >> >> Not sure we should keep this at IEP until list
(#4 from
> > original
> > > > > >> letter)
> > > > > >> >> is not confirmed.
> > > > > >> >>
> > > > > >> >> Currently, I'm checking RunAll with RR enabled
to almost each
> > get
> > > > > >> request.
> > > > > >> >> "Almost" means: readRepair = !ctx.readThrough()
&&
> > > > > >> >> ctx.config().getBackups() > 0 && !ctx.isNear()
&&
> > > > !ctx.mvccEnabled()
> > > > > >> >> For now I have 60 failed tests and amount decreasing.
> > > > > >> >>
> > > > > >> >> >> For instance, I would like to see all
these limitations on
> > the
> > > > IEP
> > > > > >> >> page as
> > > > > >> >> >> JIRA tickets. Perhaps, it would be good
to create an
> > > > epic/umbrella
> > > > > >> >> ticket
> > > > > >> >> >> in order to track all activities related
to `Read Repair`
> > > > feature.
> > > > > >> >> Let's do this at merge day to be sure useless issues
will not
> > be
> > > > > >> created.
> > > > > >> >>
> > > > > >> >> On Fri, Jun 28, 2019 at 2:01 PM Вячеслав
Коптилин <
> > > > > >> >> slava.koptilin@gmail.com> wrote:
> > > > > >> >>
> > > > > >> >>> Hi Anton,
> > > > > >> >>>
> > > > > >> >>> I will take a look at your pull request if
you don't mind.
> > > > > >> >>>
> > > > > >> >>> In any way, could you please update the IEP
page with the
> list
> > > of
> > > > > >> >>> constraints/limitations of the proposed approach,
TODOs,
> etc?
> > > > > >> >>> For instance, I would like to see all these
limitations on
> the
> > > IEP
> > > > > >> page
> > > > > >> >>> as
> > > > > >> >>> JIRA tickets. Perhaps, it would be good to
create an
> > > epic/umbrella
> > > > > >> ticket
> > > > > >> >>> in order to track all activities related to
`Read Repair`
> > > feature.
> > > > > >> >>>
> > > > > >> >>> Thanks,
> > > > > >> >>> S.
> > > > > >> >>>
> > > > > >> >>> чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov
<
> av@apache.org
> > >:
> > > > > >> >>>
> > > > > >> >>> > Igniters,
> > > > > >> >>> > I'm glad to introduce Read Repair feature
[0] provides
> > > > additional
> > > > > >> >>> > consistency guarantee for Ignite.
> > > > > >> >>> >
> > > > > >> >>> > 1) Why we need it?
> > > > > >> >>> > The detailed explanation can be found
at IEP-31 [1].
> > > > > >> >>> > In short, because of bugs, it's possible
to gain an
> > > inconsistent
> > > > > >> state.
> > > > > >> >>> > We need additional features to handle
this case.
> > > > > >> >>> >
> > > > > >> >>> > Currently we able to check cluster using
Idle_verify [2]
> > > > feature,
> > > > > >> but
> > > > > >> >>> it
> > > > > >> >>> > will not fix the data, will not even tell
which entries
> are
> > > > > broken.
> > > > > >> >>> > Read Repair is a feature to understand
which entries are
> > > broken
> > > > > and
> > > > > >> to
> > > > > >> >>> fix
> > > > > >> >>> > them.
> > > > > >> >>> >
> > > > > >> >>> > 1) How it works?
> > > > > >> >>> > IgniteCache now able to provide special
proxy [3]
> > > > > withReadRepair().
> > > > > >> >>> > This proxy guarantee that data will be
gained from all
> > owners
> > > > and
> > > > > >> >>> compared.
> > > > > >> >>> > In the case of consistency violation situation,
data will
> be
> > > > > >> recovered
> > > > > >> >>> and
> > > > > >> >>> > a special event recorded.
> > > > > >> >>> >
> > > > > >> >>> > 3) Naming?
> > > > > >> >>> > Feature name based on Cassandra's Read
Repair feature [4],
> > > which
> > > > > is
> > > > > >> >>> pretty
> > > > > >> >>> > similar.
> > > > > >> >>> >
> > > > > >> >>> > 4) Limitations which can be fixed in the
future?
> > > > > >> >>> >   * MVCC and Near caches are not supported.
> > > > > >> >>> >   * Atomic caches can be checked (false
positive case is
> > > > possible
> > > > > on
> > > > > >> >>> this
> > > > > >> >>> > check), but can't be recovered.
> > > > > >> >>> >   * Partial entry removal can't be recovered.
> > > > > >> >>> >   * Entries streamed using data streamer
(using not a
> > > > "cache.put"
> > > > > >> based
> > > > > >> >>> > updater) and loaded by cache.load
> > > > > >> >>> >   are perceived as inconsistent since
they may have
> > different
> > > > > >> versions
> > > > > >> >>> for
> > > > > >> >>> > same keys.
> > > > > >> >>> >   * Only explicit get operations are supported
> > (getAndReplace,
> > > > > >> >>> getAndPut,
> > > > > >> >>> > etc can be supported in future).
> > > > > >> >>> >
> > > > > >> >>> > 5) What's left?
> > > > > >> >>> >   * SQL/ThinClient/etc support.
> > > > > >> >>> >   * Metrics (found/repaired).
> > > > > >> >>> >   * Simple per-partition recovery feature
able to work in
> > the
> > > > > >> >>> background in
> > > > > >> >>> > addition to per-entry recovery feature.
> > > > > >> >>> >
> > > > > >> >>> > 6) Is code checked?
> > > > > >> >>> >   * Pull Request #5656 [5] (feature) -
has green TC.
> > > > > >> >>> >   * Pull Request #6575 [6] (RunAll with
the feature
> enabled
> > > for
> > > > > >> every
> > > > > >> >>> get()
> > > > > >> >>> > request) - has a limited amount of failures
(because of
> data
> > > > > >> streamer,
> > > > > >> >>> > cache.load, etc).
> > > > > >> >>> >
> > > > > >> >>> > Thoughts?
> > > > > >> >>> >
> > > > > >> >>> > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > > > >> >>> > [1]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > > >> >>> > [2]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > > >> >>> > [3]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > > >> >>> > [4]
> > > > > >> >>> >
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > > >> >>> > [5] https://github.com/apache/ignite/pull/5656
> > > > > >> >>> > [6] https://github.com/apache/ignite/pull/6575
> > > > > >> >>> >
> > > > > >> >>>
> > > > > >> >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message