ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anton Vinogradov ...@apache.org>
Subject Re: Read Repair (ex. Consistency Check) - review request #2
Date Thu, 11 Jul 2019 07:29:08 GMT
Nikolay,

Initial idea is to count fixes per cache.
In other words, event recording should cause metric increment.
Could you help me with RR metrics implementation as a part of your metrics
journey?

On Thu, Jul 11, 2019 at 10:18 AM Nikolay Izhikov <nizhikov@apache.org>
wrote:

> Anton,
>
> > - metrics are better for monitoring, but the Event is enough for my wish
> to cover AI with consistency check,
>
> Can you clarify, do you have plans to add metrics of RR events?
>
> I think it should be count of incosistency events per cache(maybe per
> partition).
>
>
> В Чт, 11/07/2019 в 09:53 +0300, Anton Vinogradov пишет:
> > Slava,
> >
> > Thanks for your review first!
> >
> > > > Anyway, I left some comments in your pull-request at github. Please
> take
> >
> > a
> > > > look. The most annoying thing is poorly documented code :(
> >
> > Since your comments mostly about Javadoc (does this mean that my solution
> > is so great that you ask me only to fix Javadocs :) ?),
> > I'd like to propose you to provide fixes as a PR since you have a vision
> of
> > how it should be made. I'll review them and merge shortly.
> >
> > > > By the way, is it required to add test related to fail-over
> scenarios?
> >
> > The best check is to use RR at real code.
> > For example, I'm injecting RR now to the test with concurrent
> modifications
> > and restarts [1].
> >
> > > > I just checked, the IEP page and I still cannot find Jira tickets
> that
> > > > should cover existing limitations/improvements.
> > > > I would suggest creating the following tasks, at least:
> >
> > Mostly agree with you, but
> > - MVCC is not production ready,
> > - not sure near support really required,
> > - metrics are better for monitoring, but the Event is enough for my wish
> to
> > cover AI with consistency check,
> > - do we really need Platforms and Thin Client support?
> > Also, we should not produce stillborn issue.
> > All limitations listed at proxy creation method and they definitely are
> not
> > showstoppers and may be fixed later if someone interested.
> > Сoming back to AI 3.0 discussion, we have A LOT of features and it's
> almost
> > impossible (require much more time that feature's cost) to support them
> all.
> > I will be pretty happy in case someone will do this and provide help if
> > necessary!
> >
> > > > Perhaps, it would be a good idea to think about the recovery too
> >
> > Do you mean per partition check and recovery?
> > That's a good idea, but I found it's not easy to imagine API to for such
> > tool.
> > In case you ready to assist with proper API/design this will definitely
> > help.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11973
> >
> > On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <
> slava.koptilin@gmail.com>
> > wrote:
> >
> > > Perhaps, it would be a good idea to think about the recovery tool/
> > > control-utility command that will allow achieving the same goal.
> > > If I am not mistaken it was already proposed in the email thread.
> > >
> > > Thanks,
> > > S.
> > >
> > > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <
> slava.koptilin@gmail.com>:
> > >
> > > > Hi Anton,
> > > >
> > > > Well, the ReadRepair feature is finally merged and that is good news
> :)
> > > >
> > > > Unfortunately, I cannot find a consensus about the whole
> functionality in
> > > > any of these topics:
> > > >  -
> > > >
> > >
> > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > > >  -
> > > >
> > >
> > >
> http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > > Also, there are no comments/discussion in JIRA. That makes me sad :(
> > > > especially when a feature is huge, not obvious and involves changing
> > >
> > > public
> > > > API (and that is the case, I think).
> > > >
> > > > Anyway, I left some comments in your pull-request at github. Please
> take
> > >
> > > a
> > > > look. The most annoying thing is poorly documented code :(
> > > > By the way, is it required to add test related to fail-over
> scenarios?
> > > >
> > > > I just checked, the IEP page and I still cannot find Jira tickets
> that
> > > > should cover existing limitations/improvements.
> > > > I would suggest creating the following tasks, at least:
> > > >  - MVCC support
> > > >  - Near caches
> > > >  - Additional metrics (number of violations, number of repaired
> entries
> > > > etc)
> > > >  - Ignite C++ (It looks like, .Net is already aware of that feature)
> > > >  - Thin clients support
> > > >  - Perhaps, it would be useful to support different strategies to
> resolve
> > > > inconsistencies
> > > >
> > > > What do you think?
> > > >
> > > > Thanks,
> > > > S.
> > > >
> > > >
> > > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org>:
> > > >
> > > > > Folks,
> > > > >
> > > > > Thanks to everyone for tips and reviews.
> > > > > Yardstick checked, no performance drop found.
> > > > > Additional measurement: RR get() is just up to 7% slower than
> regular
> > >
> > > get
> > > > > on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > > Code merged to the master.
> > > > > "Must have" tasks created and attached to the IEP.
> > > > >
> > > > > On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av@apache.org>
> wrote:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > Just a minor update.
> > > > > >
> > > > > > RunAll [1] with enabled ReadRepair proxy is almost green now
(~10
> > >
> > > tests
> > > > > > left, started with 6k :)).
> > > > > > During the analisys, I've found some tests with
> > > > > > - unexpected repairs at tx caches
> > > > > > - inconsistent state after the test finished (different entries
> across
> > > > >
> > > > > the
> > > > > > topology)
> > > > > > For example,
> > > > > > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates
> obsolete
> > > > > > versions on backups in case of retry, fixed [2]
> > > > > > - initial cache load generates not equal versions on backups,
> fixed
> > >
> > > [3]
> > > > > > - testAccountTxNodeRestart causes unexpected repairs (entries
> have
> > > > > > different versions), to be investigated.
> > > > > >
> > > > > > What's next?
> > > > > >
> > > > > > 1) Going to merge the solution once "RunAll with ReadRepair
> enabled"
> > > > > > becomes fully green.
> > > > > > 2) Going to add special check after each test which will ensure
> caches
> > > > > > content after the test is consistent.
> > > > > > 2.1) The Same check can (should?) be injected to
> > > > > > awaitPartitionMapExchange() and similar methods.
> > > > > > 3) Update Jepsen tests with RR checks.
> > > > > > 4) Introduce per partition RR.
> > > > > >
> > > > > > So, the final goal is to be sure that Ignite produces only
> consistent
> > > > >
> > > > > data
> > > > > > and to have a feature to solve consistency in case we gain
> > >
> > > inconsistent
> > > > > > state somehow.
> > > > > >
> > > > > > Limitations?
> > > > > >
> > > > > > Currently, RR has some limitations, but they are not related
to
> real
> > > > > > production cases.
> > > > > > In case someone interested to support, for example, MVCC or
near
> > >
> > > caches,
> > > > > > please, feel free to contribute.
> > > > > >
> > > > > > [1]
> > > > > >
> > >
> > >
> https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > > [2]
> > > > > >
> > >
> > >
> https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > > [3]
> > > > > >
> > >
> > >
> https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > >
> > > > > > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av@apache.org>
> > >
> > > wrote:
> > > > > >
> > > > > > > Slava,
> > > > > > >
> > > > > > > > > I will take a look at your pull request if you
don't mind.
> > > > > > >
> > > > > > > Great news!
> > > > > > >
> > > > > > > > > In any way, could you please update the IEP page
with the
> list of
> > > > > > > > > constraints/limitations of the proposed approach,
TODOs,
> etc?
> > > > > > >
> > > > > > > Not sure we should keep this at IEP until list (#4 from
> original
> > > > >
> > > > > letter)
> > > > > > > is not confirmed.
> > > > > > >
> > > > > > > Currently, I'm checking RunAll with RR enabled to almost
each
> get
> > > > >
> > > > > request.
> > > > > > > "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > > ctx.config().getBackups() > 0 && !ctx.isNear()
&&
> !ctx.mvccEnabled()
> > > > > > > For now I have 60 failed tests and amount decreasing.
> > > > > > >
> > > > > > > > > For instance, I would like to see all these limitations
on
> the IEP
> > > > > > >
> > > > > > > page as
> > > > > > > > > JIRA tickets. Perhaps, it would be good to create
an
> epic/umbrella
> > > > > > >
> > > > > > > ticket
> > > > > > > > > in order to track all activities related to `Read
Repair`
> feature.
> > > > > > >
> > > > > > > Let's do this at merge day to be sure useless issues will
not
> be
> > > > >
> > > > > created.
> > > > > > >
> > > > > > > On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин
<
> > > > > > > slava.koptilin@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Anton,
> > > > > > > >
> > > > > > > > I will take a look at your pull request if you don't
mind.
> > > > > > > >
> > > > > > > > In any way, could you please update the IEP page with
the
> list of
> > > > > > > > constraints/limitations of the proposed approach,
TODOs, etc?
> > > > > > > > For instance, I would like to see all these limitations
on
> the IEP
> > > > >
> > > > > page
> > > > > > > > as
> > > > > > > > JIRA tickets. Perhaps, it would be good to create
an
> epic/umbrella
> > > > >
> > > > > ticket
> > > > > > > > in order to track all activities related to `Read
Repair`
> feature.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > S.
> > > > > > > >
> > > > > > > > чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov
<av@apache.org
> >:
> > > > > > > >
> > > > > > > > > Igniters,
> > > > > > > > > I'm glad to introduce Read Repair feature [0]
provides
> additional
> > > > > > > > > consistency guarantee for Ignite.
> > > > > > > > >
> > > > > > > > > 1) Why we need it?
> > > > > > > > > The detailed explanation can be found at IEP-31
[1].
> > > > > > > > > In short, because of bugs, it's possible to gain
an
> inconsistent
> > > > >
> > > > > state.
> > > > > > > > > We need additional features to handle this case.
> > > > > > > > >
> > > > > > > > > Currently we able to check cluster using Idle_verify
[2]
> feature,
> > > > >
> > > > > but
> > > > > > > > it
> > > > > > > > > will not fix the data, will not even tell which
entries are
> > >
> > > broken.
> > > > > > > > > Read Repair is a feature to understand which
entries are
> broken
> > >
> > > and
> > > > > to
> > > > > > > > fix
> > > > > > > > > them.
> > > > > > > > >
> > > > > > > > > 1) How it works?
> > > > > > > > > IgniteCache now able to provide special proxy
[3]
> > >
> > > withReadRepair().
> > > > > > > > > This proxy guarantee that data will be gained
from all
> owners and
> > > > > > > >
> > > > > > > > compared.
> > > > > > > > > In the case of consistency violation situation,
data will
> be
> > > > >
> > > > > recovered
> > > > > > > > and
> > > > > > > > > a special event recorded.
> > > > > > > > >
> > > > > > > > > 3) Naming?
> > > > > > > > > Feature name based on Cassandra's Read Repair
feature [4],
> which
> > >
> > > is
> > > > > > > > pretty
> > > > > > > > > similar.
> > > > > > > > >
> > > > > > > > > 4) Limitations which can be fixed in the future?
> > > > > > > > >   * MVCC and Near caches are not supported.
> > > > > > > > >   * Atomic caches can be checked (false positive
case is
> possible
> > >
> > > on
> > > > > > > > this
> > > > > > > > > check), but can't be recovered.
> > > > > > > > >   * Partial entry removal can't be recovered.
> > > > > > > > >   * Entries streamed using data streamer (using
not a
> "cache.put"
> > > > >
> > > > > based
> > > > > > > > > updater) and loaded by cache.load
> > > > > > > > >   are perceived as inconsistent since they may
have
> different
> > > > >
> > > > > versions
> > > > > > > > for
> > > > > > > > > same keys.
> > > > > > > > >   * Only explicit get operations are supported
> (getAndReplace,
> > > > > > > >
> > > > > > > > getAndPut,
> > > > > > > > > etc can be supported in future).
> > > > > > > > >
> > > > > > > > > 5) What's left?
> > > > > > > > >   * SQL/ThinClient/etc support.
> > > > > > > > >   * Metrics (found/repaired).
> > > > > > > > >   * Simple per-partition recovery feature able
to work in
> the
> > > > > > > >
> > > > > > > > background in
> > > > > > > > > addition to per-entry recovery feature.
> > > > > > > > >
> > > > > > > > > 6) Is code checked?
> > > > > > > > >   * Pull Request #5656 [5] (feature) - has green
TC.
> > > > > > > > >   * Pull Request #6575 [6] (RunAll with the feature
> enabled for
> > > > >
> > > > > every
> > > > > > > > get()
> > > > > > > > > request) - has a limited amount of failures (because
of
> data
> > > > >
> > > > > streamer,
> > > > > > > > > cache.load, etc).
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > >
> > >
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > > > > > > [2]
> > > > > > > > >
> > > > > > > > >
> > >
> > >
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > > >
> > >
> > >
> https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > > > > > > [4]
> > > > > > > > >
> > > > > > > > >
> > >
> > >
> https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > > > > > > [5] https://github.com/apache/ignite/pull/5656
> > > > > > > > > [6] https://github.com/apache/ignite/pull/6575
> > > > > > > > >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message