ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nikolay Izhikov <nizhi...@apache.org>
Subject Re: Read Repair (ex. Consistency Check) - review request #2
Date Thu, 11 Jul 2019 07:20:47 GMT
Anton,

> - metrics are better for monitoring, but the Event is enough for my wish to cover AI
with consistency check,

Can you clarify, do you have plans to add metrics of RR events?

I think it should be count of incosistency events per cache(maybe per partition).


В Чт, 11/07/2019 в 09:53 +0300, Anton Vinogradov пишет:
> Slava,
> 
> Thanks for your review first!
> 
> > > Anyway, I left some comments in your pull-request at github. Please take
> 
> a
> > > look. The most annoying thing is poorly documented code :(
> 
> Since your comments mostly about Javadoc (does this mean that my solution
> is so great that you ask me only to fix Javadocs :) ?),
> I'd like to propose you to provide fixes as a PR since you have a vision of
> how it should be made. I'll review them and merge shortly.
> 
> > > By the way, is it required to add test related to fail-over scenarios?
> 
> The best check is to use RR at real code.
> For example, I'm injecting RR now to the test with concurrent modifications
> and restarts [1].
> 
> > > I just checked, the IEP page and I still cannot find Jira tickets that
> > > should cover existing limitations/improvements.
> > > I would suggest creating the following tasks, at least:
> 
> Mostly agree with you, but
> - MVCC is not production ready,
> - not sure near support really required,
> - metrics are better for monitoring, but the Event is enough for my wish to
> cover AI with consistency check,
> - do we really need Platforms and Thin Client support?
> Also, we should not produce stillborn issue.
> All limitations listed at proxy creation method and they definitely are not
> showstoppers and may be fixed later if someone interested.
> Сoming back to AI 3.0 discussion, we have A LOT of features and it's almost
> impossible (require much more time that feature's cost) to support them all.
> I will be pretty happy in case someone will do this and provide help if
> necessary!
> 
> > > Perhaps, it would be a good idea to think about the recovery too
> 
> Do you mean per partition check and recovery?
> That's a good idea, but I found it's not easy to imagine API to for such
> tool.
> In case you ready to assist with proper API/design this will definitely
> help.
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-11973
> 
> On Wed, Jul 10, 2019 at 3:43 PM Вячеслав Коптилин <slava.koptilin@gmail.com>
> wrote:
> 
> > Perhaps, it would be a good idea to think about the recovery tool/
> > control-utility command that will allow achieving the same goal.
> > If I am not mistaken it was already proposed in the email thread.
> > 
> > Thanks,
> > S.
> > 
> > ср, 10 июл. 2019 г. в 15:33, Вячеслав Коптилин <slava.koptilin@gmail.com>:
> > 
> > > Hi Anton,
> > > 
> > > Well, the ReadRepair feature is finally merged and that is good news :)
> > > 
> > > Unfortunately, I cannot find a consensus about the whole functionality in
> > > any of these topics:
> > >  -
> > > 
> > 
> > http://apache-ignite-developers.2346864.n4.nabble.com/Consistency-check-and-fix-review-request-td41629.html
> > >  -
> > > 
> > 
> > http://apache-ignite-developers.2346864.n4.nabble.com/Read-Repair-ex-Consistency-Check-review-request-2-td42421.html
> > > Also, there are no comments/discussion in JIRA. That makes me sad :(
> > > especially when a feature is huge, not obvious and involves changing
> > 
> > public
> > > API (and that is the case, I think).
> > > 
> > > Anyway, I left some comments in your pull-request at github. Please take
> > 
> > a
> > > look. The most annoying thing is poorly documented code :(
> > > By the way, is it required to add test related to fail-over scenarios?
> > > 
> > > I just checked, the IEP page and I still cannot find Jira tickets that
> > > should cover existing limitations/improvements.
> > > I would suggest creating the following tasks, at least:
> > >  - MVCC support
> > >  - Near caches
> > >  - Additional metrics (number of violations, number of repaired entries
> > > etc)
> > >  - Ignite C++ (It looks like, .Net is already aware of that feature)
> > >  - Thin clients support
> > >  - Perhaps, it would be useful to support different strategies to resolve
> > > inconsistencies
> > > 
> > > What do you think?
> > > 
> > > Thanks,
> > > S.
> > > 
> > > 
> > > ср, 10 июл. 2019 г. в 10:16, Anton Vinogradov <av@apache.org>:
> > > 
> > > > Folks,
> > > > 
> > > > Thanks to everyone for tips and reviews.
> > > > Yardstick checked, no performance drop found.
> > > > Additional measurement: RR get() is just up to 7% slower than regular
> > 
> > get
> > > > on real benchmarks (8 clients, 4 servers, 3 backups)
> > > > Code merged to the master.
> > > > "Must have" tasks created and attached to the IEP.
> > > > 
> > > > On Thu, Jul 4, 2019 at 12:18 PM Anton Vinogradov <av@apache.org>
wrote:
> > > > 
> > > > > Folks,
> > > > > 
> > > > > Just a minor update.
> > > > > 
> > > > > RunAll [1] with enabled ReadRepair proxy is almost green now (~10
> > 
> > tests
> > > > > left, started with 6k :)).
> > > > > During the analisys, I've found some tests with
> > > > > - unexpected repairs at tx caches
> > > > > - inconsistent state after the test finished (different entries across
> > > > 
> > > > the
> > > > > topology)
> > > > > For example,
> > > > > - testInvokeAllAppliedOnceOnBinaryTypeRegistration generates obsolete
> > > > > versions on backups in case of retry, fixed [2]
> > > > > - initial cache load generates not equal versions on backups, fixed
> > 
> > [3]
> > > > > - testAccountTxNodeRestart causes unexpected repairs (entries have
> > > > > different versions), to be investigated.
> > > > > 
> > > > > What's next?
> > > > > 
> > > > > 1) Going to merge the solution once "RunAll with ReadRepair enabled"
> > > > > becomes fully green.
> > > > > 2) Going to add special check after each test which will ensure caches
> > > > > content after the test is consistent.
> > > > > 2.1) The Same check can (should?) be injected to
> > > > > awaitPartitionMapExchange() and similar methods.
> > > > > 3) Update Jepsen tests with RR checks.
> > > > > 4) Introduce per partition RR.
> > > > > 
> > > > > So, the final goal is to be sure that Ignite produces only consistent
> > > > 
> > > > data
> > > > > and to have a feature to solve consistency in case we gain
> > 
> > inconsistent
> > > > > state somehow.
> > > > > 
> > > > > Limitations?
> > > > > 
> > > > > Currently, RR has some limitations, but they are not related to real
> > > > > production cases.
> > > > > In case someone interested to support, for example, MVCC or near
> > 
> > caches,
> > > > > please, feel free to contribute.
> > > > > 
> > > > > [1]
> > > > > 
> > 
> > https://mtcga.gridgain.com/pr.html?serverId=apache&suiteId=IgniteTests24Java8_RunAll&branchForTc=pull/6575/head&action=Latest
> > > > > [2]
> > > > > 
> > 
> > https://github.com/apache/ignite/pull/5656/commits/6f6ec4434095e692af209c61833a350f3013408c
> > > > > [3]
> > > > > 
> > 
> > https://github.com/apache/ignite/pull/5656/commits/255e552b474839e470c66a77e74e3c807bc76f13
> > > > > 
> > > > > On Fri, Jun 28, 2019 at 2:41 PM Anton Vinogradov <av@apache.org>
> > 
> > wrote:
> > > > > 
> > > > > > Slava,
> > > > > > 
> > > > > > > > I will take a look at your pull request if you don't
mind.
> > > > > > 
> > > > > > Great news!
> > > > > > 
> > > > > > > > In any way, could you please update the IEP page with
the list of
> > > > > > > > constraints/limitations of the proposed approach,
TODOs, etc?
> > > > > > 
> > > > > > Not sure we should keep this at IEP until list (#4 from original
> > > > 
> > > > letter)
> > > > > > is not confirmed.
> > > > > > 
> > > > > > Currently, I'm checking RunAll with RR enabled to almost each
get
> > > > 
> > > > request.
> > > > > > "Almost" means: readRepair = !ctx.readThrough() &&
> > > > > > ctx.config().getBackups() > 0 && !ctx.isNear() &&
!ctx.mvccEnabled()
> > > > > > For now I have 60 failed tests and amount decreasing.
> > > > > > 
> > > > > > > > For instance, I would like to see all these limitations
on the IEP
> > > > > > 
> > > > > > page as
> > > > > > > > JIRA tickets. Perhaps, it would be good to create
an epic/umbrella
> > > > > > 
> > > > > > ticket
> > > > > > > > in order to track all activities related to `Read
Repair` feature.
> > > > > > 
> > > > > > Let's do this at merge day to be sure useless issues will not
be
> > > > 
> > > > created.
> > > > > > 
> > > > > > On Fri, Jun 28, 2019 at 2:01 PM Вячеслав Коптилин
<
> > > > > > slava.koptilin@gmail.com> wrote:
> > > > > > 
> > > > > > > Hi Anton,
> > > > > > > 
> > > > > > > I will take a look at your pull request if you don't mind.
> > > > > > > 
> > > > > > > In any way, could you please update the IEP page with the
list of
> > > > > > > constraints/limitations of the proposed approach, TODOs,
etc?
> > > > > > > For instance, I would like to see all these limitations
on the IEP
> > > > 
> > > > page
> > > > > > > as
> > > > > > > JIRA tickets. Perhaps, it would be good to create an epic/umbrella
> > > > 
> > > > ticket
> > > > > > > in order to track all activities related to `Read Repair`
feature.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > S.
> > > > > > > 
> > > > > > > чт, 20 июн. 2019 г. в 14:15, Anton Vinogradov <av@apache.org>:
> > > > > > > 
> > > > > > > > Igniters,
> > > > > > > > I'm glad to introduce Read Repair feature [0] provides
additional
> > > > > > > > consistency guarantee for Ignite.
> > > > > > > > 
> > > > > > > > 1) Why we need it?
> > > > > > > > The detailed explanation can be found at IEP-31 [1].
> > > > > > > > In short, because of bugs, it's possible to gain an
inconsistent
> > > > 
> > > > state.
> > > > > > > > We need additional features to handle this case.
> > > > > > > > 
> > > > > > > > Currently we able to check cluster using Idle_verify
[2] feature,
> > > > 
> > > > but
> > > > > > > it
> > > > > > > > will not fix the data, will not even tell which entries
are
> > 
> > broken.
> > > > > > > > Read Repair is a feature to understand which entries
are broken
> > 
> > and
> > > > to
> > > > > > > fix
> > > > > > > > them.
> > > > > > > > 
> > > > > > > > 1) How it works?
> > > > > > > > IgniteCache now able to provide special proxy [3]
> > 
> > withReadRepair().
> > > > > > > > This proxy guarantee that data will be gained from
all owners and
> > > > > > > 
> > > > > > > compared.
> > > > > > > > In the case of consistency violation situation, data
will be
> > > > 
> > > > recovered
> > > > > > > and
> > > > > > > > a special event recorded.
> > > > > > > > 
> > > > > > > > 3) Naming?
> > > > > > > > Feature name based on Cassandra's Read Repair feature
[4], which
> > 
> > is
> > > > > > > pretty
> > > > > > > > similar.
> > > > > > > > 
> > > > > > > > 4) Limitations which can be fixed in the future?
> > > > > > > >   * MVCC and Near caches are not supported.
> > > > > > > >   * Atomic caches can be checked (false positive case
is possible
> > 
> > on
> > > > > > > this
> > > > > > > > check), but can't be recovered.
> > > > > > > >   * Partial entry removal can't be recovered.
> > > > > > > >   * Entries streamed using data streamer (using not
a "cache.put"
> > > > 
> > > > based
> > > > > > > > updater) and loaded by cache.load
> > > > > > > >   are perceived as inconsistent since they may have
different
> > > > 
> > > > versions
> > > > > > > for
> > > > > > > > same keys.
> > > > > > > >   * Only explicit get operations are supported (getAndReplace,
> > > > > > > 
> > > > > > > getAndPut,
> > > > > > > > etc can be supported in future).
> > > > > > > > 
> > > > > > > > 5) What's left?
> > > > > > > >   * SQL/ThinClient/etc support.
> > > > > > > >   * Metrics (found/repaired).
> > > > > > > >   * Simple per-partition recovery feature able to
work in the
> > > > > > > 
> > > > > > > background in
> > > > > > > > addition to per-entry recovery feature.
> > > > > > > > 
> > > > > > > > 6) Is code checked?
> > > > > > > >   * Pull Request #5656 [5] (feature) - has green TC.
> > > > > > > >   * Pull Request #6575 [6] (RunAll with the feature
enabled for
> > > > 
> > > > every
> > > > > > > get()
> > > > > > > > request) - has a limited amount of failures (because
of data
> > > > 
> > > > streamer,
> > > > > > > > cache.load, etc).
> > > > > > > > 
> > > > > > > > Thoughts?
> > > > > > > > 
> > > > > > > > [0] https://issues.apache.org/jira/browse/IGNITE-10663
> > > > > > > > [1]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-31+Consistency+check+and+fix
> > > > > > > > [2]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> > > > > > > > [3]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://github.com/apache/ignite/blob/27b6105ecc175b61e0aef59887830588dfc388ef/modules/core/src/main/java/org/apache/ignite/IgniteCache.java#L140
> > > > > > > > [4]
> > > > > > > > 
> > > > > > > > 
> > 
> > https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
> > > > > > > > [5] https://github.com/apache/ignite/pull/5656
> > > > > > > > [6] https://github.com/apache/ignite/pull/6575
> > > > > > > > 

Mime
View raw message