ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anton Vinogradov ...@apache.org>
Subject Re: "Idle verify" to "Online verify"
Date Mon, 06 May 2019 09:27:11 GMT
Ivan, just to make sure ...
The discussed case will fully solve the issue [1] in case we'll also add
some strategy to reject partitions with missed updates (updateCnt==Ok,
Hash!=Ok).
For example, we may use the Quorum strategy, when the majority wins.
Sounds correct?

[1] https://issues.apache.org/jira/browse/IGNITE-10078

On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <av@apache.org> wrote:

> Ivan,
>
> Thanks for the detailed explanation.
> I'll try to implement the PoC to check the idea.
>
> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glukos@gmail.com> wrote:
>
>> > But how to keep this hash?
>> I think, we can just adopt way of storing partition update counters.
>> Update counters are:
>> 1) Kept and updated in heap, see
>> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during
>> regular cache operations, no page replacement latency issues)
>> 2) Synchronized with page memory (and with disk) on every checkpoint,
>> see GridCacheOffheapManager#saveStoreMetadata
>> 3) Stored in partition meta page, see PagePartitionMetaIO#setUpdateCounter
>> 4) On node restart, we init onheap counter with value from disk (for the
>> moment of last checkpoint) and update it to latest value during WAL
>> logical records replay
>>
>> > 2) PME is a rare operation on production cluster, but, seems, we have
>> > to check consistency in a regular way.
>> > Since we have to finish all operations before the check, should we
>> > have fake PME for maintenance check in this case?
>>  From my experience, PME happens on prod clusters from time to time
>> (several times per week), which can be enough. In case it's needed to
>> check consistency more often than regular PMEs occur, we can implement
>> command that will trigger fake PME for consistency checking.
>>
>> Best Regards,
>> Ivan Rakov
>>
>> On 29.04.2019 18:53, Anton Vinogradov wrote:
>> > Ivan, thanks for the analysis!
>> >
>> > >> With having pre-calculated partition hash value, we can
>> > automatically detect inconsistent partitions on every PME.
>> > Great idea, seems this covers all broken synс cases.
>> >
>> > It will check alive nodes in case the primary failed immediately
>> > and will check rejoining node once it finished a rebalance (PME on
>> > becoming an owner).
>> > Recovered cluster will be checked on activation PME (or even before
>> > that?).
>> > Also, warmed cluster will be still warmed after check.
>> >
>> > Have I missed some cases leads to broken sync except bugs?
>> >
>> > 1) But how to keep this hash?
>> > - It should be automatically persisted on each checkpoint (it should
>> > not require recalculation on restore, snapshots should be covered too)
>> > (and covered by WAL?).
>> > - It should be always available at RAM for every partition (even for
>> > cold partitions never updated/readed on this node) to be immediately
>> > used once all operations done on PME.
>> >
>> > Can we have special pages to keep such hashes and never allow their
>> > eviction?
>> >
>> > 2) PME is a rare operation on production cluster, but, seems, we have
>> > to check consistency in a regular way.
>> > Since we have to finish all operations before the check, should we
>> > have fake PME for maintenance check in this case?
>> >
>> > On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glukos@gmail.com
>> > <mailto:ivan.glukos@gmail.com>> wrote:
>> >
>> >     Hi Anton,
>> >
>> >     Thanks for sharing your ideas.
>> >     I think your approach should work in general. I'll just share my
>> >     concerns about possible issues that may come up.
>> >
>> >     1) Equality of update counters doesn't imply equality of
>> >     partitions content under load.
>> >     For every update, primary node generates update counter and then
>> >     update is delivered to backup node and gets applied with the
>> >     corresponding update counter. For example, there are two
>> >     transactions (A and B) that update partition X by the following
>> >     scenario:
>> >     - A updates key1 in partition X on primary node and increments
>> >     counter to 10
>> >     - B updates key2 in partition X on primary node and increments
>> >     counter to 11
>> >     - While A is still updating another keys, B is finally committed
>> >     - Update of key2 arrives to backup node and sets update counter to
>> 11
>> >     Observer will see equal update counters (11), but update of key 1
>> >     is still missing in the backup partition.
>> >     This is a fundamental problem which is being solved here:
>> >     https://issues.apache.org/jira/browse/IGNITE-10078
>> >     "Online verify" should operate with new complex update counters
>> >     which take such "update holes" into account. Otherwise, online
>> >     verify may provide false-positive inconsistency reports.
>> >
>> >     2) Acquisition and comparison of update counters is fast, but
>> >     partition hash calculation is long. We should check that update
>> >     counter remains unchanged after every K keys handled.
>> >
>> >     3)
>> >
>> >>     Another hope is that we'll be able to pause/continue scan, for
>> >>     example, we'll check 1/3 partitions today, 1/3 tomorrow, and in
>> >>     three days we'll check the whole cluster.
>> >     Totally makes sense.
>> >     We may find ourselves into a situation where some "hot" partitions
>> >     are still unprocessed, and every next attempt to calculate
>> >     partition hash fails due to another concurrent update. We should
>> >     be able to track progress of validation (% of calculation time
>> >     wasted due to concurrent operations may be a good metric, 100% is
>> >     the worst case) and provide option to stop/pause activity.
>> >     I think, pause should return an "intermediate results report" with
>> >     information about which partitions have been successfully checked.
>> >     With such report, we can resume activity later: partitions from
>> >     report will be just skipped.
>> >
>> >     4)
>> >
>> >>     Since "Idle verify" uses regular pagmem, I assume it replaces hot
>> >>     data with persisted.
>> >>     So, we have to warm up the cluster after each check.
>> >>     Are there any chances to check without cooling the cluster?
>> >     I don't see an easy way to achieve it with our page memory
>> >     architecture. We definitely can't just read pages from disk
>> >     directly: we need to synchronize page access with concurrent
>> >     update operations and checkpoints.
>> >     From my point of view, the correct way to solve this issue is
>> >     improving our page replacement [1] mechanics by making it truly
>> >     scan-resistant.
>> >
>> >     P. S. There's another possible way of achieving online verify:
>> >     instead of on-demand hash calculation, we can always keep
>> >     up-to-date hash value for every partition. We'll need to update
>> >     hash on every insert/update/remove operation, but there will be no
>> >     reordering issues as per function that we use for aggregating hash
>> >     results (+) is commutative. With having pre-calculated partition
>> >     hash value, we can automatically detect inconsistent partitions on
>> >     every PME. What do you think?
>> >
>> >     [1] -
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
>> >
>> >     Best Regards,
>> >     Ivan Rakov
>> >
>> >     On 29.04.2019 12:20, Anton Vinogradov wrote:
>> >>     Igniters and especially Ivan Rakov,
>> >>
>> >>     "Idle verify" [1] is a really cool tool, to make sure that
>> >>     cluster is consistent.
>> >>
>> >>     1) But it required to have operations paused during cluster check.
>> >>     At some clusters, this check requires hours (3-4 hours at cases I
>> >>     saw).
>> >>     I've checked the code of "idle verify" and it seems it possible
>> >>     to make it "online" with some assumptions.
>> >>
>> >>     Idea:
>> >>     Currently "Idle verify" checks that partitions hashes, generated
>> >>     this way
>> >>     while (it.hasNextX()) {
>> >>     CacheDataRow row = it.nextX();
>> >>     partHash += row.key().hashCode();
>> >>     partHash +=
>> >>
>>  Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
>> >>     }
>> >>     , are the same.
>> >>
>> >>     What if we'll generate same pairs updateCounter-partitionHash but
>> >>     will compare hashes only in case counters are the same?
>> >>     So, for example, will ask cluster to generate pairs for 64
>> >>     partitions, then will find that 55 have the same counters (was
>> >>     not updated during check) and check them.
>> >>     The rest (64-55 = 9) partitions will be re-requested and
>> >>     rechecked with an additional 55.
>> >>     This way we'll be able to check cluster is consistent even in
>> >>     сase operations are in progress (just retrying modified).
>> >>
>> >>     Risks and assumptions:
>> >>     Using this strategy we'll check the cluster's consistency ...
>> >>     eventually, and the check will take more time even on an idle
>> >>     cluster.
>> >>     In case operationsPerTimeToGeneratePartitionHashes >
>> >>     partitionsCount we'll definitely gain no progress.
>> >>     But, in case of the load is not high, we'll be able to check all
>> >>     cluster.
>> >>
>> >>     Another hope is that we'll be able to pause/continue scan, for
>> >>     example, we'll check 1/3 partitions today, 1/3 tomorrow, and in
>> >>     three days we'll check the whole cluster.
>> >>
>> >>     Have I missed something?
>> >>
>> >>     2) Since "Idle verify" uses regular pagmem, I assume it replaces
>> >>     hot data with persisted.
>> >>     So, we have to warm up the cluster after each check.
>> >>     Are there any chances to check without cooling the cluster?
>> >>
>> >>     [1]
>> >>
>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>> >
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message