ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexey Goncharuk <alexey.goncha...@gmail.com>
Subject Re: MVCC version format
Date Fri, 15 Dec 2017 11:37:22 GMT
Vladimir,

Although at the first glance this seems to be ok for a local solution, I
have a few concerns wrt distributed version behavior:

1) I cannot say that epoch switch in this approach will be an extremely
rare event. Let's say a user regularly deploys a new version of application
code to a large Ignite cluster and restarts Ignite nodes for such an
upgrade. If version coordinator is not dedicated, then on average version
coordinator switch will occur N/2 times (N is the size of the cluster), in
the worst case the switch will happen N times. At this rate, epoch switch
may occur maybe once a month depending on a cluster size? The user will
need to take care of the node order restart.
2) What happens if a node went offline, then epoch switched twice, then
node gets back to cluster again (a classic ABA)? The only solution I see
here is that we track epoch switches and drop nodes that were offline
during the epoch switch forever. This may cause problems if a user has
partition loss policy = READ_WRITE_SAFE. Let's say a partition was offline,
but updates to other partitions were allowed. If an epoch switch happens at
this moment, the partition is lost forever.
3) How fast will we be able to flip the frozen bit for the whole dataset if
we are looking at terabytes per node? Given (1), we may be in trouble here.

2017-12-15 12:57 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:

> Igniters,
>
> As you probably know, we are working on MVCC feature [1]. When MVCC is
> enabled every cache operation/tx require one unique version for read and
> possibly one version for write. Versions are assigned by a single node
> called *coordinator*.
>
> Requirements for version:
> 1) Must be increasing and comparable, so that we understand which operation
> occurred earlier.
> 2) Support coordinator failtover
> 3) As compact as possible
>
> Currently we implemented naive approach where version is represented as two
> long values (16 bytes total):
> 1) Coordinator version (a kind of mix of a timestamp and topology version)
> 2) Local coordinator counter.
>
> Other vendors typically assign 4 byte (long) or 8 byte (long) versions,
> which is definitely better as they are faster to compare and require less
> space, what could be especially important for small values, such as
> indexes.
>
> I thought a bit, and came to conclusion that we could fit into 64 bit
> version (8 bytes) as follows:
> [0..47] - local coordinator counter; this gives capacity 10^14, which
> should be enough for any sensible workload
> [48..60] - coordinator version, which is incremented on every coordinator
> change (relatively rare) or local counter wraparound (very rare); this
> gives 8192 versions
> [61] - coordinator epoch
> [62] - delete marker, set on entry removal, cleared on resurrection (e.g.
> TX rollback, or index change A->B->A)
> [63] - freeze marker
>
> Mechanics:
> 1) Every coordinator change increments coordinator version and share it via
> discovery
> 2) If persistence is enabled, coordinator version is persisted and possibly
> WAL logged
> 3) Epoch is switched between 0 and 1 on every coordinator version
> wraparound (extremely rare event)
> 4) Epoch is cluster-wide properly, every node knows current epoch
> 5) As epoch switches, nodes start marking entries from previous epoch as
> "frozen". This is done in a background in a very light-weight mode, as
> there is no rush
> 6) Epoch 0 could not be switched to 1 if there are non-frozen entries from
> previous epoch 1, and vice versa. This way version comparison is always
> possible.
>
> IMO this should do the trick so that we have 8 byte version with virtually
> zero overhead in runtime. But a little distributed magic would be needed.
>
> Thoughts?
>
> P.S.: Idea is borrowed from PostgreSQL counter wraparound handling.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-3478
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message