ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: MVCC version format
Date Fri, 15 Dec 2017 12:28:44 GMT
Hi Alex,

Good points.

1) So far we've seen clusters up to ~2K nodes in size, and the typical size
is much smaller, so this doesn't appear to be a big problem in the common
case. But what if at some point we see, say, a 10_000-node deployment? In
that case the proposed schema may virtually stop working at all :-) On the
other hand, frequent wraparound typically means that not many changes
happened in between, so we should be able to cope with the freeze fast
enough. Also, big deployments would have to take coordinator location into
account anyway, because on coordinator change we cannot start assigning
versions immediately: we need to collect the full list of active
transactions from other nodes first, which is painful on large topologies.
I.e. our current approach would require careful administration irrespective
of whether we take the proposed approach or not.
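To make the concern concrete, here is a back-of-envelope check on how often the 13-bit coordinator version could wrap under Alexey's N/2-switches-per-rolling-restart estimate. The class and method names, and the 10_000-node figure, are illustrative only, not Ignite code:

```java
// Back-of-envelope: with a 13-bit coordinator version there are 8192 values
// before wraparound forces an epoch switch. Per the estimate below, a rolling
// restart of an N-node cluster with a non-dedicated coordinator causes ~N/2
// coordinator switches.
public class WraparoundEstimate {
    static final long COORDINATOR_VERSIONS = 1L << 13; // bits [48..60] => 8192

    // Expected coordinator switches during one rolling restart of an N-node cluster.
    static long switchesPerRollingRestart(int clusterSize) {
        return clusterSize / 2;
    }

    // Rolling restarts until the coordinator version wraps and the epoch must flip.
    static double restartsUntilWrap(int clusterSize) {
        return (double) COORDINATOR_VERSIONS / switchesPerRollingRestart(clusterSize);
    }

    public static void main(String[] args) {
        // For a hypothetical 10_000-node deployment the epoch would flip
        // roughly every other full rolling restart.
        System.out.println(restartsUntilWrap(10_000)); // 1.6384
    }
}
```

So at 10_000 nodes an epoch switch is no longer a rare event, which is exactly why the freeze has to be cheap.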

2) I think we could simply mark this partition as "definitely frozen",
because when an offline partition comes back, any active transaction would
definitely see its data. Would that work?

3) This depends heavily on the final implementation. We should do it
constantly in the background and track progress at the per-partition and
possibly per-block (visibility map?) level. Given large time frames, we
should be able to cope with it.
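For illustration, the 8-byte layout proposed in the quoted message below could be packed and unpacked along these lines. The class and method names are my own sketch, not Ignite internals:

```java
// Sketch of the proposed 64-bit MVCC version:
// [0..47] local coordinator counter, [48..60] coordinator version,
// [61] epoch, [62] delete marker, [63] freeze marker.
public class MvccVersion {
    static final long CNTR_MASK = (1L << 48) - 1; // 48-bit local counter
    static final long CRD_MASK  = (1L << 13) - 1; // 13-bit coordinator version

    static long pack(long cntr, long crdVer, int epoch, boolean deleted, boolean frozen) {
        long v = cntr & CNTR_MASK;
        v |= (crdVer & CRD_MASK) << 48;
        v |= ((long) epoch & 1L) << 61;
        if (deleted) v |= 1L << 62;
        if (frozen)  v |= 1L << 63;
        return v;
    }

    static long counter(long v)    { return v & CNTR_MASK; }
    static long crdVersion(long v) { return (v >>> 48) & CRD_MASK; }
    static int  epoch(long v)      { return (int) ((v >>> 61) & 1L); }
    static boolean deleted(long v) { return ((v >>> 62) & 1L) != 0L; }
    static boolean frozen(long v)  { return (v >>> 63) != 0L; }
}
```

Note that with bit 63 in use, comparisons would need the unsigned form (`Long.compareUnsigned`), since a set freeze bit makes the raw `long` negative.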

On Fri, Dec 15, 2017 at 2:37 PM, Alexey Goncharuk <
alexey.goncharuk@gmail.com> wrote:

> Vladimir,
>
> Although at the first glance this seems to be ok for a local solution, I
> have a few concerns wrt distributed version behavior:
>
> 1) I cannot say that an epoch switch in this approach will be an extremely
> rare event. Let's say a user regularly deploys a new version of application
> code to a large Ignite cluster and restarts Ignite nodes for such an
> upgrade. If the version coordinator is not dedicated, then on average a
> version coordinator switch will occur N/2 times (N is the size of the
> cluster); in the worst case the switch will happen N times. At this rate,
> an epoch switch may occur maybe once a month, depending on cluster size?
> The user will need to take care of the node restart order.
> 2) What happens if a node went offline, then the epoch switched twice, then
> the node comes back to the cluster (a classic ABA problem)? The only
> solution I see here is that we track epoch switches and forever drop nodes
> that were offline during an epoch switch. This may cause problems if a user
> has partition loss policy = READ_WRITE_SAFE. Let's say a partition was
> offline, but updates to other partitions were allowed. If an epoch switch
> happens at this moment, the partition is lost forever.
> 3) How fast will we be able to flip the frozen bit for the whole dataset if
> we are looking at terabytes per node? Given (1), we may be in trouble here.
>
> 2017-12-15 12:57 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
>
> > Igniters,
> >
> > As you probably know, we are working on the MVCC feature [1]. When MVCC
> > is enabled, every cache operation/tx requires one unique version for a
> > read and possibly one version for a write. Versions are assigned by a
> > single node called the *coordinator*.
> >
> > Requirements for the version:
> > 1) Must be increasing and comparable, so that we understand which
> > operation occurred earlier.
> > 2) Must support coordinator failover.
> > 3) Must be as compact as possible.
> >
> > Currently we have implemented a naive approach where the version is
> > represented as two long values (16 bytes total):
> > 1) Coordinator version (a kind of mix of a timestamp and topology
> > version)
> > 2) Local coordinator counter.
> >
> > Other vendors typically assign 4-byte (int) or 8-byte (long) versions,
> > which is definitely better as they are faster to compare and require
> > less space, which could be especially important for small values, such
> > as indexes.
> >
> > I thought a bit and came to the conclusion that we could fit into a
> > 64-bit version (8 bytes) as follows:
> > [0..47] - local coordinator counter; this gives a capacity of ~2.8 *
> > 10^14, which should be enough for any sensible workload
> > [48..60] - coordinator version, which is incremented on every
> > coordinator change (relatively rare) or local counter wraparound (very
> > rare); this gives 8192 versions
> > [61] - coordinator epoch
> > [62] - delete marker, set on entry removal, cleared on resurrection
> > (e.g. TX rollback, or index change A->B->A)
> > [63] - freeze marker
> >
> > Mechanics:
> > 1) Every coordinator change increments the coordinator version and
> > shares it via discovery
> > 2) If persistence is enabled, the coordinator version is persisted and
> > possibly WAL logged
> > 3) The epoch is switched between 0 and 1 on every coordinator version
> > wraparound (an extremely rare event)
> > 4) The epoch is a cluster-wide property; every node knows the current
> > epoch
> > 5) As the epoch switches, nodes start marking entries from the previous
> > epoch as "frozen". This is done in the background in a very lightweight
> > mode, as there is no rush
> > 6) Epoch 0 cannot be switched to 1 if there are non-frozen entries from
> > the previous epoch 1, and vice versa. This way version comparison is
> > always possible.
> >
> > IMO this should do the trick so that we have an 8-byte version with
> > virtually zero runtime overhead. But a little distributed magic would
> > be needed.
> >
> > Thoughts?
> >
> > P.S.: Idea is borrowed from PostgreSQL counter wraparound handling.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-3478
> >
>
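Stepping back to steps 5 and 6 of the quoted mechanics: the epoch switch is gated on the background freeze having finished. A toy single-node model of that gating (real tracking would be per-partition and distributed; all names here are mine):

```java
// Toy model of steps 5-6: the epoch may only flip once every entry from the
// previous occurrence of the target epoch has been frozen.
public class EpochSwitcher {
    private int epoch;               // 0 or 1, a cluster-wide property in reality
    private long nonFrozenPrevEpoch; // entries still carrying the previous epoch

    EpochSwitcher(long nonFrozenPrevEpoch) {
        this.nonFrozenPrevEpoch = nonFrozenPrevEpoch;
    }

    // Step 6: refuse the switch while old entries remain, so that any two
    // versions in the cluster are always comparable.
    synchronized boolean tryFlipEpoch() {
        if (nonFrozenPrevEpoch > 0)
            return false;
        epoch ^= 1; // 0 -> 1 or 1 -> 0
        return true;
    }

    // Step 5: background freezing drains the outstanding count, no rush.
    synchronized void onEntryFrozen() {
        if (nonFrozenPrevEpoch > 0)
            nonFrozenPrevEpoch--;
    }

    synchronized int epoch() {
        return epoch;
    }
}
```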
