ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: New definition for affinity node (issues with baseline)
Date Tue, 24 Apr 2018 20:13:40 GMT
Right, as far as I understand, we are not arguing about whether BLT is needed
or not. The main questions are how to properly deliver this feature to
users and how to deal with co-location issues between persistent and
non-persistent caches. It looks like change policies are the way to go for the
first question.

As for co-location, it is important to note that a different affinity
distribution for in-memory and persistent caches automatically means that
we lose SQL joins and predictable behavior of any affinity-based
operations. It means that if we calculated the same affinity for persistent
and in-memory caches at some point, we cannot re-distribute in-memory
caches differently if some nodes go down without breaking co-located
computations, am I right?
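
To illustrate the co-location point, here is a minimal Java sketch. It does not use Ignite's API: `primary` is a toy rendezvous-style assignment and `mix` is a hypothetical hash mixer, both assumptions made for illustration only. It shows that two caches assigned over the same node set always agree on the primary node of a partition, while assigning one of them over a reduced node set changes the owner of every partition that lived on the missing node.

```java
import java.util.List;

public class CoLocationSketch {
    /** Hypothetical bit mixer (murmur3-style finalizer) so node hashes look random per partition. */
    static int mix(int x) {
        x ^= x >>> 16;
        x *= 0x85ebca6b;
        x ^= x >>> 13;
        x *= 0xc2b2ae35;
        x ^= x >>> 16;
        return x;
    }

    /** Toy rendezvous assignment: the node with the highest mixed hash owns the partition. */
    static String primary(int partition, List<String> nodes) {
        String best = null;
        int bestHash = 0;
        for (String node : nodes) {
            int h = mix(31 * partition + node.hashCode());
            if (best == null || h > bestHash) {
                best = node;
                bestHash = h;
            }
        }
        return best;
    }

    /** Counts partitions whose primary node differs between the two node sets. */
    static int divergentPartitions(int partitions, List<String> a, List<String> b) {
        int diff = 0;
        for (int p = 0; p < partitions; p++) {
            if (!primary(p, a).equals(primary(p, b)))
                diff++;
        }
        return diff;
    }

    public static void main(String[] args) {
        List<String> baseline = List.of("A", "B", "C"); // persistent cache keeps using the baseline
        List<String> alive = List.of("A", "B");         // in-memory cache re-distributed after C left

        // Identical node sets: always 0, the caches stay co-located.
        System.out.println(divergentPartitions(1024, baseline, baseline));
        // Different node sets: every partition that C owned now has a different primary.
        System.out.println(divergentPartitions(1024, baseline, alive));
    }
}
```

Any partition counted by the second call is one where a join or a co-located compute task would find the in-memory data and the persistent data on different nodes.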

On Tue, Apr 24, 2018 at 10:19 PM, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:

> Well, this means that the concept of baseline is still needed because we
> must not reassign partitions immediately (note that this is not identical
> to rebalance delay!). The approach you describe is identical to baseline
> change policies and I have nothing against this; their implementation was
> planned for phase II of baseline changes.
>
> 2018-04-24 21:31 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
>
> > Alex,
> >
> > CockroachDB is based on RAFT and is able to repair itself automatically [1]
> > [2]. Their approach looks reasonable to me and is pretty much similar to
> > MongoDB and Cassandra. In short, you distinguish between short-term and
> > long-term failures.
> > 1) First, you wait for a small time window in the hope that it was a network
> > glitch or restart. Even if this was a segmentation, with a true consensus
> > algorithm this is not an issue - your partitions or the whole cluster are
> > unavailable during this window.
> > 2) Then, if the majority is still there and the cluster is operational, you
> > trigger automatic rebalance.
> > 3) Last, if you need fine-grained control you can tune or disable
> > auto-rebalance and do some manual magic.
> >
> > This is a very nice approach: it is simple for simple use cases and complex
> > for complex use cases. Ideally, this is how Ignite should work. Want to
> > play and write a hello-world app? Just learn what a cache is. Started
> > developing a moderately complex application? Learn about affinity, cache
> > modes, etc. Going to enterprise scale? Learn about BLAT, activation, etc.
> >
> > It seems that the old behavior without BLAT and even without manual
> > activation would be enough for the majority of our users. At the very least
> > it is enough for Cassandra and MongoDB, which are an order of magnitude
> > more popular.
> >
> > [1]
> > https://www.cockroachlabs.com/docs/stable/frequently-asked-questions.html#how-does-cockroachdb-survive-failures
> > [2]
> > https://www.cockroachlabs.com/docs/stable/training/fault-tolerance-and-automated-repair.html
> >
> > On Tue, Apr 24, 2018 at 7:55 PM, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> >
> > > Vladimir,
> > >
> > > Automatic cluster membership changes may be implemented to grow the
> > > topology, but auto-shrinking the topology is usually not possible because a
> > > process cannot distinguish between a node shutdown and network
> > > partitioning. If we want to deal with split-brain scenarios as a grown-up
> > > system, we should change the replication strategy within partitions to a
> > > consensus algorithm (I really hope we will). None of the consensus
> > > algorithms (at least those known to me - Paxos, Raft, ZAB) do auto cluster
> > > adjustments based on an internally detected process failure. I consider
> > > baseline topology as a step towards this model.
> > >
> > > Addressing your second concern, if a node was down for a short period of
> > > time, we should (and we do) rebalance only deltas, which is faster than
> > > erasing the whole node and moving all data from scratch.
> > >
> > > 2018-04-24 19:42 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
> > >
> > > > Ivan,
> > > >
> > > > This reasoning sounds questionable to me. First, separate logic for
> > > > in-memory and persistent regions means that we lose collocation between
> > > > persistent and non-persistent caches. Second, the “data is still on disk”
> > > > assumption might not be valid if a node has left due to a disk crash, or
> > > > when data is updated on the remaining nodes.
> > > >
> > > > Tue, 24 Apr 2018 at 19:21, Ivan Rakov <ivan.glukos@gmail.com>:
> > > >
> > > > > Stan,
> > > > >
> > > > > I believe it was discussed at the design proposal thread:
> > > > >
> > > > > http://apache-ignite-developers.2346864.n4.nabble.com/Cluster-auto-activation-design-proposal-td20295.html
> > > > >
> > > > > The short answer: the backup factor decreases if a node leaves. In
> > > > > non-persistent mode we have to rebalance data ASAP - otherwise the last
> > > > > node that owns a partition may fail and data will be lost forever.
> > > > > This is not necessary if data is persisted to disk storage; that's the
> > > > > reason for the Baseline Topology concept.
> > > > >
> > > > > Best Regards,
> > > > > Ivan Rakov
> > > > >
> > > > > On 24.04.2018 18:48, Stanislav Lukyanov wrote:
> > > > > > + for Vladimir's point - adding more complexity may (and likely will)
> > > > > > be even more misleading.
> > > > > >
> > > > > > Can we take a step back and discuss why we need to have different
> > > > > > behavior for persistent and in-memory caches? Can we make in-memory
> > > > > > caches honor baseline instead of special-casing them?
> > > > > >
> > > > > > Thanks,
> > > > > > Stan
> > > > > >
> > > > > >
> > > > > > Tue, 24 Apr 2018, 18:28 Vladimir Ozerov <vozerov@gridgain.com>:
> > > > > >
> > > > > >> Guys,
> > > > > >>
> > > > > > >> As a user I definitely do not want to think about BLATs, SATs, DATs,
> > > > > > >> whatsoever. I want to query data, iterate over data, send compute
> > > > > > >> tasks to data. If a certain node is outside of BLAT and does not have
> > > > > > >> data, then this is not an affinity node. Can we just fix affinity
> > > > > > >> logic to take BLAT into account appropriately?
> > > > > >>
> > > > > > >> On Tue, Apr 24, 2018 at 6:12 PM, Ivan Rakov <ivan.glukos@gmail.com> wrote:
> > > > > >>
> > > > > >>> Eduard,
> > > > > >>>
> > > > > > >>> Can you please summarize the code changes that you are proposing?
> > > > > > >>> I agree that BLT is a bit of a misleading term and DAT/SAT make more
> > > > > > >>> sense. However, establishing a consensus on v2.4 Baseline Topology
> > > > > > >>> terminology took a long time, and it seems like you are going to
> > > > > > >>> cause a bit more perturbation.
> > > > > > >>> I still don't understand what should be changed and how. Please
> > > > > > >>> provide a summary of upcoming class renamings and changes to existing
> > > > > > >>> system parts.
> > > > > >>>
> > > > > >>> Best Regards,
> > > > > >>> Ivan Rakov
> > > > > >>>
> > > > > >>>
> > > > > >>> On 24.04.2018 17:46, Eduard Shangareev wrote:
> > > > > >>>
> > > > > >>>> Hi, Igniters,
> > > > > >>>>
> > > > > > >>>> I want to raise a topic about our affinity node definition.
> > > > > > >>>>
> > > > > > >>>> After adding baseline (affinity) topology (BL(A)T), things started
> > > > > > >>>> getting complicated.
> > > > > >>>>
> > > > > > >>>> Plenty of bugs appear:
> > > > > > >>>>
> > > > > > >>>> IGNITE-8173
> > > > > > >>>> ignite.getOrCreateCache(cacheConfig).iterator() method works incorrect
> > > > > > >>>> for replicated cache in case if some data node isn't in baseline
> > > > > > >>>>
> > > > > > >>>> IGNITE-7628
> > > > > > >>>> SqlQuery hangs indefinitely with additional not registered in baseline
> > > > > > >>>> node.
> > > > > >>>>
> > > > > > >>>> It's because everything relies on the concept of an "affinity node".
> > > > > > >>>> Until now it was as simple as a server node which passes the node
> > > > > > >>>> filter. In other words, any server node which is not filtered out by
> > > > > > >>>> the node filter.
> > > > > > >>>>
> > > > > > >>>> But a node which is not in BL(A)T and which passes the node filter
> > > > > > >>>> would be treated as an affinity node. And that's definitely wrong. At
> > > > > > >>>> least, it is a source of many bugs (I believe there are many more than
> > > > > > >>>> the 2 which I have already mentioned).
> > > > > >>>>
> > > > > > >>>> It's clear that this definition should be changed.
> > > > > > >>>> Let's start with a new definition of "Affinity topology". Affinity
> > > > > > >>>> topology is a set of nodes which potentially could keep data.
> > > > > > >>>>
> > > > > > >>>> If we use knowledge about the current implementation we can say that
> > > > > > >>>> 1. for in-memory cache groups it would be all server nodes;
> > > > > > >>>> 2. for persistent cache groups it would be BL(A)T.
> > > > > >>>>
> > > > > > >>>> I will further use Dynamic Affinity Topology or DAT for point 1
> > > > > > >>>> (in-memory cache groups) and Static Affinity Topology or SAT instead
> > > > > > >>>> of BL(A)T, that is, point 2.
> > > > > > >>>> Denote node filter as f(X), where X is affinity topology.
> > > > > > >>>>
> > > > > > >>>> Then we can say that node A is an affinity node if
> > > > > > >>>> A ∈ AT', where AT' = f(AT), where AT is DAT or SAT.
> > > > > > >>>>
> > > > > > >>>> It is worth mentioning that AT' is what should be passed to the
> > > > > > >>>> affinity function of cache groups.
> > > > > > >>>> Also, AT and AT' could change over time (BL(A)T changes or node
> > > > > > >>>> joins/disconnections).
> > > > > >>>>
> > > > > > >>>> And I don't like the fact that usage of DAT or SAT relies on
> > > > > > >>>> persistence settings (Should we make it configurable per cache group?).
> > > > > > >>>>
> > > > > > >>>> Ok, I have created a ticket to implement these changes and will start
> > > > > > >>>> working on it.
> > > > > > >>>> https://issues.apache.org/jira/browse/IGNITE-8380 (Affinity node
> > > > > > >>>> calculation doesn't take into account BLT).
> > > > > > >>>>
> > > > > > >>>> Also, I want to use these definitions (Affinity Topology, Affinity
> > > > > > >>>> Node, DAT, SAT) in documentation and java docs.
> > > > > > >>>>
> > > > > > >>>> Maybe, we also should consider replacing BL(A)T with SAT.
> > > > > >>>> Thank you for your attention.
> > > > > >>>>
> > > > > >>>>
> > > > >
> > > > >
> > > >
> > >
> >
>
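
The affinity-node definition quoted from Eduard's original message (A ∈ AT', AT' = f(AT), with AT being DAT or SAT) can be sketched as follows. This is a plain-Java illustration with hypothetical names (`affinityTopology`, `applyFilter`, `isAffinityNode`) and string node IDs; it is not Ignite code and not the change made under IGNITE-8380.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

public class AffinityTopologySketch {
    /** AT: all server nodes for in-memory groups (DAT), the baseline for persistent groups (SAT). */
    static Set<String> affinityTopology(boolean persistent, Set<String> serverNodes, Set<String> baseline) {
        return persistent ? baseline : serverNodes;
    }

    /** AT' = f(AT): the affinity topology with the node filter applied. */
    static Set<String> applyFilter(Set<String> topology, Predicate<String> nodeFilter) {
        Set<String> filtered = new LinkedHashSet<>();
        for (String node : topology) {
            if (nodeFilter.test(node))
                filtered.add(node);
        }
        return filtered;
    }

    /** Node A is an affinity node if A is a member of AT'. */
    static boolean isAffinityNode(String node, boolean persistent, Set<String> serverNodes,
                                  Set<String> baseline, Predicate<String> nodeFilter) {
        return applyFilter(affinityTopology(persistent, serverNodes, baseline), nodeFilter).contains(node);
    }

    public static void main(String[] args) {
        Set<String> servers = Set.of("A", "B", "C");
        Set<String> baseline = Set.of("A", "B");   // C joined later and is outside the baseline
        Predicate<String> acceptAll = node -> true; // hypothetical node filter

        // C passes the node filter, but is an affinity node only for the in-memory group.
        System.out.println(isAffinityNode("C", true, servers, baseline, acceptAll));
        System.out.println(isAffinityNode("C", false, servers, baseline, acceptAll));
    }
}
```

Under the old definition (server node that passes the node filter), C would count as an affinity node in both cases, which is the mismatch the thread describes.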
