ignite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Павлухин Иван <vololo...@gmail.com>
Subject Re: Suggestion to improve deadlock detection
Date Tue, 18 Dec 2018 14:57:25 GMT
Hi folks,

During implementation of edge-chasing deadlock detection algorithm in
scope of [1] it has been realized that basically we have 2 options for
"chasing" strategy. I will use terms Near when GridNearTxLocal is
assumed and Dht when GridDhtTxLocal (tx which updates primary
partition). So, here is 2 mentioned strategies:
1. Send initial probe when Dht tx blocks waiting for another tx to
release a key. Initial probe is sent from waiting Dht tx to Near tx
holding a lock. If receiving Near tx is blocked as well then it relays
the probe to Dht tx it awaits response from. So, the probe is
traveling like Dht -> Near -> Dht -> Near -> ... Let's call the
approach "forward-only".
2. Send probes only between Near transactions. This approach requires
additional request-response call which Near tx issues to Dht node to
check if tx sending a request is waiting for another tx. The call
returns identifier of a Near tx blocking tx sending a request (if
sending tx is in fact blocked). Then waiting Near tx relays a probe to
blocked Near tx. Let's call that approach "lock-checking".

I would like to define several points to consider while comparing
approaches (please point out more if I miss something):
1. Correctness. Especially regarding reporting false deadlocks.
2. Extensibility. Rollback to savepoint and generalization for classic
transactions should be considered.
3. Messaging overhead.
4. Simplicity of implementation.

You can find an implementation of "lock-checking" approach in PR
attached to the ticket [1]. Currently it works correctly only for
transactions obeying strict two-phase locking (2PL). It is fairly easy
to adopt PR to implement "forward-only" approach. Which will work
correct in conditions of 2PL. "lock-checking" algorithm uses 1.5 times
more messages than "forward-only". Implementation of "forward-only"
algorithm is simpler in a sense that it does not require introducing
lock-wait-check messages and future. But as for me it does not look as
big deal. I think that implementation efforts for both approaches are
almost equal so far.

But corner stone here is an extensibility. Let's imagine that we are
going to implement rollback to savepoint. One suggested approach to
extend "forward-only" approach for that case is invalidating sent
probe. Basically, a tx initiated deadlock check assigns unique id to a
probe. And this probe id is invalidated when the tx wakes up from
wait. If that tx receives back a probe with invalidated id it simply
discards it. If id is not invalidated it means a deadlock. But false
deadlock still can be detected. I will provide an example when it does
not work correctly.

Please note, that our transactions can request multiple locks
simultaneously (so-called AND model). Also rollback to savepoint
releases locks obtained after creating a savepoint. Let's use
following notation. "*" means savepoint. "^" means rollback to
previous savepoint. "!" means acquiring a lock. "?" means waiting for
a lock. "&" means requesting multiple locks simultaneously. See the
following transaction execution scenario for 3 transactions and 3
keys:
TA        | TB     |   TC
            |          |   *
            |  1!      |  2!
1? & 3! |           |
            |  2?     |
            |          |  ^
            |          |  3!

We can notice that suggested scenario does not introduce deadlock
because locks are requested in consistent order among all
transactions. But in different moments there exist waiting relations
TA w TB, TB w TC, TC w TA. So, we suspect a false deadlock could be
detected here. Also one can notice that a relation TC w TA is
established after TB w TC is destroyed. Messages on a timeline diagram
demonstrates how "forward-only with probe invalidation" approach can
detect a false deadlock here [2]. On the picture L means "lock", U
"unlock", W "wait". Red cross means reporting a false deadlock.
Intuition here is that while probe is in flight one wait-for edge can
be destroyed (TB w TC) and another one can be created (TC w TA). So,
the approach can report false deadlocks. Also, note that a false
deadlock can be found even when locking order is consistent!

After a while I will describe how "lock-checking" can be adopted to
support rollback to savepoint. Disclaimer here is that it will require
even more messages.
----

I think that we can already start a discussion.
Actually, while false deadlocks can be surprising we perhaps can
tolerate it because our transaction implementation assumes that
transactions can be rolled back by system. "forward-only" approach
requires lesser messages but in big-O terms both approaches have the
same complexity. Both correctness and complexity impact unfortunately
have hardly predictable impact for real workflows. In ideal world
thorough testing could give the answers.

Please, share your vision.

[1] https://issues.apache.org/jira/browse/IGNITE-9322
[2] https://gist.githubusercontent.com/pavlukhin/c8c7c6266eeab56048c31f5cdfb31d20/raw/7d1aef9abcd014ec9fdf69840168768ced638b6c/msg_diagram.jpg
пн, 26 нояб. 2018 г. в 15:27, Павлухин Иван <vololo100@gmail.com>:
>
> Vladimir,
>
> I think it might work. So, if nobody minds I can start prototyping
> edge-chasing approach.
> пн, 26 нояб. 2018 г. в 14:32, Vladimir Ozerov <vozerov@gridgain.com>:
> >
> > Ivan,
> >
> > The problem is that in our system a transaction may wait for N locks
> > simultaneously. This may form complex graphs which spread between many
> > nodes. Now consider that I have a deadlock between 4 nodes: A -> B -> *C*
> > -> D -> A. I've sent a message from a and never reached D because C failed.
> > Well, may be that is good, because some transactions will be rolled back.
> > But when they are rolled back, another cycles from pending multi-way
> > deadlocks may form again. E.g. A -> B -> *E* -> D -> A. So the question
is
> > whether we will be able to detect them reliably. I think that we may use
> > the following assumption:
> > 1) If data node fails, relevant transactions will be rolled-back
> > 2) It means that some other transactions will make at least partial
> > progress with locks acquisition
> >
> > So, we can introduce a kind of counter which will be advanced on every
> > acquired lock on a given node. And define a rule that transaction deadlock
> > request for the given pair (NODE_ID, COUNTER) should be requested at most
> > once. This way, even if specific deadlock check request is lost due to node
> > failure, and another deadlock has formed, then some other node will
> > re-trigger deadlock change request sooner or later, as it's counter
> > advanced.
> >
> > Makes sense?
> >
> > On Sat, Nov 24, 2018 at 5:40 PM Павлухин Иван <vololo100@gmail.com>
wrote:
> >
> > > Hi Vladimir,
> > >
> > > Regarding fault tolerance. It seems that it is not a problem for
> > > edge-chasing approaches. A found deadlock is identified by a message
> > > returned to a detection initiator with initiator's identifier. If
> > > there is no deadlock then such message will not come. If some node
> > > containing a deadlocked transaction fails then it will break the
> > > deadlock. Am I missing something?
> > >
> > > About messaging overhead. Indeed, it looks like edge-chasing
> > > approaches can bring redundant messages. Perhaps, we can borrow some
> > > ideas about optimization from [1]. And I also think about a timeout
> > > before starting a deadlock detection.
> > >
> > > Thoughts about adaptive timeouts lead me to some thoughts about
> > > monitoring. I assume that long waiting for locks signals about not
> > > optimal performance of the system. I think it would be great to have
> > > means of monitoring transactions contention. It could help users to
> > > improve theirs workloads. It also could help us to build some
> > > reasoning how contention correlates with deadlock detection overhead.
> > >
> > > [1] http://mentallandscape.com/Papers_podc84.pdf
> > > вс, 18 нояб. 2018 г. в 10:52, Vladimir Ozerov <vozerov@gridgain.com>:
> > > >
> > > > Hi Ivan,
> > > >
> > > > Great analysis. Agree that edge-chasing looks like better candidate.
> > > First,
> > > > it will be applicable to both normal and MVCC transactions. Second, in
> > > MVCC
> > > > we probably will also need to release some locks when doing rollbacks.
> > > What
> > > > we should think about is failover - what if a node which was in the
> > > middle
> > > > of a graph fails? We need to craft fault-tolerance carefully.
> > > >
> > > > Another critically important point is how to trigger deadlock detection.
> > > My
> > > > concerns about edge-chasing was not about latency, but about a number
of
> > > > messages which travels between nodes. Good algorithm must produce little
> > > to
> > > > no messages in case of normal contenion while still providing protection
> > > in
> > > > case of real deadlocks. So how would we trigger fraph traversal for the
> > > > given transaction? May be we can start with hard timeout and then employ
> > > a
> > > > kind of adaptive increment in case high rate of false-positives are
> > > > observed. May be something else.
> > > >
> > > > On Sun, Nov 18, 2018 at 10:21 AM Павлухин Иван <vololo100@gmail.com>
> > > wrote:
> > > >
> > > > > Vladimir,
> > > > >
> > > > > Thanks for the articles! I studied them and a couple of others. And
I
> > > > > would like to share a knowledge I found.
> > > > >
> > > > > BACKGROUND
> > > > > First of all our algorithm implemented in
> > > > >
> > > > >
> > > org.apache.ignite.internal.processors.cache.transactions.TxDeadlockDetection
> > > > > is not an edge-chasing algorithm. In essence a lock-waiting
> > > > > transaction site polls nodes responsible for keys of interest one
by
> > > > > one and reconstructs global wait-for graph (WFG) locally.
> > > > > A centralized algorithm discussed in this thread looks similar to
> > > > > algorithms proposed by Ho [1]. The simplest of them (two-phased)
> > > > > reports false deadlocks when unlock before transaction finish is
> > > > > permitted. So, it seems that it only works when strict two-phase
> > > > > locking (2PL) is obeyed. Another one (called one-phased) requires
> > > > > tracking all locked keys by each transaction which is not desirable
> > > > > for MVCC transactions.
> > > > > Aforementioned edge-chasing algorithm by Chandy, Misra and Haas [2]
is
> > > > > proven to work even when 2PL is not obeyed.
> > > > > Also performance target is not clear for me. It looks like centralized
> > > > > approaches can use fewer messages and provide lower latency that
> > > > > distributed ones. But would like to understand what are the orders
of
> > > > > latency and messaging overhead. Many of algorithms are described
in
> > > > > [3] and some performance details are mentioned. It is said "As per
> > > > > [GRAY81], most deadlocks involve two to four transactions." I see
one
> > > > > more argument making deadlock detection algorithm latency not so
> > > > > important. We can consider deadlock as unavoidable situation, but
even
> > > > > lightning fast detector will not save a database from performance
> > > > > degradation in case of tons of deadlocks.
> > > > > What is also interesting, Google Cloud Spanner employs simple
> > > > > "wound-wait" deadlock prevention [4] instead of any detection
> > > > > algorithm.
> > > > >
> > > > > DISCUSSION
> > > > > So, I see the following options:
> > > > > 1. Edge-chasing algorithm like Chandy, Misra and Haas [2].
> > > > > 2. Centralized two-phased algorithm described by Ho [1] with
> > > > > restriction that transactions must obey 2PL.
> > > > > Personally, I tend to edge-chasing approach because it looks simple
> > > > > and universal. Restricting ourselves with 2PL will restrict us in
> > > > > features which could be implemented in future (e.g. transaction
> > > > > savepoints), will not it?
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Ivan
> > > > >
> > > > > [1] https://turing.cs.hbg.psu.edu/comp512.papers/HoRamamoorthy-82.pdf
> > > > > [2]
> > > > >
> > > https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > [3] https://www.ics.uci.edu/~cs237/reading/p37-elmagarmid.pdf
> > > > > [4]
> > > > >
> > > https://cloud.google.com/spanner/docs/transactions#rw_transaction_performance
> > > > >
> > > > > ср, 14 нояб. 2018 г. в 22:13, Vladimir Ozerov <vozerov@gridgain.com>:
> > > > > >
> > > > > > Ivan,
> > > > > >
> > > > > > This is interesting question. I think we should spend some time
for
> > > > > formal
> > > > > > verification whether this algorithm works or not. Several articles
> > > you
> > > > > may
> > > > > > use as a startitng point: [1], [2]. From what I understand,
Ignite
> > > fall
> > > > > > into "AND" model, and currently implemented algorithm is a variation
> > > of
> > > > > > "edge-chasing" approach as per Chandy, Misra and Haas [3], which
is
> > > > > *proven
> > > > > > to be correct* in that it both detects deadlocks when they are
> > > present,
> > > > > and
> > > > > > do not produce false positives. But is might be too heavy for
the
> > > system
> > > > > > under contention.
> > > > > >
> > > > > > We need to search for any formal proof of correctness of proposed
> > > > > > algorithm. This area is already researched throughly enough,
so we
> > > should
> > > > > > be able to find an answer quickly.
> > > > > >
> > > > > > Vladimir.
> > > > > >
> > > > > > [1] http://www.cse.scu.edu/~jholliday/dd_9_16.htm
> > > > > > [2] https://www.cs.uic.edu/~ajayk/Chapter10.pdf
> > > > > > [3]
> > > > > >
> > > > >
> > > https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > >
> > > > > > On Wed, Nov 14, 2018 at 6:55 PM Павлухин Иван <vololo100@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Next part as promised. A working item for me is a deadlock
detector
> > > > > > > for MVCC transactions [1]. The message is structured in
2 parts.
> > > First
> > > > > > > is an analysis of the current state of affairs and possible
> > > options to
> > > > > > > go. Second is a proposed option. First part is going to
be not so
> > > > > > > short so some might prefer to skip it.
> > > > > > >
> > > > > > > ANALYSIS
> > > > > > > The immediate question is "why we cannot use an existing
deadlock
> > > > > > > detector?". The differences between classic and MVCC transactions
> > > > > > > implementation is the answer. Currently a collection of
> > > IgniteTxEntry
> > > > > > > is used for detection. But such collection is not maintained
for
> > > MVCC
> > > > > > > transactions. So, it will not work out of box.
> > > > > > > Also it looks like that current distributed iterative approach
> > > cannot
> > > > > > > be low latency it the worst case because of doing possibly
many
> > > > > > > network requests sequentially.
> > > > > > > So, what options do we have? Generally we should choose
between
> > > > > > > centralized and distributed approaches. By centralized
approach I
> > > mean
> > > > > > > existence of a dedicated deadlock detector located on a
single
> > > node.
> > > > > > > In the centralized approach we can face difficulties related
to
> > > > > > > failover as a node running deadlock detector can fail.
In the
> > > > > > > distributed approach extra network messaging overhead can
strike
> > > > > > > because different nodes participating in a deadlock can
start
> > > > > > > detection independently and send redundant messages. I
see some
> > > > > > > aspects which make sense for choosing implementation. Here
they are
> > > > > > > with an approach that is better (roughly speaking) in parentheses:
> > > > > > > * Detection latency (centralized).
> > > > > > > * Messaging overhead (centralized).
> > > > > > > * Failover (distributed).
> > > > > > > And also having a park of deadlock detectors sounds not
very good.
> > > I
> > > > > > > hope that it is possible to develop a common solution suitable
for
> > > > > > > both kinds of transactions. I suggest to pilot new solution
with
> > > MVCC
> > > > > > > and then adopt it for classic transactions.
> > > > > > >
> > > > > > > PROPOSAL
> > > > > > > Actually I propose to start with an centralized algorithm
> > > described by
> > > > > > > Vladimir in the beginning of the thread. I will try to
outline main
> > > > > > > points of it.
> > > > > > > 1. Single deadlock detector exists in the cluster which
maintains
> > > > > > > transaction wait-for graph (WFG).
> > > > > > > 2. Each cluster node sends and invalidates wait-for edges
to the
> > > > > detector.
> > > > > > > 3. The detector periodically searches cycles in WFG and
chooses and
> > > > > > > aborts a victim transaction if cycle is found.
> > > > > > >
> > > > > > > Currently I have one fundamental question. Is there a possibility
> > > of
> > > > > > > false detected deadlocks because of concurrent WFG updates?
> > > > > > > Of course there are many points of improvements and optimizations.
> > > But
> > > > > > > I would like to start from discussing key points.
> > > > > > >
> > > > > > > Please share your thoughts!
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > > > > ср, 14 нояб. 2018 г. в 15:47, ipavlukhin <vololo100@gmail.com>:
> > > > > > > >
> > > > > > > > Hi Igniters,
> > > > > > > >
> > > > > > > > I would like to resume the discussion about a deadlock
detector.
> > > I
> > > > > start
> > > > > > > > with a motivation for a further work on a subject.
As I see
> > > current
> > > > > > > > implementation (entry point IgniteTxManager.detectDeadlock)
> > > starts a
> > > > > > > > detection only after a transaction was timed out.
In my mind it
> > > is
> > > > > not
> > > > > > > > very good from a product usability standpoint. As
you know, in a
> > > > > > > > situation of deadlock some keys become non-usable
for an infinite
> > > > > amount
> > > > > > > > of time. Currently the only way to work around it
is configuring
> > > a
> > > > > > > > timeout, but it could be rather tricky in practice
to choose a
> > > > > > > > proper/universal value for it. So, I see the main
point as:
> > > > > > > >
> > > > > > > > Ability to break deadlocks without a need to configure
timeouts
> > > > > > > explicitly.
> > > > > > > >
> > > > > > > > I will return soon with some thoughts about implementation.
> > > > > Meanwhile,
> > > > > > > > does anybody have in mind any other usability points
which I am
> > > > > missing?
> > > > > > > > Or is there any alternative approaches?
> > > > > > > >
> > > > > > > > On 2017/11/21 08:32:02, Dmitriy Setrakyan <d...@apache.org>
> > > wrote:
> > > > > > > >  > On Mon, Nov 20, 2017 at 10:15 PM, Vladimir Ozerov
<
> > > > > vo...@gridgain.com
> > > > > > > >>
> > > > > > > >  > wrote:>
> > > > > > > >  >
> > > > > > > >  > > It doesn’t need all txes. Instead, other
nodes will send
> > > info
> > > > > about>
> > > > > > > >  > > suspicious txes to it from time to time.>
> > > > > > > >  > >>
> > > > > > > >  >
> > > > > > > >  > I see your point, I think it might work.>
> > > > > > > >  >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Ivan Pavlukhin
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> > >
>
>
>
> --
> Best regards,
> Ivan Pavlukhin



-- 
Best regards,
Ivan Pavlukhin

Mime
View raw message