flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Till Rohrmann <trohrm...@apache.org>
Subject Re: Checkpoint was declined (tasks not ready)
Date Wed, 25 Oct 2017 09:07:34 GMT
Hi Bartek,

I think your explanation of the problem is correct. Thanks a lot for your
investigation.

What we could do to solve the problem is the following:

Either) We start the emitter thread before we restore the elements in the
open method. That way the open method won't block forever but only until
the first element has been emitted downstream.

or) Don't accept a pendingStreamElementQueueEntry by waiting in the
processElement function until we have capacity left again in the queue.

What do you think?

Do you want to contribute the fix for this problem?

Cheers,
Till

On Mon, Oct 23, 2017 at 4:30 PM, bartektartanus <bartektartanus@gmail.com>
wrote:

> Ok, looks like we've found the cause of this issue. The scenario looks like
> this:
> 1. The queue is full (let's assume that its capacity is N elements)
> 2. There is some pending element waiting, so the
> pendingStreamElementQueueEntry field in AsyncWaitOperator is not null and
> while-loop in addAsyncBufferEntry method is trying to add this element to
> the queue (but element is not added because queue is full)
> 3. Now the snapshot is taken - the whole queue of N elements is being
> written into the ListState in snapshotState method and also (what is more
> important) this pendingStreamElementQueueEntry is written to this list too.
> 4. The process is being restarted, so it tries to recover all the elements
> and put them again into the queue, but the list of recovered elements hold
> N+1 element and our queue capacity is only N. Process is not started yet,
> so
> it can not process any element and this one element is waiting endlessly.
> But it's never added and the process will never process anything. Deadlock.
> 5. Trigger is fired and indeed discarded because the process is not running
> yet.
>
> If something is unclear in my description - please let me know. We will
> also
> try to reproduce this bug in some unit test and then report Jira issue.
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.
> n4.nabble.com/
>

Mime
View raw message