flink-user mailing list archives

From Shannon Carey <sca...@expedia.com>
Subject Yarn terminating TM for pmem limit cascades causing all jobs to fail
Date Tue, 18 Apr 2017 22:26:21 GMT
I'm on Flink 1.1.4. We had yet another occurrence of Yarn killing a TM due to exceeding pmem
limits and all jobs failing as a result. I thought I had successfully disabled that check,
but apparently the property doesn't work as expected in EMR.
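In case it helps anyone reproduce what I tried: disabling that check normally means setting yarn.nodemanager.pmem-check-enabled to false on every NodeManager, and on EMR that is usually supplied through a yarn-site configuration classification at cluster creation. A sketch, assuming the standard EMR mechanism (property names from the Hadoop yarn-site defaults):

```
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.pmem-check-enabled": "false",
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
```

The vmem property is included only because the two checks are often toggled together; it is not the one that killed our TMs.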

From what I can tell in the logs, it looks like after the first TM was killed by Yarn, the
jobs failed and were retried. However, when they were retried they caused increased pmem load
on yet another TM, which resulted in Yarn killing another TM. That caused the jobs to fail
again. This happened 5 times, until our job retry policy gave up and allowed the jobs to fail
permanently. Obviously, this situation is very problematic because it results in the loss
of all job state, plus it requires manual intervention to start the jobs again.
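For clarity, by "job retry policy" I mean Flink's fixed-delay restart strategy, set in flink-conf.yaml roughly like this (the attempt and delay values here are illustrative, not our exact settings):

```
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5
restart-strategy.fixed-delay.delay: 10 s
```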

The job retries eventually fail due to, "Could not restart the job ... The slot in which the
task was executed has been released. Probably loss of TaskManager" or due to "Could not restart
the job … Connection unexpectedly closed by remote task manager … This might indicate
that the remote task manager was lost." Those are only the final failure causes: Flink does
not appear to log the cause of intermediate restart failures.

I assume that the messages logged from the JobManager about "Association with remote system
… has failed, address is now gated for [5000] ms. Reason is: [Disassociated]." are due to
the TM failing, and are expected/harmless?

It seems like disabling the pmem check will fix this problem, but I am wondering if this is
related: https://flink.apache.org/faq.html#the-slot-allocated-for-my-task-manager-has-been-released-what-should-i-do
? I don't see any log messages about quarantined TMs…

Do you think that increasing the # of job retries, so that the jobs don't fail until all TMs
are replaced with fresh ones, would fix this issue? The "memory.percent-free" metric from Collectd
did go down to 2-3% on the TMs before they failed, and shot back up to 30-40% on TM restart
(though I'm not sure how much of that had to do with the loss of state).  So, memory usage
may be a real problem, but we don't get an OOM exception so I'm not sure we can control this
from the JVM perspective. Are there other memory adjustments we should make which would allow
our TMs to run for long periods of time without having this problem? Is there perhaps a memory
leak in RocksDB?
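For concreteness, the only knobs I'm aware of for this in 1.1.x are the Yarn heap cutoff settings, which withhold part of the container's memory from the JVM heap so that native allocations (like RocksDB's block cache) have room. A sketch, with the key names as I understand them from the 1.1 docs and illustrative values:

```
# Fraction of container memory withheld from the JVM heap
# and left for native/off-heap allocations (default is 0.25)
yarn.heap-cutoff-ratio: 0.4

# Minimum cutoff in MB, applied regardless of the ratio
yarn.heap-cutoff-min: 600
```

Raising the cutoff would shrink our heap, so this only helps if the pmem overrun really is coming from native memory rather than the JVM.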

Thanks for any help you can provide,