flink-user mailing list archives

From Shannon Carey <sca...@expedia.com>
Subject Rapidly failing job eventually causes "Not enough free slots"
Date Thu, 05 Jan 2017 00:21:16 GMT
In Flink 1.1.3 on emr-5.2.0, I've experienced a particular problem twice and I'm wondering
if anyone has some insight about it.

In both cases, we deployed a job that fails very frequently (within 15s-1m of launch). Eventually,
the Flink cluster dies.

The sequence of events looks something like this:

  *   bad job is launched
  *   bad job fails & is restarted many times (I didn't have the "failure-rate" restart
strategy configuration right)
  *   Task manager logs: org.apache.flink.yarn.YarnTaskManagerRunner (SIGTERM handler): RECEIVED
SIGNAL 15: SIGTERM. Shutting down as requested.
  *   At this point, the YARN resource manager also logs the container failure
  *   More logs: Container ResourceID{resourceId='container_1481658997383_0003_01_000013'}
failed. Exit status: Pmem limit exceeded (-104)
Diagnostics for container ResourceID{resourceId='container_1481658997383_0003_01_000013'}
in state COMPLETE : exitStatus=Pmem limit exceeded (-104) diagnostics=Container [pid=21246,containerID=container_1481658997383_0003_01_000013]
is running beyond physical memory limits. Current usage: 5.6 GB of 5.6 GB physical memory
used; 9.6 GB of 28.1 GB virtual memory used. Killing container.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Total number of failed containers so far: 12
Stopping YARN session because the number of failed containers (12) exceeded the maximum failed
containers (11). This number is controlled by the 'yarn.maximum-failed-containers' configuration
setting. By default its the number of requested containers.
  *   From here onward, the logs repeatedly show that jobs fail to restart due to "org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Not enough free slots available to run the job. You can decrease the operator parallelism
or increase the number of slots per TaskManager in the configuration. Task to schedule: <
Attempt #68 (Source: …) @ (unassigned) - [SCHEDULED] > with groupID < 73191c171abfff61fb5102c161274145
> in sharing group < SlotSharingGroup [73191c171abfff61fb5102c161274145, 19596f7834805c8409c419f0edab1f1b]
>. Resources available to scheduler: Number of instances=0, total number of slots=0, available
  *   Eventually, Flink stops entirely (with another SIGTERM message), presumably because the failed-container limit above was exceeded
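For reference, the 'yarn.maximum-failed-containers' limit from the log above is an ordinary flink-conf.yaml setting; raising it (the value below is only an illustration, not what we actually run) should at least keep a misbehaving job from taking the whole YARN session down while I debug:

```yaml
# flink-conf.yaml -- illustrative value
# Tolerate more container failures before Flink stops the YARN session.
# Per the log message, the default is the number of requested containers.
yarn.maximum-failed-containers: 100
```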

Does anyone have an idea why a bad job repeatedly failing would eventually result in the Flink
cluster dying?

Any idea why I'd get "Pmem limit exceeded" or "Not enough free slots available to run the
job"? The JVM heap usage and the free memory on the machines both look reasonable in my monitoring
dashboards. Could it possibly be a memory leak due to classloading or something?
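In case the slot arithmetic matters, my understanding from the Flink 1.1 docs is that these are the relevant settings (values below are illustrative); "Number of instances=0" in the scheduler message seems to say every TaskManager is gone, not that the slots are merely busy:

```yaml
# flink-conf.yaml -- illustrative values
# Slots offered by each TaskManager; total slots = TaskManagers x this value.
taskmanager.numberOfTaskSlots: 4
# TaskManager heap size in MB; on YARN the container needs headroom
# beyond this for JVM overhead, network buffers, etc.
taskmanager.heap.mb: 4096
# Fraction of container memory reserved outside the heap; if off-heap usage
# grows past this cutoff, YARN's physical-memory check kills the container
# with exactly the "Pmem limit exceeded (-104)" we saw.
yarn.heap-cutoff-ratio: 0.25
```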

Thanks for any help or suggestions you can provide! I am hoping that the "failure-rate" restart
strategy will help avoid this issue in the future, but I'd also like to understand what's
making the cluster die so that I can prevent it.
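For completeness, the failure-rate strategy I'm referring to would, as I read the docs, be configured along these lines (the thresholds here are illustrative, not a recommendation):

```yaml
# flink-conf.yaml -- illustrative thresholds
restart-strategy: failure-rate
# Fail the job permanently after 3 failures within any 5-minute window,
# waiting 10 seconds between restart attempts.
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s
```

With something like this in place, the bad job should fail terminally instead of restarting indefinitely and churning containers.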
