Hi Fabian and Till,

Below is what I've observed today.
I hope it provides strong evidence to help figure out the problem.

I'm attaching another log file, jmlog2.txt, which shows how the per-job cluster behaves differently when the YARN NodeManagers are given more memory than in jmlog.txt.
- jmlog.txt : Each of the 7 NodeManagers (NMs) has 96GB. Only a single TM (50GB) can be scheduled on each NM, so I ended up with only 7 TaskManagers. There's no room for extra, unnecessary TaskManagers.
- jmlog2.txt : Each of the 7 NMs has 128GB. After scheduling a TM on each NM, the RM can schedule 7 additional TMs, as each NM has 78GB remaining.
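
To make the memory arithmetic behind the two setups explicit, here is a rough sketch. It only divides NodeManager memory by TaskManager memory; it ignores the JobManager/ApplicationMaster container and any YARN overhead, and the function names are made up for illustration.

```python
# Rough capacity arithmetic for the two setups (not Flink or YARN code).
# Assumption: only TM containers of a fixed size compete for NM memory.

def tms_per_node(nm_memory_gb: int, tm_memory_gb: int) -> int:
    """How many TMs of the given size fit on one NodeManager."""
    return nm_memory_gb // tm_memory_gb


def total_tms(num_nms: int, nm_memory_gb: int, tm_memory_gb: int) -> int:
    """Upper bound on the number of TMs YARN can place across all NMs."""
    return num_nms * tms_per_node(nm_memory_gb, tm_memory_gb)


NUM_NMS = 7
TM_MEMORY_GB = 50

for log_name, nm_memory_gb in (("jmlog.txt", 96), ("jmlog2.txt", 128)):
    capacity = total_tms(NUM_NMS, nm_memory_gb, TM_MEMORY_GB)
    leftover = nm_memory_gb - TM_MEMORY_GB  # free memory per NM after one TM
    print(f"{log_name}: up to {capacity} TMs, "
          f"{leftover}GB left per NM after the first TM")
```

With 96GB per NM there is room for exactly the 7 TMs I need, while 128GB per NM leaves room for 7 more.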

What I see in both log files is the following:
- The ExecutionGraph creates 100 tasks, as I specified.
- Initially, the 7 necessary containers (for 7 TMs, each with 16 slots) are requested from YARN, which is the desired behavior.
- However, 93 extra, unnecessary requests are made right after the very first TaskManager is registered at the SlotManager, with the following messages:
+ jmlog.txt : Register TaskManager 640b098f3a132b452a74673631a0bf7f at the SlotManager.
+ jmlog2.txt : Registering TaskManager container_1525676778566_0001_01_000005 under 35b9ceed32bd87fa23ddca4282f5abac at the SlotManager.
(Please note that the info messages differ between jmlog.txt and jmlog2.txt; this is due to a recent hotfix, "Add resourceId to TaskManager registration messages".)
The 93 containers should not be requested, because the JobMaster will have enough slots on the 6 remaining TaskManagers, which will soon be registered at the SlotManager.
This causes a deadlock if YARN does not have enough resources to allocate those 93 containers, as in jmlog.txt.
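
For reference, the slot arithmetic can be sketched as below. This is only a calculation on the numbers reported in this email, not Flink's actual allocation logic; the function name is made up for illustration.

```python
# Slot arithmetic for the numbers reported in this email (not Flink code).
import math


def containers_needed(num_tasks: int, slots_per_tm: int) -> int:
    """Minimum number of TMs whose slots cover all tasks."""
    return math.ceil(num_tasks / slots_per_tm)


NUM_TASKS = 100        # tasks created by the ExecutionGraph
SLOTS_PER_TM = 16      # slots configured per TaskManager
OBSERVED_EXTRA = 93    # extra container requests seen in both logs

needed = containers_needed(NUM_TASKS, SLOTS_PER_TM)
slots_available = needed * SLOTS_PER_TM
observed_total = needed + OBSERVED_EXTRA

print(f"containers needed:  {needed}")
print(f"slots they provide: {slots_available} (for {NUM_TASKS} tasks)")
print(f"containers requested in total: {observed_total}")
```

The 7 initial containers already provide more slots than there are tasks, and the total number of requests (7 + 93) equals the number of tasks, which looks like one container request per task rather than per 16 slots.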

Unlike jmlog.txt, jmlog2.txt shows the following:
- Extra TMs are started in the newly allocated containers.
- Extra TMs are not given any tasks for a while.
- Extra TMs are shut down with the message below:
    "Closing TaskExecutor connection container_1525676778566_0001_01_000015 because: TaskExecutor exceeded the idle timeout."
- In the end, there are no pending container requests in jmlog2.txt.

P.S. I just found out that the SlotManager exists only in FLIP-6 mode. Nevertheless, I'm writing this email to user@ since I originally started this thread on user@. Sorry for the inconvenience.

- Dongwon

On Mon, May 7, 2018 at 9:27 PM, Fabian Hueske <fhueske@gmail.com> wrote:
Hi Dongwon,

I see that you are using the latest master (Flink 1.6-SNAPSHOT).
This is a known problem in the new FLIP-6 mode. The ResourceManager tries to allocate too many resources, basically one TM per required slot, i.e., it does not take the number of slots per TM into account.
The resources are not used and should be returned to YARN after a timeout.

I couldn't find a JIRA issue to point you to.
Till (in CC) should know more details about this problem.

Best, Fabian

2018-05-05 12:50 GMT+02:00 Dongwon Kim <eastcirclek@gmail.com>:
I'm testing a per-job cluster on YARN.

I just need to launch 7 TMs, each with 50GB of memory (350GB in total), but Flink makes more resource requests to YARN than necessary.

All of the remaining memory in YARN, around 370GB, is reserved by the Flink job, which I can see in the YARN UI.

The remaining memory is not used but reserved; that’s very weird.

Attached is the JM log.

Any help would be greatly appreciated!


- Dongwon