flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: All but one TMs connect when JM has more than 16G of memory
Date Wed, 30 Sep 2015 15:19:27 GMT
Hi Robert,

the problem here is that YARN's scheduler (YARN has several schedulers:
FIFO, CapacityScheduler, ...) is not giving Flink's
ApplicationMaster/JobManager all the containers it requests. With the
larger AM/JM container, there is probably not enough memory left to fit
the last TaskManager container.
I ran into this issue myself when I wanted to run a Flink job on YARN: in
theory the containers fit, but YARN did not give me all the containers I
requested.
Back then I asked on the yarn-dev list [1] (there were also some off-list
emails), but we could not resolve the issue.
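
To illustrate how a single extra megabyte can tip things over, here is a
rough sketch of the arithmetic. It is not how YARN's schedulers actually
place containers. It assumes a simple first-fit placement, assumes that
requests are rounded up to a multiple of 1024 MB (as far as I know, the
default of yarn.scheduler.minimum-allocation-mb), and uses made-up
per-node capacities (16384 MB offered on the JM machine, 49152 MB on each
TM machine). The function names round_up and place are hypothetical
helpers, and your NodeManager settings may well differ.

```python
import math


def round_up(request_mb, increment_mb):
    """Round a container request up to the next multiple of the increment."""
    return math.ceil(request_mb / increment_mb) * increment_mb


def place(requests_mb, node_free_mb, increment_mb=1024):
    """Toy first-fit placement (an assumption, not YARN's real algorithm).

    Returns the list of original requests that could not be placed.
    """
    free = list(node_free_mb)
    unplaced = []
    for request in requests_mb:
        size = round_up(request, increment_mb)
        for i, available in enumerate(free):
            if size <= available:
                free[i] = available - size
                break
        else:
            # No node had enough free memory for this container.
            unplaced.append(request)
    return unplaced


# Hypothetical capacities offered to YARN: one small JM node, seven TM nodes.
nodes = [16384] + [49152] * 7

# -yjm 16384: the JM fits on the small node, all seven TMs get a node each.
print(place([16384] + [40960] * 7, nodes))

# -yjm 16385: rounded up to 17408 MB, the JM no longer fits on the small
# node and lands on a TM node, which then cannot hold a 40960 MB TM.
print(place([16385] + [40960] * 7, nodes))
```

Under these assumed numbers the first call places everything, while the
second leaves one 40960 MB TaskManager container without a home, which
would match the "6/7" you are seeing. Whether this is what happens on
your cluster depends on the memory each NodeManager actually offers.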

Can you check the ResourceManager logs? Maybe there is a log message that
explains why the container request of Flink's AM is not being fulfilled.


[1]
http://search-hadoop.com/m/AsBtCilK5r1pKLjf1&subj=Re+QUESTION+Allocating+a+full+YARN+cluster

On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke <ro.schmidtke@gmail.com>
wrote:

> It's me again. This is a strange issue, I hope I managed to find the right
> keywords. I got 8 machines, 1 for the JM, the other 7 are TMs with 64G of
> memory each.
>
> When running my job like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....
>
> The job completes without any problems. When running it like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....
>
> (note the one more M of memory for the JM), the execution stalls,
> continuously reporting:
>
> .....
> TaskManager status (6/7)
> TaskManager status (6/7)
> TaskManager status (6/7)
> .....
>
> I did some poking around, but I couldn't find any direct correlation with
> the code.
>
> The JM log says:
>
> .....
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$
>        -  JVM Options:
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$
>        -     -Xmx12289M
> .....
>
> but then continues to report
>
> .....
> 16:52:59,311 INFO
>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user
> requested 7 containers, 6 running. 1 containers missing
> 16:52:59,831 INFO
>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user
> requested 7 containers, 6 running. 1 containers missing
> 16:53:00,351 INFO
>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user
> requested 7 containers, 6 running. 1 containers missing
> .....
>
> forever until I cancel the job.
>
> If you have any ideas I'm happy to try them out. Thanks in advance for any
> hints! Cheers.
>
> Robert
> --
> My GPG Key ID: 336E2680
>
