flink-user mailing list archives

From Maximilian Michels <...@apache.org>
Subject Re: Master (1.1-SNAPSHOT) Can't run on YARN
Date Thu, 21 Apr 2016 14:42:27 GMT
Hi Stefano,

Thanks for reporting. I wasn't able to reproduce the problem. I ran
./bin/yarn-session.sh -n 1 -s 2 -jm 2048 -tm 2048 on a YARN cluster
and it created a Flink cluster with a JobManager and a TaskManager
with two task slots. By the way, if you omit the "-s 2" flag, then the
default is read from the config, which is one task slot.

Could it be that an old TaskManager instance is trying to register
with a new JobManager? The log messages suggest this, because the
ResourceManager (which creates TaskManagers) is not aware of the
instance. It is still unclear why that instance would be lingering.
Could you kill the instance and then try bringing up a cluster several
times to see whether that solves the problem? If not, could you send
the full logs to my email address?
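
To check whether a single leftover container is behind the refusals, a
rough Python sketch like the one below can tally refused registrations
per container ID. It assumes each log entry sits on one line (the
excerpt quoted further down was wrapped by the mail client) and matches
only the message wording seen in this thread; the function name
count_refusals is made up for illustration.

```python
import re
from collections import Counter

# Matches the "was refused" error and captures the container ID.
# Assumption: the wording is exactly as in the log excerpt in this thread.
REFUSED = re.compile(r"was refused, because: .*?resourceId='(?P<container>[^']+)'")


def count_refusals(lines):
    """Count refused TaskManager registrations per container ID."""
    counts = Counter()
    for line in lines:
        match = REFUSED.search(line)
        if match:
            counts[match.group("container")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Read aggregated log lines from stdin and print the most frequent IDs.
    for container, n in count_refusals(sys.stdin).most_common():
        print(f"{n:8d}  {container}")
```

If nearly all refusals come from one container ID, and that ID does not
belong to the session you just started, that would point to a lingering
instance.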

Thanks,
Max

On Wed, Apr 20, 2016 at 4:30 PM, Ufuk Celebi <uce@apache.org> wrote:
> The user list is OK since you are reporting a bug here ;-) I'm
> confident that this will be fixed soon! :-)
>
> On Wed, Apr 20, 2016 at 11:28 AM, Stefano Baghino
> <stefano.baghino@radicalbit.io> wrote:
>> Not exactly. I just wanted to let you know about it and find out whether
>> anyone else has experienced this issue. Perhaps it's more of a dev mailing
>> list discussion; sorry for posting it here. Feel free to continue the
>> discussion on the other list if you feel it's more appropriate.
>>
>> On Tue, Apr 19, 2016 at 6:53 PM, Ufuk Celebi <uce@apache.org> wrote:
>>>
>>> Hey Stefano,
>>>
>>> Flink's resource management has been refactored for 1.1 recently. This
>>> could be a regression introduced by this. Max can probably help you
>>> with more details. Is this currently a blocker for you?
>>>
>>> – Ufuk
>>>
>>> On Tue, Apr 19, 2016 at 6:31 PM, Stefano Baghino
>>> <stefano.baghino@radicalbit.io> wrote:
>>> > Hi everyone,
>>> >
>>> > I'm currently experiencing a weird situation, I hope you can help me out
>>> > with this.
>>> >
>>> > I've cloned and built from master, then edited the default config
>>> > file by adding my Hadoop config path, exported the HADOOP_CONF_DIR
>>> > env var, and ran bin/yarn-session.sh -n 1 -s 2 -jm 2048 -tm 2048
>>> >
>>> > The first thing I noticed is that I had to pass "-s 2", or the task
>>> > managers are created with -1 slots (!) by default.
>>> >
>>> > After passing "-s 2", the YARN session startup hangs when trying to
>>> > register the task managers. I stopped the session, aggregated the
>>> > logs, and found several thousand of the messages attached at the
>>> > bottom; any idea what this may be?
>>> >
>>> > Thank you a lot in advance!
>>> >
>>> > 2016-04-19 12:15:59,507 INFO  org.apache.flink.yarn.YarnTaskManager
>>> > - Trying to register at JobManager
>>> > akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1,
>>> > timeout:
>>> > 500 milliseconds)
>>> >
>>> > 2016-04-19 12:15:59,649 ERROR org.apache.flink.yarn.YarnTaskManager
>>> > - The registration at JobManager
>>> > Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
>>> > because: java.lang.IllegalStateException: Resource
>>> > ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
>>> > registered with resource manager.. Retrying later...
>>> >
>>> > 2016-04-19 12:16:00,025 INFO  org.apache.flink.yarn.YarnTaskManager
>>> > - Trying to register at JobManager
>>> > akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 2,
>>> > timeout:
>>> > 1000 milliseconds)
>>> >
>>> > 2016-04-19 12:16:00,033 ERROR org.apache.flink.yarn.YarnTaskManager
>>> > - The registration at JobManager
>>> > Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
>>> > because: java.lang.IllegalStateException: Resource
>>> > ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
>>> > registered with resource manager.. Retrying later...
>>> >
>>> > 2016-04-19 12:16:01,045 INFO  org.apache.flink.yarn.YarnTaskManager
>>> > - Trying to register at JobManager
>>> > akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 3,
>>> > timeout:
>>> > 2000 milliseconds)
>>> >
>>> > 2016-04-19 12:16:01,053 ERROR org.apache.flink.yarn.YarnTaskManager
>>> > - The registration at JobManager
>>> > Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
>>> > because: java.lang.IllegalStateException: Resource
>>> > ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
>>> > registered with resource manager.. Retrying later...
>>> >
>>> > 2016-04-19 12:16:03,064 INFO  org.apache.flink.yarn.YarnTaskManager
>>> > - Trying to register at JobManager
>>> > akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 4,
>>> > timeout:
>>> > 4000 milliseconds)
>>> >
>>> > 2016-04-19 12:16:03,072 ERROR org.apache.flink.yarn.YarnTaskManager
>>> > - The registration at JobManager
>>> > Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
>>> > because: java.lang.IllegalStateException: Resource
>>> > ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
>>> > registered with resource manager.. Retrying later...
>>> >
>>> > 2016-04-19 12:16:07,085 INFO  org.apache.flink.yarn.YarnTaskManager
>>> > - Trying to register at JobManager
>>> > akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 5,
>>> > timeout:
>>> > 8000 milliseconds)
>>> >
>>> > 2016-04-19 12:16:07,092 ERROR org.apache.flink.yarn.YarnTaskManager
>>> > - The registration at JobManager
>>> > Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
>>> > because: java.lang.IllegalStateException: Resource
>>> > ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
>>> > registered with resource manager.. Retrying later...
>>> >
>>> > 2016-04-19 12:16:09,664 INFO  org.apache.flink.yarn.YarnTaskManager
>>> > - Trying to register at JobManager
>>> > akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1,
>>> > timeout:
>>> > 500 milliseconds)
>>> >
>>> >
>>> > --
>>> > BR,
>>> > Stefano Baghino
>>> >
>>> > Software Engineer @ Radicalbit
>>
>>
>>
>>
>> --
>> BR,
>> Stefano Baghino
>>
>> Software Engineer @ Radicalbit
