flink-user mailing list archives

From Biswajit Das <biswajit...@gmail.com>
Subject Re: Flink -mesos-app master hang
Date Sat, 05 Aug 2017 02:16:23 GMT
Hi Till ,

Thank you for the reply. I posted some logs in the initial email chain.
I think the issue has more to do with the Docker private registry when
authorization is involved. I can run the JobManager and TaskManager in
Docker as separate Marathon tasks and connect them via the RPC port. I was
trying to run via the Mesos app master so that the JobManager itself
launches the TaskManagers as part of the framework.

Thank you again

~ Biswajit

On Fri, Aug 4, 2017 at 3:17 AM, Till Rohrmann <trohrmann@apache.org> wrote:

> Hi Biswajit,
>
> are there any Mesos logs which might help us pinpoint the problem? I've
> actually never run Flink on Mesos with Docker images. But it could be that
> Flink does not set things up properly for running Docker images. I'll try
> to run Flink based on Docker images over the weekend in order to see
> whether I can reproduce the problem.
>
> Cheers,
> Till
>
> On Wed, Aug 2, 2017 at 8:48 PM, Biswajit Das <biswajit.ds@gmail.com>
> wrote:
>
>> Hi There,
>>
>> I posted this to the group a few days back, and since then I have been
>> exchanging emails with Eron; thanks to Eron for all the tips.
>> Now I see this basic auth error. I'm a little confused about how the
>> JobManager launched fine while the TaskManager fails to authenticate.
>> Also, the Mesos docs say authenticate is false by default, so it should not
>> have gone down that path. Do I have to disable it somewhere inside Flink? I
>> don't see any config or property for it in the code.
>>
>> This is a blocker for my Mesos deployment right now; I would really
>> appreciate any input or suggestions.
>>
>> ~ Biswajit
>>
>> ---------- Forwarded message ----------
>> From: Eron Wright <ewright@live.com>
>> Date: Wed, Aug 2, 2017 at 10:51 AM
>> ------------------------------
>> *From:* Biswajit Das <biswajit.ds@gmail.com>
>> *Sent:* Wednesday, August 2, 2017 10:19:45 AM
>> *To:* Eron Wright
>> *Subject:* Re: Flink -mesos-app master hang
>>
>> Hi Eron ,
>>
>> Good morning. I'm really sorry for flooding you with questions. I'll post
>> this one to the user group as well.
>> I could narrow down the actual error thrown by Mesos: it seems the JM is
>> somehow not able to authenticate. I'm a little confused whether it is a
>> *docker private registry TLS error* or something else. I have even started
>> the slave with --docker_config; previously I mostly used docker.tar.gz
>> with the container for private repo authentication.
>>
>> 017-08-02 03:32:54,163 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00003 failed unexpectedly.
>> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  *- Mesos task taskmanager-00003 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"')*
>> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00003 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"'
>> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
>> 2017-08-02 03:32:54,164 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>> 2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>> 2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
>> 2017-08-02 03:32:54,171 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
>> root@ip-172-31-4-44:/etc/me
>>
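The diagnostics line in the log above carries the task name, the Mesos reason code, and the underlying message in one place. A minimal Python sketch for pulling those fields out of a log file follows; the regular expressions are written against the excerpt above and are an assumption, not a documented Flink log format. The quoted challenge (`Basic realm=...`) suggests the agent's registry client did not accept the Basic challenge returned by the private registry, but that is a reading of the message, not something confirmed in this thread.

```python
import re

# Pattern for the "Diagnostics for task ..." line as it appears in the
# excerpt above (after undoing the mail client's line wrapping).
DIAG_RE = re.compile(
    r"Diagnostics for task (?P<task>\S+) in state (?P<state>\S+)\s*:\s*"
    r"reason=(?P<reason>\S+)\s+message=(?P<message>.*)"
)

# Pattern for a quoted HTTP auth challenge such as 'Basic realm="Registry Realm"'.
CHALLENGE_RE = re.compile(r"'(?P<scheme>[A-Za-z]+)\s+(?P<params>[^']*)'")


def parse_diagnostics(line):
    """Return task, state, reason and message from a diagnostics line, or None."""
    match = DIAG_RE.search(line)
    return match.groupdict() if match else None


def challenge_scheme(message):
    """Return the auth scheme quoted in the failure message, or None."""
    match = CHALLENGE_RE.search(message)
    return match.group("scheme") if match else None


# The diagnostics line from the excerpt, rejoined into a single string.
line = (
    "2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime"
    ".clusterframework.MesosFlinkResourceManager  - Diagnostics for task "
    "taskmanager-00003 in state TASK_FAILED : "
    "reason=REASON_CONTAINER_LAUNCH_FAILED "
    "message=Failed to launch container: Unexpected WWW-Authenticate header "
    "format: 'Basic realm=\"Registry Realm\"'"
)

info = parse_diagnostics(line)
print(info["task"], info["reason"], challenge_scheme(info["message"]))
```

Running the same two functions over the earlier log (the `Unsupported container image type: DOCKER` failure) returns no challenge scheme, which separates the "image provider not enabled" failure from the registry authentication failure.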
>> On Tue, Aug 1, 2017 at 1:53 PM, Eron Wright <ewright@live.com> wrote:
>>
>>> I think you're on the right track, in trying to configure the docker
>>> image provider.  This is on Linux, right? And you definitely restarted
>>> the agents?
>>>
>>>
>>> An important difference between the JM and the TM is that the JM is a
>>> task launched by the Marathon framework, whereas the TM is a task launched
>>> by the JM framework.  The respective configurations and behaviors are
>>> different.  For example, I see that Marathon is launching the JM with the
>>> Docker containerizer, whereas the JM is launching the TM with the Mesos
>>> containerizer (with Docker image provider support).  The Mesos
>>> containerizer is more modern and preferred, and I don't think Flink
>>> supports anything else.
>>>
>>>
>>> The doc I linked to shows how to launch a docker image-based container
>>> with mesos-execute.   Using mesos-execute to verify your cluster
>>> configuration is a good idea, to isolate any issue.  For example, see if
>>> you can launch a container using the Mesos containerizer and the Docker
>>> image provider, executing a simple command such as 'sleep'.
>>>
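The smoke test described above can be sketched as a small Python helper that assembles a `mesos-execute` command line. The flag names (`--master`, `--name`, `--docker_image`, `--command`) are the ones I recall from the Mesos container-image documentation, and the assumption that `mesos-execute` uses the Mesos containerizer by default should be checked against `mesos-execute --help` on your version. The master address, task name and helper name are placeholders for illustration.

```python
import shlex


def mesos_execute_args(master, image, command="sleep 60", name="image-pull-test"):
    """Build an argument list for a one-off mesos-execute smoke test.

    Passing --docker_image asks the agent to provision the container from a
    Docker image, which exercises the Docker image provider (and the registry
    credentials) without involving Flink at all.
    """
    return [
        "mesos-execute",
        f"--master={master}",
        f"--name={name}",
        f"--docker_image={image}",
        f"--command={command}",
    ]


# Placeholder master address; the image name is the one from the thread.
args = mesos_execute_args("MASTER_HOST:5050", "docker.xx.xx/flink:1.8.0")

# Print a shell-safe command line to copy onto a machine with the Mesos CLI.
print(shlex.join(args))
```

If this task fails with the same `REASON_CONTAINER_LAUNCH_FAILED` message, the problem is in the agent or registry configuration rather than in Flink.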
>>>
>>> Eron
>>> ------------------------------
>>> *From:* Biswajit Das <biswajit.ds@gmail.com>
>>> *Sent:* Tuesday, August 1, 2017 10:02:51 AM
>>> *To:* Eron Wright
>>>
>>> *Subject:* Re: Flink -mesos-app master hang
>>>
>>> Hi Eron ,
>>>
>>> Thank you for the email , I really appreciate your reply.
>>>
>>> That's what is confusing me. I have been running Mesos with containers
>>> on both staging and production for almost a year now, mostly with
>>> Spark/Presto load; everything is containerized on a fairly big cluster.
>>> Here is one of my slave configs. One interesting part is that the app
>>> master is launched, I can access the JobManager web UI from the Mesos
>>> framework, and I can also see it has registered itself as the `flink`
>>> framework. The only problem I see is that the TaskManager count shows `0`,
>>> although I asked for 2 instances.
>>>
>>>
>>> /usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos
>>> --attributes=environment:dev;agent_role:generic --containerizers=docker,mesos
>>> --executor_registration_timeout=10mins --hostname=XXX --image_providers=appc,docker
>>> --ip=XXX --isolation=filesystem/linux,docker/runtime
>>> --resources=ports(*):[0-65535] --work_dir=/var/lib/mesos
>>>
>>>
>>> Previously I never had *--image_providers and --isolation*; after
>>> seeing this error I added these two, but it did not help much. I'm running
>>> on Ubuntu with Mesos 1.1.0 and submitting the job with Marathon.
>>>
>>>
>>> I have tried toggling the Mesos debug log, but there is not much info other
>>> than getting a signal to kill the framework.
>>>
>>> marathon json task
>>>
>>>> {
>>>>   "id": "/flink-app-master",
>>>>   "cmd": null,
>>>>   "cpus": 2,
>>>>   "mem": 4096,
>>>>   "disk": 10000,
>>>>   "instances": 1,
>>>>   "constraints": [
>>>>     [
>>>>       "hostname",
>>>>       "LIKE",
>>>>       "xxx" ->>> restricted to some hosts for debugging, as I have a
>>>> fairly big cluster
>>>>     ]
>>>>   ],
>>>>   "acceptedResourceRoles": [
>>>>     "*"
>>>>   ],
>>>>   "container": {
>>>>     "type": "DOCKER",
>>>>     "volumes": [],
>>>>     "docker": {
>>>>       "image": "docker.xx.xx/flink:1.8.0",
>>>>       "network": "HOST",
>>>>       "portMappings": [],
>>>>       "privileged": false,
>>>>       "parameters": [],
>>>>       "forcePullImage": false
>>>>     }
>>>>   },
>>>>   "env": {
>>>>     "MESOS_MASTER": "zk://XX/mesos"
>>>>   },
>>>>   "portDefinitions": [
>>>>     {
>>>>       "port": 9081,
>>>>       "protocol": "tcp",
>>>>       "name": "default",
>>>>       "labels": {}
>>>>     }
>>>>   ],
>>>>   "uris": [
>>>>     "file:///etc/docker.tar.gz"
>>>>   ],
>>>>   "fetch": [
>>>>     {
>>>>       "uri": "file:///etc/docker.tar.gz",
>>>>       "extract": true,
>>>>       "executable": false,
>>>>       "cache": false
>>>>     }
>>>>   ]
>>>> }
>>>>
>>>
>>> On Tue, Aug 1, 2017 at 7:22 AM, Eron Wright <ewright@live.com> wrote:
>>>
>>>> From the error message it seems that your Mesos cluster doesn't have
>>>> the docker image provisioner installed.   The message originates from Mesos
>>>> anyway so the problem lies there.   Note that docker image support is
>>>> provided in Linux only.  You can also use the Flink on Mesos support
>>>> without images, if you make sure that JAVA_HOME is set on all executors.
>>>>
>>>> Hope this helps!
>>>>
>>>> http://mesos.apache.org/documentation/latest/container-image/
>>>>
>>>>
>>>>
>>>>
>>>> From: Biswajit Das
>>>> Sent: Tuesday, August 1, 1:24 AM
>>>> Subject: Re: Flink -mesos-app master hang
>>>> To: ewright@live.com
>>>>
>>>>
>>>> Hi Eron, I came across some of your comments in JIRA and wanted
>>>> to clarify this ^^. I'm running a little clueless here; any pointers on
>>>> where to look?
>>>>
>>>>
>>>> -----------------------------------------------
>>>> 2017-08-01 07:26:34,688 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator            - Waiting for more offers; 1 task(s) are not yet launched.
>>>> 2017-08-01 07:26:34,717 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Launching Mesos task taskmanager-00039 on host 172.31.5.212.
>>>> 2017-08-01 07:26:34,731 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00039 failed unexpectedly.
>>>> *2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00039 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unsupported container image type: DOCKER)*
>>>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00039 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unsupported container image type: DOCKER
>>>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
>>>> 2017-08-01 07:26:34,734 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
>>>> 2017-08-01 07:26:34,745 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
>>>> 2017-08-01 07:26:34,745 INFO  org.apache.f
>>>> ---------------------------------------------------
>>>>
>>>> Thank you in advance .
>>>> ~Biswajit
>>>>
>>>> On Sun, Jul 30, 2017 at 12:42 PM, Biswajit Das <biswajit.ds@gmail.com>
>>>> wrote:
>>>>
>>>> Hi All,
>>>> I'm trying to run a Flink Docker image from Marathon with the Mesos app
>>>> master; I can see it goes into a continuous loop and fails to launch the
>>>> TaskManager. If I go to the Mesos master UI, I can see the JobManager web
>>>> UI with zero TaskManagers.
>>>>
>>>> I have checked pretty much every possible log, starting from the Ubuntu
>>>> machine's docker.log and the Mesos master/slave logs; there is pretty much
>>>> no information other than the failed task. I can see the log below in
>>>> Flink. However, I'm able to run the same Docker image if I run the
>>>> JobManager and TaskManager by themselves in Marathon and let them connect
>>>> via the JobManager RPC port.
>>>>
>>>> For the Mesos config, I'm using the details below from the yml:
>>>> mesos.master: ${MESOS_MASTER}
>>>> mesos.failover-timeout: 60
>>>> mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}
>>>> mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}
>>>> mesos.resourcemanager.tasks.cpus: ${RESOURCEMANAGER_TASKS_CPU:-1}
>>>> mesos.resourcemanager.tasks.container.type: docker
>>>> mesos.resourcemanager.tasks.container.image.name: ${IMAGE_NAME}
>>>>
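The settings above rely on shell-style placeholders (`${NAME}` and `${NAME:-default}`) being substituted before Flink reads the file. How the image performs that substitution is not shown in the thread, so the following Python sketch only illustrates the expansion rules and flags placeholders that have no value and no default; the `expand` helper and the sample environment are hypothetical.

```python
import re

# Matches ${NAME} and ${NAME:-default}, the two placeholder forms used above.
PLACEHOLDER_RE = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}"
)


def expand(value, env):
    """Expand placeholders in value; return (text, names left unresolved)."""
    missing = []

    def replace(match):
        name, default = match.group("name"), match.group("default")
        if env.get(name):            # set and non-empty wins over the default
            return env[name]
        if default is not None:      # ${NAME:-default} falls back to default
            return default
        missing.append(name)         # a real shell would substitute "" here
        return match.group(0)        # keep the placeholder visible instead

    return PLACEHOLDER_RE.sub(replace, value), missing


config = {
    "mesos.master": "${MESOS_MASTER}",
    "mesos.initial-tasks": "${INITIAL_TASK_MANAGERS}",
    "mesos.resourcemanager.tasks.mem": "${RESOURCEMANAGER_TASKS_MEM:-4096}",
    "mesos.resourcemanager.tasks.container.image.name": "${IMAGE_NAME}",
}

# Hypothetical environment for illustration: only MESOS_MASTER is set.
env = {"MESOS_MASTER": "zk://XX/mesos"}

unresolved = []
for key, raw in config.items():
    text, missing = expand(raw, env)
    unresolved.extend(missing)
    print(f"{key}: {text}")
print("unresolved:", unresolved)
```

A check like this, run inside the container against its real environment, shows whether values such as the image name and the initial task count actually reach the Flink configuration.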
>>>> ---------------------------
>>>> 07-30 02:05:48,351 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00002 failed unexpectedly.
>>>> 2017-07-30 02:05:48,352 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00002 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED (Container exited with status 127)
>>>> -----------------------------------------------------
>>>>
>>>> Please let me know if anyone has any pointers for debugging this further.
>>>>
>>>>
>>>> ~ Biswajit
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
