mesos-issues mailing list archives

From "Markus Jura (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-6252) Do not validate start command when re-establishing connection to executor
Date Mon, 07 Nov 2016 10:37:58 GMT

    [ https://issues.apache.org/jira/browse/MESOS-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15643792#comment-15643792
] 

Markus Jura edited comment on MESOS-6252 at 11/7/16 10:37 AM:
--------------------------------------------------------------

The executor id alone should determine whether an existing executor is re-used. If an executor
with id 123 exists on the slave and the framework sends an ExecutorInfo object with executor
id 123, then I'd just re-use this executor. In our case, the executor start command is created
programmatically and differs depending on the IP address of the framework. If the executor
already exists, the start command of the ExecutorInfo should simply be ignored. In other
words, the start command should not be validated once an existing executor with that id has
been found on the slave.
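
To illustrate the behaviour I'm proposing, here is a minimal Python sketch. It is not Mesos code: the `ExecutorInfo` class, the `resolve_executor` function and the shortened `agent ...` commands are hypothetical stand-ins, and the real ExecutorInfo message has more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorInfo:
    """Simplified, hypothetical stand-in for the Mesos ExecutorInfo message."""
    executor_id: str
    command: str


def resolve_executor(running, requested):
    """Return the executor to use for a task, keyed by executor id only.

    `running` maps executor id -> ExecutorInfo already known on the slave.
    If the id is known, the existing executor is re-used and the command in
    `requested` is ignored; otherwise `requested` is registered as new.
    """
    existing = running.get(requested.executor_id)
    if existing is not None:
        return existing
    running[requested.executor_id] = requested
    return requested


# An executor with id 123 already runs; the framework re-connects from a
# different IP address, so the generated start command differs.
running = {"123": ExecutorInfo("123", "agent --core-node 10.0.0.246:9004")}
reused = resolve_executor(
    running, ExecutorInfo("123", "agent --core-node 10.0.0.248:9004")
)
print(reused.command)  # the command of the executor that is already running
```

With this rule the differing start command never causes an error, because it is only consulted when no executor with the given id exists yet.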


was (Author: markusjura):
The executor id alone should determine if the same executor should be used or not. If an executor
with id 123 exists on the slave and the framework sends an ExecutorInfo object with the executor
id 123 then I'd just re-use this executor. In our case, the executor start command is created
programmatically and is different depending on the IP address of the framework. If the executor
already exists then the start command of the ExecutorInfo should be just ignored. More generally,
the start command should not be validated at all.

> Do not validate start command when re-establishing connection to executor
> -------------------------------------------------------------------------
>
>                 Key: MESOS-6252
>                 URL: https://issues.apache.org/jira/browse/MESOS-6252
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: coreos
>            Reporter: Markus Jura
>
> When a framework re-connects to an existing executor, Mesos checks whether the new start command of the {{ExecutorInfo}} equals the old start command.
> In the case of the ConductR framework, these start commands can differ because of a different value in the ConductR agent argument {{--core-node}}.
> As a result, the Mesos master sends a {{TASK_ERROR}} for each running task to the framework. The reason for the error is {{REASON_TASK_INVALID}}.
> {code}
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, akkaTimestamp=11:34:48.713UTC,
akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR received by
the scheduler: task_id {
>   value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
> }
> state: TASK_ERROR
> message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is
not compatible).\n------------------------------------------------------------\nExisting ExecutorInfo:\nexecutor_id
{\n  value: \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  type:
SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources {\n  name: \"mem\"\n
 type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  role: \"*\"\n}\nresources {\n  name:
\"disk\"\n  type: SCALAR\n  scalar {\n    value: 1000\n  }\n  role: \"*\"\n}\nresources {\n
 name: \"ports\"\n  type: RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n
   }\n    range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: \"*\"\n}\ncommand
{\n  uris {\n    value: \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  value: \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\'
&& export JAVA_HOME=$(echo $(pwd)/jre*) && ./conductr-agent-*/bin/conductr-agent
-Dconfig.resource=mesos.conf -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 -Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004 --core-system-name
stop-all-bundles-1\"\n}\nframework_id {\n  value: \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\nTask\'s ExecutorInfo:\nexecutor_id
{\n  value: \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  type:
SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources {\n  name: \"mem\"\n
 type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  role: \"*\"\n}\nresources {\n  name:
\"disk\"\n  type: SCALAR\n  scalar {\n    value: 1000\n  }\n  role: \"*\"\n}\nresources {\n
 name: \"ports\"\n  type: RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n
   }\n    range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: \"*\"\n}\ncommand
{\n  uris {\n    value: \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  value: \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\'
&& export JAVA_HOME=$(echo $(pwd)/jre*) && ./conductr-agent-*/bin/conductr-agent
-Dconfig.resource=mesos.conf -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 -Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004 --core-system-name
stop-all-bundles-1\"\n}\nframework_id {\n  value: \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\n"
> slave_id {
>   value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
> }
> timestamp: 1.474889688506464E9
> source: SOURCE_MASTER
> reason: REASON_TASK_INVALID
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, akkaTimestamp=11:34:48.714UTC,
akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR received by
the scheduler: task_id {
>   value: "40034b01-e853-4ada-882f-9aaab67f77c2"
> }
> {code}
> Mesos should only validate the executor id. If the executor id of the new {{ExecutorInfo}} object equals the old one, then it should allow the reconnection to the running executor.
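
The two ExecutorInfo dumps in the log above are long, so the one difference is easy to miss. The following Python sketch compares the tail of the two start commands token by token. The helper `differing_tokens` is a hypothetical name introduced for illustration, and the commands are shortened to their final part; everything before `./conductr-agent-*` is identical in both dumps.

```python
def differing_tokens(old_command, new_command):
    """Return (old, new) token pairs that differ at the same position.

    Assumes both commands split into the same number of whitespace-separated
    tokens, which holds for the two commands in the log above.
    """
    old_tokens = old_command.split()
    new_tokens = new_command.split()
    return [(o, n) for o, n in zip(old_tokens, new_tokens) if o != n]


# Tail of the two start commands from the TASK_ERROR message (shortened here).
existing_command = (
    "./conductr-agent-*/bin/conductr-agent "
    "--core-node 10.0.0.246:9004 --core-system-name stop-all-bundles-1"
)
task_command = (
    "./conductr-agent-*/bin/conductr-agent "
    "--core-node 10.0.0.248:9004 --core-system-name stop-all-bundles-1"
)

diff = differing_tokens(existing_command, task_command)
print(diff)  # [('10.0.0.246:9004', '10.0.0.248:9004')]
```

Only the value of `--core-node` differs, while the executor id `conductr-node-10.0.0.249-executor` is the same in both dumps.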



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
