mesos-issues mailing list archives

From "Joseph Wu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-6252) Do not validate start command when re-establishing connection to executor
Date Mon, 26 Sep 2016 17:52:20 GMT

    [ https://issues.apache.org/jira/browse/MESOS-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523735#comment-15523735
] 

Joseph Wu commented on MESOS-6252:
----------------------------------

This seems like the correct behavior from Mesos.

If the master only validated the {{ExecutorID}}, the framework could shoot itself in the
foot by doing something like this:
1) The framework starts a fresh Executor like this (some fields omitted):
{code}
executor_id {
  value: "my-first-executor"
}
command {
  value: "my-custom-executor"
}
{code}
2) The framework sends a task to that executor.  That task expects to be run by {{my-custom-executor}}.
3) The framework then sends another task to the executor, with an {{ExecutorInfo}} like:
{code}
executor_id {
  value: "my-first-executor"
}
command {
  value: "completely-different-executor-that-is-incompatible-with-my-custom-executor"
}
{code}
4) The framework clearly intended to create a *new* Executor for the second task.  But because
the framework re-used the {{ExecutorID}}, the second task should fail validation: it can't
be expected to run on {{my-custom-executor}}.
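
The scenario above can be modeled roughly as follows. This is only a simplified sketch in Python, not the actual Mesos implementation: {{ExecutorInfo}} is trimmed to two fields, and the {{ExecutorRegistry}} class and its {{validate}} method are hypothetical names made up for illustration. The error text is the one from the log quoted in the issue.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ExecutorInfo:
    """Hypothetical, trimmed-down stand-in for the Mesos ExecutorInfo message."""
    executor_id: str
    command: str


class ExecutorRegistry:
    """Toy model of the check (an assumption, not Mesos code):
    at most one ExecutorInfo is remembered per ExecutorID."""

    def __init__(self) -> None:
        self._known: Dict[str, ExecutorInfo] = {}

    def validate(self, info: ExecutorInfo) -> Optional[str]:
        """Return None if the task may use this executor, else an error message."""
        existing = self._known.get(info.executor_id)
        if existing is None:
            # First time this ExecutorID is seen: remember its full ExecutorInfo.
            self._known[info.executor_id] = info
            return None
        if existing != info:
            # Same ExecutorID, different ExecutorInfo: reject the task.
            return ("Task has invalid ExecutorInfo (existing ExecutorInfo "
                    "with same ExecutorID is not compatible).")
        return None


registry = ExecutorRegistry()
first = ExecutorInfo("my-first-executor", "my-custom-executor")
second = ExecutorInfo(
    "my-first-executor",
    "completely-different-executor-that-is-incompatible-with-my-custom-executor",
)

first_result = registry.validate(first)    # step 1: fresh executor
repeat_result = registry.validate(first)   # step 2: same ExecutorInfo again
second_result = registry.validate(second)  # step 3: re-used ID, new command
print(first_result, repeat_result, second_result)
```

In this model, a check that compared only {{executor_id}} would accept the third call, and the second task would end up on an executor it was never written for.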

> Do not validate start command when re-establishing connection to executor
> -------------------------------------------------------------------------
>
>                 Key: MESOS-6252
>                 URL: https://issues.apache.org/jira/browse/MESOS-6252
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: coreos
>            Reporter: Markus Jura
>
> When a framework re-connects to an existing executor, Mesos checks whether the new start command of the {{ExecutorInfo}} equals the old start command.
> In the case of the ConductR framework, these start commands can differ because of a different value in the ConductR agent argument {{--core-node}}.
> As a result, the Mesos master sends a {{TASK_ERROR}} for each running task to the framework. The reason for the error is {{REASON_TASK_INVALID}}.
> {code}
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, akkaTimestamp=11:34:48.713UTC,
akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR received by
the scheduler: task_id {
>   value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
> }
> state: TASK_ERROR
> message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is
not compatible).\n------------------------------------------------------------\nExisting ExecutorInfo:\nexecutor_id
{\n  value: \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  type:
SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources {\n  name: \"mem\"\n
 type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  role: \"*\"\n}\nresources {\n  name:
\"disk\"\n  type: SCALAR\n  scalar {\n    value: 1000\n  }\n  role: \"*\"\n}\nresources {\n
 name: \"ports\"\n  type: RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n
   }\n    range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: \"*\"\n}\ncommand
{\n  uris {\n    value: \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  value: \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\'
&& export JAVA_HOME=$(echo $(pwd)/jre*) && ./conductr-agent-*/bin/conductr-agent
-Dconfig.resource=mesos.conf -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 -Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004 --core-system-name
stop-all-bundles-1\"\n}\nframework_id {\n  value: \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\nTask\'s ExecutorInfo:\nexecutor_id
{\n  value: \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  type:
SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources {\n  name: \"mem\"\n
 type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  role: \"*\"\n}\nresources {\n  name:
\"disk\"\n  type: SCALAR\n  scalar {\n    value: 1000\n  }\n  role: \"*\"\n}\nresources {\n
 name: \"ports\"\n  type: RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n
   }\n    range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: \"*\"\n}\ncommand
{\n  uris {\n    value: \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n
   executable: false\n    extract: true\n    cache: false\n  }\n  value: \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\'
&& export JAVA_HOME=$(echo $(pwd)/jre*) && ./conductr-agent-*/bin/conductr-agent
-Dconfig.resource=mesos.conf -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 -Dconductr-agent.run.allocated-ports.start=10000
-Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004 --core-system-name
stop-all-bundles-1\"\n}\nframework_id {\n  value: \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource:
\"conductr\"\n\n------------------------------------------------------------\n"
> slave_id {
>   value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
> }
> timestamp: 1.474889688506464E9
> source: SOURCE_MASTER
> reason: REASON_TASK_INVALID
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR MesosSchedulerClient
[sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, akkaTimestamp=11:34:48.714UTC,
akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state TASK_ERROR received by
the scheduler: task_id {
>   value: "40034b01-e853-4ada-882f-9aaab67f77c2"
> }
> {code}
> Mesos should only validate the executor id. If the new id of the {{ExecutorInfo}} object equals the old one, then it should allow the reconnection to the running executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
