[ https://issues.apache.org/jira/browse/AURORA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693643#comment-15693643
]
Kostiantyn Bokhan edited comment on AURORA-1830 at 11/24/16 4:29 PM:
---------------------------------------------------------------------
The problem may be related to the DC/OS Mesos configuration. I'm trying to integrate Aurora
with DC/OS in order to provide GPU batch scheduling. The Mesos agents are started with the following
options:
{noformat}
mesos-agent[2270]: kages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos" --logbufsecs="0"
--logging_level="INFO" --master="zk://zk-1.zk:2181,zk-2.zk:2181,zk-3.zk:2181,zk-4.zk:2181,zk-5.zk:2181/mesos"
--modules_dir="/opt/mesosphere/etc/mesos-slave-modules" --network_cni_config_dir="/opt/mesosphere/etc/dcos/network/cni"
--network_cni_plugins_dir="/opt/mesosphere/active/cni/" --nvidia_gpu_devices="[ 0, 1 ]" --oversubscribed_resources_interval="15secs"
--perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs"
--resources="[{"name": "ports", "ranges": {"range": [{"begin": 1025, "end": 2180}, {"begin":
2182, "end": 3887}, {"begin": 3889, "end": 5049}, {"begin": 5052, "end": 8079}, {"begin":
8082, "end": 8180}, {"begin": 8182, "end": 32000}]}, "type": "RANGES"}, {"scalar": {"value":
2}, "name": "gpus", "type": "SCALAR"}, {"scalar": {"value": 428201}, "name": "disk", "type":
"SCALAR", "role": "*"}]" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox"
--strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system"
--version="false" --work_dir="/var/lib/mesos/slave"
{noformat}
So --sandbox_directory is set to its default value. However, *mesos-docker-executor* is started with the following
options:
{noformat}
mesos-docker-executor --container=mesos-195fbdc8-6720-443b-b036-7fa5608b27cc-S21.4bbf7f29-3467-4583-8ca1-94539d698911
--docker=docker --docker_socket=/var/run/docker.sock --help=false --launcher_dir=/opt/mesosphere/packages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos
--mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S21/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0000/executors/aurora_aurora-executor.d8e82d61-ad8c-11e6-879b-70b3d5800003/runs/4bbf7f29-3467-4583-8ca1-94539d698911
--stop_timeout=20secs
{noformat}
Here --launcher_dir=/opt/mesosphere/packages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos
points to the Mesos package in the DC/OS installation.
I've tried configuring thermos_executor accordingly:
{noformat}
thermos_executor --announcer-ensemble 127.0.0.1:2181 --mesos-containerizer-path=/opt/mesosphere/packages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos
{noformat}
But the issue persists.
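One way to narrow this down is to check what the configured path actually refers to on the agent host. The sketch below is only a diagnostic aid and rests on assumptions: the helper names `flag_value` and `describe_containerizer_path` are hypothetical, and whether thermos_executor expects --mesos-containerizer-path to be the launcher directory or the mesos-containerizer binary inside it should be verified against the Aurora documentation for the version in use.

```python
import os
import re


def flag_value(command_line, flag):
    """Return the value of --flag=value (optionally double-quoted), or None."""
    pattern = r'--%s=("?)(\S+?)\1(?:\s|$)' % re.escape(flag)
    match = re.search(pattern, command_line)
    return match.group(2) if match else None


def describe_containerizer_path(path):
    """Classify a candidate --mesos-containerizer-path value (hypothetical helper)."""
    if not os.path.exists(path):
        return 'missing'
    if os.path.isdir(path):
        # Assumption: Mesos ships a binary named mesos-containerizer in launcher_dir.
        candidate = os.path.join(path, 'mesos-containerizer')
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return 'directory containing mesos-containerizer'
        return 'directory without mesos-containerizer'
    if os.access(path, os.X_OK):
        return 'executable file'
    return 'file not executable'


if __name__ == '__main__':
    # Command line shortened from the mesos-docker-executor output above.
    cmd = ('mesos-docker-executor --docker=docker '
           '--launcher_dir=/opt/mesosphere/packages/'
           'mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos '
           '--stop_timeout=20secs')
    launcher_dir = flag_value(cmd, 'launcher_dir')
    print(launcher_dir, '->', describe_containerizer_path(launcher_dir))
```

If the result is "directory containing mesos-containerizer", it may be worth also trying the full path to the binary as the flag value, since an [Errno 2] would be consistent with the executor trying to execute a directory or a nonexistent file.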
was (Author: kr0t):
The problem may be related to the DC/OS Mesos configuration. I'm trying to integrate Aurora
with DC/OS in order to provide GPU batch scheduling.
--mesos-containerizer-path should be set as follows:
{noformat}
command {
uris {
value: "/usr/bin/thermos_executor"
executable: true
}
value: "${MESOS_SANDBOX=.}/thermos_executor --announcer-ensemble 127.0.0.1:2181 --mesos-containerizer-path=/opt/mesosphere/packages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos"
}
{noformat}
But the issue persists.
Maybe there are other paths that should be adjusted.
> Unknown exception initializing sandbox
> --------------------------------------
>
> Key: AURORA-1830
> URL: https://issues.apache.org/jira/browse/AURORA-1830
> Project: Aurora
> Issue Type: Bug
> Components: Executor
> Affects Versions: 0.16.0
> Reporter: Kostiantyn Bokhan
>
> When launching a job using the Mesos containerizer and a docker image, the sandbox setup
fails with the following error:
> {quote}
> FAILED • Unknown exception initializing sandbox: [Errno 2] No such file or directory
> {quote}
> Aurora file:
> {code}
> # run the script
> python = Process(
>   name = 'python',
>   cmdline = 'python --version')
> # describe the task
> python_task = Task(
>   processes = [python],
>   resources = Resources(cpu = 1, ram = 1*GB, disk = 8*GB))
> jobs = [
>   Service(cluster = 'MY Cluster',
>           environment = 'devel',
>           role = 'root',
>           name = 'python',
>           task = python_task,
>           container = Mesos(image = DockerImage(name = 'python', tag = '2')))
> ]
> {code}
> *__main__.log*:
> {noformat}
> Log file created at: 2016/11/24 14:45:44
> Running on machine: gnode1
> [DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg
> Command line: /var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc/thermos_executor
--announcer-ensemble 127.0.0.1:2181
> I1124 14:45:44.041621 25610 executor_base.py:45] Executor [None]: registered() called
with:
> I1124 14:45:44.042294 25610 executor_base.py:45] Executor [None]: ExecutorInfo: executor_id
{
> value: "thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8"
> }
> resources {
> name: "cpus"
> type: SCALAR
> scalar {
> value: 0.25
> }
> role: "*"
> }
> resources {
> name: "mem"
> type: SCALAR
> scalar {
> value: 128.0
> }
> role: "*"
> }
> command {
> uris {
> value: "/usr/bin/thermos_executor"
> executable: true
> }
> value: "${MESOS_SANDBOX=.}/thermos_executor --announcer-ensemble 127.0.0.1:2181"
> }
> framework_id {
> value: "195fbdc8-6720-443b-b036-7fa5608b27cc-0014"
> }
> name: "AuroraExecutor"
> source: "root.devel.python.0"
> container {
> type: MESOS
> volumes {
> container_path: "taskfs"
> mode: RO
> image {
> type: DOCKER
> docker {
> name: "python:2"
> }
> }
> }
> mesos {
> }
> }
> labels {
> labels {
> key: "source"
> value: "root.devel.python.0"
> }
> }
> I1124 14:45:44.042458 25610 executor_base.py:45] Executor [None]: FrameworkInfo: user:
"root"
> name: "Aurora"
> id {
> value: "195fbdc8-6720-443b-b036-7fa5608b27cc-0014"
> }
> failover_timeout: 1814400.0
> checkpoint: true
> hostname: "vnode7"
> capabilities {
> type: GPU_RESOURCES
> }
> I1124 14:45:44.043046 25610 executor_base.py:45] Executor [None]: SlaveInfo: hostname:
"000.000.00.001"
> resources {
> name: "gpus"
> type: SCALAR
> scalar {
> value: 2.0
> }
> role: "*"
> }
> resources {
> name: "ports"
> type: RANGES
> ranges {
> range {
> begin: 1025
> end: 2180
> }
> range {
> begin: 2182
> end: 3887
> }
> range {
> begin: 3889
> end: 5049
> }
> range {
> begin: 5052
> end: 8079
> }
> range {
> begin: 8082
> end: 8180
> }
> range {
> begin: 8182
> end: 32000
> }
> }
> role: "*"
> }
> resources {
> name: "disk"
> type: SCALAR
> scalar {
> value: 428201.0
> }
> role: "*"
> }
> resources {
> name: "cpus"
> type: SCALAR
> scalar {
> value: 8.0
> }
> role: "*"
> }
> resources {
> name: "mem"
> type: SCALAR
> scalar {
> value: 14957.0
> }
> role: "*"
> }
> attributes {
> name: "hostname"
> type: TEXT
> text {
> value: "gnode1"
> }
> }
> attributes {
> name: "ip"
> type: TEXT
> text {
> value: "000.000.00.001"
> }
> }
> attributes {
> name: "rack"
> type: TEXT
> text {
> value: "gpu"
> }
> }
> attributes {
> name: "gputype"
> type: TEXT
> text {
> value: "titanz"
> }
> }
> id {
> value: "195fbdc8-6720-443b-b036-7fa5608b27cc-S24"
> }
> checkpoint: true
> port: 5051
> I1124 14:45:44.043673 25610 executor_base.py:45] Executor [None]: launchTask got task:
root/devel/python:root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8
> I1124 14:45:44.044601 25610 executor_base.py:45] Executor [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]:
Updating root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8 => STARTING
> I1124 14:45:44.044718 25610 executor_base.py:45] Executor [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]:
Reason: Initializing sandbox.
> F1124 14:45:44.049196 25610 aurora_executor.py:85] Unknown exception initializing sandbox:
[Errno 2] No such file or directory
> I1124 14:45:44.049439 25610 executor_base.py:45] Executor [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]:
Updating root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8 => FAILED
> I1124 14:45:44.049519 25610 executor_base.py:45] Executor [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]:
Reason: Unknown exception initializing sandbox: [Errno 2] No such file or directory
> I1124 14:45:49.152787 25610 thermos_executor_main.py:299] MesosExecutorDriver.run() has
finished.
> {noformat}
> *stderr*
> {noformat}
> I1124 14:45:43.559283 25614 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/195fbdc8-6720-443b-b036-7fa5608b27cc-S24\/root","items":[{"action":"BYPASS_CACHE","uri":{"executable":true,"extract":true,"value":"\/usr\/bin\/thermos_executor"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/195fbdc8-6720-443b-b036-7fa5608b27cc-S24\/frameworks\/195fbdc8-6720-443b-b036-7fa5608b27cc-0014\/executors\/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8\/runs\/e25e2e98-0b65-4e9f-a86d-13a18dff01bc","user":"root"}
> I1124 14:45:43.561226 25614 fetcher.cpp:409] Fetching URI '/usr/bin/thermos_executor'
> I1124 14:45:43.561242 25614 fetcher.cpp:250] Fetching directly into the sandbox directory
> I1124 14:45:43.561266 25614 fetcher.cpp:187] Fetching URI '/usr/bin/thermos_executor'
> I1124 14:45:43.561285 25614 fetcher.cpp:167] Copying resource with command:cp '/usr/bin/thermos_executor'
'/var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc/thermos_executor'
> I1124 14:45:43.569787 25614 fetcher.cpp:547] Fetched '/usr/bin/thermos_executor' to '/var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc/thermos_executor'
> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> Writing log files to disk in /var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc
> I1124 14:45:44.033974 25610 exec.cpp:161] Version: 1.0.0
> I1124 14:45:44.040127 25639 exec.cpp:236] Executor registered on agent 195fbdc8-6720-443b-b036-7fa5608b27cc-S24
> FATAL] Unknown exception initializing sandbox: [Errno 2] No such file or directory
> twitter.common.app debug: Shutting application down.
> twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
> twitter.common.app debug: Finishing up module teardown.
> twitter.common.app debug: Active thread: <_MainThread(MainThread, started 139772146038592)>
> twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-2, started
daemon 139771946940160)>
> twitter.common.app debug: Exiting cleanly.
> {noformat}