mesos-issues mailing list archives

From "Pierre Cheynier (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2
Date Wed, 15 Feb 2017 18:53:41 GMT
Pierre Cheynier created MESOS-7130:
--------------------------------------

             Summary: port_mapping isolator: executor hangs when running on EC2
                 Key: MESOS-7130
                 URL: https://issues.apache.org/jira/browse/MESOS-7130
             Project: Mesos
          Issue Type: Bug
          Components: ec2, executor
            Reporter: Pierre Cheynier


Hi,
I'm experiencing a weird issue: I'm using a CI pipeline to test our infrastructure automation.
I recently activated the {{network/port_mapping}} isolator.

I'm able to make the changes work and pass the tests for bare-metal servers and VirtualBox
VMs using this configuration.
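
For context, the port_mapping-related part of that configuration (extracted verbatim from the full mesos-slave command line shown in the process tree further down):
{noformat}
# port_mapping-related agent flags (subset of the full command line below)
--isolation=...,network/port_mapping
--ephemeral_ports_per_container=128
--egress_unique_flow_per_container
--resources=ports:[31000-32000];ephemeral_ports:[32768-57344]
{noformat}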

But when I try on EC2 (on which my CI pipeline relies), it systematically fails to run any container.

It appears that the sandbox is created and the port_mapping isolator seems to be OK according
to the logs in stdout and stderr and the {{tc}} output:
{noformat}
+ mount --make-rslave /run/netns
+ test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
+ echo 1
+ ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
+ ethtool -K eth0 rx off
(...)
+ tc filter show dev eth0 parent ffff:0
+ tc filter show dev lo parent ffff:0
I0215 16:01:13.941375     1 exec.cpp:161] Version: 1.0.2
{noformat}
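
To double-check that the isolator really set things up on the EC2 host, here is a sketch of the commands I would run on the agent (the namespace names under /run/netns are whatever the isolator created, so the last command is illustrative):
{noformat}
# filters installed by the isolator on the host interfaces
tc qdisc show dev eth0
tc filter show dev eth0 parent ffff:
tc filter show dev lo parent ffff:

# per-container network namespaces bind-mounted under /run/netns (see the mount above)
ip netns list
# example: repeat the checks from inside a container namespace (name is illustrative)
ip netns exec <container-netns> ip addr show
{noformat}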

Then the executor never comes back to the REGISTERED state and hangs indefinitely.

{{GLOG_v=3}} doesn't help here.
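
For reference, one way to set it (a sketch; assuming it is injected through the agent's {{--executor_environment_variables}} flag, which this configuration already uses, so the exact invocation may differ):
{noformat}
--executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod","GLOG_v":"3"}
{noformat}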

My skills in this area are limited, but after loading the symbols and attaching gdb to the
mesos-executor process, I was able to print this stack:
{noformat}
#0  0x00007feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007feffbed69ec in std::condition_variable::wait(std::unique_lock<std::mutex>&)
() from /usr/lib64/libstdc++.so.6
#2  0x00007ff0003dd8ec in void synchronized_wait<std::condition_variable, std::mutex>(std::condition_variable*,
std::mutex*) () from /usr/lib64/libmesos-1.0.2.so
#3  0x00007ff0017d595d in Gate::arrive(long) () from /usr/lib64/libmesos-1.0.2.so
#4  0x00007ff0017c00ed in process::ProcessManager::wait(process::UPID const&) () from
/usr/lib64/libmesos-1.0.2.so
#5  0x00007ff0017c5c05 in process::wait(process::UPID const&, Duration const&) ()
from /usr/lib64/libmesos-1.0.2.so
#6  0x00000000004ab26f in process::wait(process::ProcessBase const*, Duration const&)
()
#7  0x00000000004a3903 in main ()
{noformat}
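
Roughly how I captured the backtrace above (a sketch; the package and binary names are the CentOS 7 ones from this setup and may need adjusting):
{noformat}
# install mesos debug symbols (CentOS 7, via yum-utils; exact package name may differ)
debuginfo-install -y mesos

# attach to the hanging executor and dump all thread backtraces
gdb -p $(pgrep -f mesos-executor) -batch -ex 'thread apply all bt'
{noformat}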

I concluded that the underlying shell script launched by the isolator, or the task itself, is
simply blocked, but I don't understand why.

Here is a process tree showing that I have no task running, but the executor is:
{noformat}
root     28420  0.8  3.0 1061420 124940 ?      Ssl  17:56   0:25 /usr/sbin/mesos-slave --advertise_ip=127.0.0.1
--attributes=platform:centos;platform_major_version:7;type:base --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup
--cgroups_net_cls_primary_handle=0xC370 --container_logger=org_apache_mesos_LogrotateContainerLogger
--containerizers=mesos,docker --credential=file:///etc/mesos-chef/slave-credential --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}
--default_role=default --docker_registry=/usr/share/mesos/users --docker_store_dir=/var/opt/mesos/store/docker
--egress_unique_flow_per_container --enforce_container_disk_quota --ephemeral_ports_per_container=128
--executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"}
--image_providers=docker --image_provisioner_backend=copy --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping
--logging_level=INFO --master=zk://mesos:test@localhost.localdomain:2181/mesos --modules=file:///etc/mesos-chef/slave-modules.json
--port=5051 --recover=reconnect --resources=ports:[31000-32000];ephemeral_ports:[32768-57344]
--strict --work_dir=/var/opt/mesos
root     28484  0.0  2.3 433676 95016 ?        Ssl  17:56   0:00  \_ mesos-logrotate-logger
--help=false --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout
--logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
root     28485  0.0  2.3 499212 94724 ?        Ssl  17:56   0:00  \_ mesos-logrotate-logger
--help=false --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stderr
--logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
marathon 28487  0.0  2.4 635780 97388 ?        Ssl  17:56   0:00  \_ mesos-executor --launcher_dir=/usr/libexec/mesos
{noformat}
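
To narrow down what differs on EC2 compared to the environments where this works, a sketch of the interface settings I would compare side by side (the 9001 MTU and the {{ethtool -K eth0 rx off}} step both appear in the isolator log above):
{noformat}
# compare MTU and offload settings between the EC2 instance and a bare-metal/VirtualBox host
ip link show eth0
ethtool -k eth0
{noformat}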

If someone has a clue about the issue I'm experiencing on EC2, I would be interested to discuss it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
