mesos-issues mailing list archives

From "Michael Gummelt (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-2797) mesos-slave dies when it hits open file descriptor limit
Date Tue, 02 Jun 2015 00:18:17 GMT

    [ https://issues.apache.org/jira/browse/MESOS-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568277#comment-14568277 ]

Michael Gummelt commented on MESOS-2797:
----------------------------------------

Note that I had a "sleep 1" task running under Marathon for a long time, so there could be
a lot of executor state somewhere.

> mesos-slave dies when it hits open file descriptor limit
> --------------------------------------------------------
>
>                 Key: MESOS-2797
>                 URL: https://issues.apache.org/jira/browse/MESOS-2797
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.22.1
>            Reporter: Michael Gummelt
>
> I'm running mesos-slave under systemd as part of Mesosphere's DCOS.  The slave process is repeatedly dying as it hits the system's open file descriptor limit of 1024.  See the "mesos-slave.log" section below.
> I stop mesos-slave, remove the directory specified in the slave logs, and still get the same error.  lsof shows that mesos-slave is opening several hundred pipes.  See the "lsof.log" section below.
> ====mesos-slave.log====
> Jun 01 23:49:19 dcos-01 systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
> Jun 01 23:49:19 dcos-01 systemd[1]: Stopping Mesos Slave...
> Jun 01 23:49:19 dcos-01 systemd[1]: Starting Mesos Slave...
> Jun 01 23:49:19 dcos-01 ping[14896]: PING leader.mesos (172.17.8.101) 56(84) bytes of data.
> Jun 01 23:49:19 dcos-01 ping[14896]: 64 bytes from dcos-01 (172.17.8.101): icmp_seq=1 ttl=64 time=0.023 ms
> Jun 01 23:49:19 dcos-01 ping[14896]: --- leader.mesos ping statistics ---
> Jun 01 23:49:19 dcos-01 ping[14896]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> Jun 01 23:49:19 dcos-01 ping[14896]: rtt min/avg/max/mdev = 0.023/0.023/0.023/0.000 ms
> Jun 01 23:49:19 dcos-01 systemd[1]: Started Mesos Slave.
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.713110 14899 logging.cpp:172] INFO level logging started!
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715564 14899 main.cpp:156] Build: 2015-05-19 18:43:41 by
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715600 14899 main.cpp:158] Version: 0.22.1
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.715618 14899 main.cpp:165] Git SHA: dd082c8656eb6e93e091a12fc5cfee3700a61bb1
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.830142 14899 containerizer.cpp:110] Using isolation: cgroups/cpu,cgroups/mem
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.845340 14899 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.845696 14899 main.cpp:200] Starting Mesos slave
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,845:14899(0x7f111ff43700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@716: Client environment:host.name=dcos-01
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Thu Mar 26 10:44:46 UTC 2015
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,846:14899(0x7f111ff43700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=leader.mesos:2181 sessionTimeout=10000 watcher=0x7f11246c0140 sessionId=0 sessionPasswd=<null> context=0x7f1114000b40 flags=0
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.846161 14899 slave.cpp:174] Slave started on 1)@172.17.8.101:5051
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.846206 14899 slave.cpp:194] Moving slave process into its own cgroup for subsystem: cpu
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,855:14899(0x7f110bde7700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.8.101:2181]
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: 2015-06-01 23:49:19,855:14899(0x7f110bde7700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.8.101:2181], sessionId=0x14d77b31175030e, negotiated timeout=10000
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.856979 14900 group.cpp:313] Group process (group(1)@172.17.8.101:5051) connected to ZooKeeper
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.857028 14900 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.857049 14900 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.869518 14900 detector.cpp:138] Detected a new leader: (id='16')
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.869675 14900 group.cpp:659] Trying to get '/mesos/info_0000000016' in ZooKeeper
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.870889 14900 detector.cpp:452] A new leading master (UPID=master@172.17.8.101:5050) is detected
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.875787 14899 slave.cpp:194] Moving slave process into its own cgroup for subsystem: memory
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.880331 14899 slave.cpp:322] Slave resources: ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-65535]; cpus(*):4; mem(*):2933; disk(*):10823
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.880523 14899 slave.cpp:351] Slave hostname: dcos-01
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.880553 14899 slave.cpp:352] Slave checkpoint: true
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883630 14903 state.cpp:35] Recovering state from '/var/lib/mesos/slave/meta'
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883815 14900 status_update_manager.cpp:197] Recovering status update manager
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883940 14904 containerizer.cpp:307] Recovering containerizer
> Jun 01 23:49:19 dcos-01 mesos-slave[14899]: I0601 23:49:19.883949 14907 docker.cpp:423] Recovering Docker containers
> Jun 01 23:49:24 dcos-01 mesos-slave[14899]: Failed to perform recovery: Collect failed: Collect failed: Failed to create pipe: Too many open files
> Jun 01 23:49:24 dcos-01 mesos-slave[14899]: To remedy this do as follows:
> Jun 01 23:49:24 dcos-01 mesos-slave[14899]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Jun 01 23:49:24 dcos-01 mesos-slave[14899]: This ensures slave doesn't recover old live executors.
> Jun 01 23:49:24 dcos-01 mesos-slave[14899]: Step 2: Restart the slave.
> Jun 01 23:49:24 dcos-01 systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Jun 01 23:49:24 dcos-01 systemd[1]: Unit mesos-slave.service entered failed state.
> Jun 01 23:49:24 dcos-01 systemd[1]: mesos-slave.service failed.
> =====lsof.log====
> mesos-sla 30306            root  563r     FIFO                0,9      0t0   10642859 pipe
> mesos-sla 30306            root  564r     FIFO                0,9      0t0   10642862 pipe
> mesos-sla 30306            root  565r     FIFO                0,9      0t0   10642861 pipe
> mesos-sla 30306            root  566r     FIFO                0,9      0t0   10642864 pipe
> mesos-sla 30306            root  567r     FIFO                0,9      0t0   10642863 pipe
> mesos-sla 30306            root  568r     FIFO                0,9      0t0   10642866 pipe
> mesos-sla 30306            root  569r     FIFO                0,9      0t0   10642865 pipe
> mesos-sla 30306            root  570r     FIFO                0,9      0t0   10642868 pipe
> mesos-sla 30306            root  571r     FIFO                0,9      0t0   10642867 pipe
> mesos-sla 30306            root  572r     FIFO                0,9      0t0   10642879 pipe
> mesos-sla 30306            root  573r     FIFO                0,9      0t0   10642869 pipe
> mesos-sla 30306            root  574r     FIFO                0,9      0t0   10642881 pipe
> mesos-sla 30306            root  575r     FIFO                0,9      0t0   10642880 pipe
> mesos-sla 30306            root  576r     FIFO                0,9      0t0   10642883 pipe
> mesos-sla 30306            root  577r     FIFO                0,9      0t0   10642882 pipe
> mesos-sla 30306            root  578r     FIFO                0,9      0t0   10642891 pipe
> mesos-sla 30306            root  579r     FIFO                0,9      0t0   10642884 pipe
> mesos-sla 30306            root  580r     FIFO                0,9      0t0   10642893 pipe
> mesos-sla 30306            root  581r     FIFO                0,9      0t0   10642892 pipe
> mesos-sla 30306            root  582r     FIFO                0,9      0t0   10642895 pipe
> mesos-sla 30306            root  583r     FIFO                0,9      0t0   10642894 pipe
> mesos-sla 30306            root  584r     FIFO                0,9      0t0   10642899 pipe
> mesos-sla 30306            root  585r     FIFO                0,9      0t0   10642896 pipe
> mesos-sla 30306            root  586r     FIFO                0,9      0t0   10642901 pipe
> mesos-sla 30306            root  587r     FIFO                0,9      0t0   10642900 pipe
> mesos-sla 30306            root  588r     FIFO                0,9      0t0   10642904 pipe
> mesos-sla 30306            root  589r     FIFO                0,9      0t0   10642902 pipe
> mesos-sla 30306            root  590r     FIFO                0,9      0t0   10642906 pipe
> mesos-sla 30306            root  591r     FIFO                0,9      0t0   10642905 pipe
> mesos-sla 30306            root  592r     FIFO                0,9      0t0   10642908 pipe
> mesos-sla 30306            root  593r     FIFO                0,9      0t0   10642907 pipe
> mesos-sla 30306            root  594r     FIFO                0,9      0t0   10642910 pipe
> mesos-sla 30306            root  595r     FIFO                0,9      0t0   10642909 pipe
> mesos-sla 30306            root  596r     FIFO                0,9      0t0   10642918 pipe
> mesos-sla 30306            root  597r     FIFO                0,9      0t0   10642911 pipe
> mesos-sla 30306            root  598r     FIFO                0,9      0t0   10642920 pipe
> mesos-sla 30306            root  599r     FIFO                0,9      0t0   10642919 pipe
> mesos-sla 30306            root  600r     FIFO                0,9      0t0   10642922 pipe
> mesos-sla 30306            root  601r     FIFO                0,9      0t0   10642921 pipe
> mesos-sla 30306            root  602r     FIFO                0,9      0t0   10642924 pipe
> mesos-sla 30306            root  603r     FIFO                0,9      0t0   10642923 pipe
> mesos-sla 30306            root  604r     FIFO                0,9      0t0   10642926 pipe
> mesos-sla 30306            root  605r     FIFO                0,9      0t0   10642925 pipe
> mesos-sla 30306            root  606r     FIFO                0,9      0t0   10642928 pipe
> mesos-sla 30306            root  607r     FIFO                0,9      0t0   10642927 pipe
> mesos-sla 30306            root  608r     FIFO                0,9      0t0   10642933 pipe
> mesos-sla 30306            root  609r     FIFO                0,9      0t0   10642929 pipe
> mesos-sla 30306            root  610r     FIFO                0,9      0t0   10642935 pipe
> mesos-sla 30306            root  611r     FIFO                0,9      0t0   10642934 pipe
> mesos-sla 30306            root  612r     FIFO                0,9      0t0   10642937 pipe
> mesos-sla 30306            root  613r     FIFO                0,9      0t0   10642936 pipe
> mesos-sla 30306            root  614r     FIFO                0,9      0t0   10642939 pipe
> mesos-sla 30306            root  615r     FIFO                0,9      0t0   10642938 pipe
> mesos-sla 30306            root  616r     FIFO                0,9      0t0   10642949 pipe
> mesos-sla 30306            root  617r     FIFO                0,9      0t0   10642940 pipe
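
A listing like the one above can be tallied with a short script to see how many pipe descriptors each process holds. The following is only a minimal sketch in Python, assuming lsof's default column order (COMMAND PID USER FD TYPE ...); the function names `count_fifo_fds` and `soft_nofile_limit` are illustrative and not part of Mesos or lsof. Note that `soft_nofile_limit` reports the limit of the process running the script, which is not necessarily the limit systemd applies to mesos-slave.

```python
import resource
from collections import Counter


def count_fifo_fds(lsof_lines):
    """Count FIFO (pipe) descriptors per PID in lsof-style output lines."""
    counts = Counter()
    for line in lsof_lines:
        # Drop any mail-quote prefix ("> ") before splitting into columns.
        fields = line.lstrip("> ").split()
        # Assumed columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        if len(fields) >= 5 and fields[1].isdigit() and fields[4] == "FIFO":
            counts[int(fields[1])] += 1
    return counts


def soft_nofile_limit():
    """Return the soft open-file limit of the *calling* process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


# Two lines copied from the lsof.log excerpt in the report.
sample = [
    "> mesos-sla 30306            root  563r     FIFO                0,9      0t0   10642859 pipe",
    "> mesos-sla 30306            root  564r     FIFO                0,9      0t0   10642862 pipe",
]

for pid, n in count_fifo_fds(sample).items():
    print(f"pid {pid}: {n} FIFO descriptors (soft limit here: {soft_nofile_limit()})")
```

Lines that do not look like an lsof record, such as a wrapped "pipe" continuation line, are skipped, so the function can be fed the excerpt either wrapped or unwrapped.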



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
