mesos-user mailing list archives

From Sharma Podila <spod...@netflix.com>
Subject Re: mesos agent not recovering after ZK init failure
Date Sat, 27 Feb 2016 00:34:08 GMT
MESOS-4795 created.

I don't have the exit status. We haven't seen a repeat yet; we'll catch the
exit status next time it happens.
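
In the meantime, here's a minimal sketch of the kind of keep-alive wrapper
that would record the exit status on each restart. It's an illustration
only: the binary path, flags, and logging destination are assumptions, not
our actual setup (only the work dir matches the logs quoted below).

#!/usr/bin/env python
# Hypothetical keep-alive wrapper that records the agent's exit status
# on every restart. Command line and log destination are placeholders.
import subprocess
import syslog
import time

CMD = ["/usr/sbin/mesos-slave", "--work_dir=/mnt/data/mesos"]

while True:
    status = subprocess.call(CMD)
    # subprocess.call() returns the exit code; a negative value means
    # the process died on a signal (e.g. -6 for SIGABRT).
    syslog.syslog(syslog.LOG_WARNING,
                  "mesos-slave exited with status %d" % status)
    time.sleep(5)  # brief backoff before the next restart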

Yes, removing the metadata directory was the only way it was resolved. This
happened on multiple hosts, each requiring the same resolution.
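
For anyone else hitting this, the fix amounted to roughly the following.
A sketch only: it assumes a systemd-managed unit named "mesos-slave" (an
assumption on my part) and the work dir shown in the agent logs.

#!/usr/bin/env python
# Sketch of the manual recovery: stop the agent, remove the checkpointed
# metadata under the work dir, then restart. Note that removing meta/
# discards the agent's checkpointed state, so it re-registers as a new
# agent -- which is why needing this manual step is problematic.
import os
import shutil
import subprocess

WORK_DIR = "/mnt/data/mesos"  # matches the meta dir in the logs below

subprocess.check_call(["systemctl", "stop", "mesos-slave"])
shutil.rmtree(os.path.join(WORK_DIR, "meta"))
subprocess.check_call(["systemctl", "start", "mesos-slave"])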


On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmahler@apache.org> wrote:

> Feel free to create one. I don't have enough information to know what the
> issue is without doing some further investigation, but if the situation you
> described is accurate, it seems like there are two strange bugs:
>
> -the silent exit (do you not have the exit status?), and
> -the flapping from ZK errors that needed the metadata directory to be
> removed to resolve (are you convinced the removal of the meta directory is
> what solved it?)
>
> It would be good to track these issues in case they crop up again.
>
> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spodila@netflix.com>
> wrote:
>
>> Hi Ben,
>>
>> Let me know if there is a new issue created for this; I would like to add
>> myself to watch it.
>> Thanks.
>>
>>
>>
>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spodila@netflix.com>
>> wrote:
>>
>>> Hi Ben,
>>>
>>> That is accurate, with one additional line:
>>>
>>> -Agent running fine with 0.24.1
>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>> -ZK issue resolved
>>> -Most agents stop flapping and function correctly
>>> -Some agents continue flapping, but silent exit after printing the
>>> detector.cpp:481 log line.
>>> -The agents that continued to flap were repaired with manual removal of
>>> contents in mesos-slave's working dir
>>>
>>>
>>>
>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmahler@apache.org>
>>> wrote:
>>>
>>>> Hey Sharma,
>>>>
>>>> I didn't quite follow the timeline of events here, or how the agent logs
>>>> you posted fit into it. Here's how I interpreted it:
>>>>
>>>> -Agent running fine with 0.24.1
>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>> -ZK issue resolved
>>>> -Most agents stop flapping and function correctly
>>>> -Some agents continue flapping, but silent exit after printing the
>>>> detector.cpp:481 log line.
>>>>
>>>> Is this accurate? What is the exit code from the silent exit?
>>>>
>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spodila@netflix.com>
>>>> wrote:
>>>>
>>>>> Maybe related, but maybe different, since a new process seems to find
>>>>> the master leader and still aborts, never recovering across restarts
>>>>> until the work dir data is removed.
>>>>> It is happening in 0.24.1.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodkone@apache.org>
>>>>> wrote:
>>>>>
>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess
>>>>>> you are saying it is somehow related but not exactly the same issue?
>>>>>>
>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>>> rgs@itevenworks.net> wrote:
>>>>>>
>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <spodila@netflix.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We had a few mesos agents stuck in an unrecoverable state after a
>>>>>>>> transient ZK init error. Is this a known problem? I wasn't able to
>>>>>>>> find an existing jira item for this. We are on 0.24.1 at this time.
>>>>>>>>
>>>>>>>> Most agents were fine, except a handful. These handful of agents
>>>>>>>> had their mesos-slave process constantly restarting. The .INFO
>>>>>>>> logfile had the following contents below, before the process
>>>>>>>> exited, with no error messages. The restarts were happening
>>>>>>>> constantly due to an existing service keep alive strategy.
>>>>>>>>
>>>>>>>> To fix it, we manually stopped the service, removed the data in the
>>>>>>>> working dir, and then restarted it. The mesos-slave process was
>>>>>>>> able to restart then. The manual intervention needed to resolve it
>>>>>>>> is problematic.
>>>>>>>>
>>>>>>>> Here's the contents of the various log files on the agent:
>>>>>>>>
>>>>>>>> The .INFO logfile for one of the restarts before mesos-slave
>>>>>>>> process exited with no other error messages:
>>>>>>>>
>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=master@10.230.95.110:7103) is detected
>>>>>>>>
>>>>>>>>
>>>>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>>>>
>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>
>>>>>>>>
>>>>>>>> The .ERROR log file:
>>>>>>>>
>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>
>>>>>>>> The .WARNING file had the same content.
>>>>>>>>
>>>>>>>
>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>
>>>>>>>
>>>>>>> -rgs
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
