mesos-user mailing list archives

From Sharma Podila <spod...@netflix.com>
Subject Re: mesos agent not recovering after ZK init failure
Date Wed, 10 Feb 2016 17:54:37 GMT
Hi Ben,

That is accurate, with one additional line:

-Agent running fine with 0.24.1
-Transient ZK issues, slave flapping with zookeeper_init failure
-ZK issue resolved
-Most agents stop flapping and function correctly
-Some agents continue flapping, but exit silently after printing the
detector.cpp:481 log line.
-The agents that continued to flap were repaired by manually removing the
contents of the mesos-slave's working dir
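For reference, the manual repair step can be sketched roughly as below. The service name and the --work_dir default (/mnt/data/mesos, taken from the logs further down) are assumptions; adjust both to match your deployment.

```shell
# Rough sketch of the manual fix, NOT an official Mesos tool.
clear_agent_state() {
  # $1: the agent's --work_dir (default is an assumption from the logs below)
  work_dir="${1:-/mnt/data/mesos}"
  # Wipe the agent's checkpointed state so the next start re-registers
  # cleanly. Removing only "${work_dir}/meta/slaves/latest" is a
  # lighter-weight variant; we removed the working dir contents entirely.
  rm -rf "${work_dir:?}"/*
}

# Stop the keep-alive supervised service first so it cannot restart the
# agent mid-wipe, e.g.:
#   sudo service mesos-slave stop
#   clear_agent_state /mnt/data/mesos
#   sudo service mesos-slave start
```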


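To answer the exit-code question below on the next occurrence, the keep-alive script could wrap the agent launch along these lines (the wrapper, its name, and the log path are all assumptions, not part of our current setup):

```shell
# Hypothetical wrapper to record the agent's exit status on each
# silent exit; the log file path is an assumption.
run_and_log() {
  # $1: file to append exit statuses to; remaining args: the agent command
  log_file="$1"; shift
  "$@"
  status=$?
  echo "mesos-slave exited with status ${status}" >> "${log_file}"
  return "${status}"
}

# e.g. inside the keep-alive loop:
#   run_and_log /var/log/mesos/exit-codes.log \
#     /usr/sbin/mesos-slave --work_dir=/mnt/data/mesos ...
```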

On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmahler@apache.org> wrote:

> Hey Sharma,
>
> I didn't quite follow the timeline of events here or how the agent logs
> you posted fit into the timeline of events. Here's how I interpreted:
>
> -Agent running fine with 0.24.1
> -Transient ZK issues, slave flapping with zookeeper_init failure
> -ZK issue resolved
> -Most agents stop flapping and function correctly
> -Some agents continue flapping, but silent exit after printing the
> detector.cpp:481 log line.
>
> Is this accurate? What is the exit code from the silent exit?
>
> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spodila@netflix.com> wrote:
>
>> Maybe related, but maybe different, since a new process seems to find the
>> master leader and still aborts, never recovering across restarts until the
>> work dir data is removed.
>> It is happening in 0.24.1.
>>
>>
>>
>>
>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodkone@apache.org> wrote:
>>
>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess
>>> you are saying it is somehow related but not exactly the same issue?
>>>
>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>> rgs@itevenworks.net> wrote:
>>>
>>>> On 9 February 2016 at 11:04, Sharma Podila <spodila@netflix.com> wrote:
>>>>
>>>>> We had a few mesos agents stuck in an unrecoverable state after a
>>>>> transient ZK init error. Is this a known problem? I wasn't able to find
>>>>> an existing jira item for this. We are on 0.24.1 at this time.
>>>>>
>>>>> Most agents were fine, except a handful. These handful of agents had
>>>>> their mesos-slave process constantly restarting. The .INFO logfile had
>>>>> the following contents below, before the process exited, with no error
>>>>> following contents below, before the process exited, with no error
>>>>> messages. The restarts were happening constantly due to an existing service
>>>>> keep alive strategy.
>>>>>
>>>>> To fix it, we manually stopped the service, removed the data in the
>>>>> working dir, and then restarted it. The mesos-slave process was able
>>>>> to restart then. The manual intervention needed to resolve it is
>>>>> problematic.
>>>>>
>>>>> Here's the contents of the various log files on the agent:
>>>>>
>>>>> The .INFO logfile for one of the restarts before mesos-slave process
>>>>> exited with no other error messages:
>>>>>
>>>>> Log file created at: 2016/02/09 02:12:48
>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging
>>>>> started!
>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07
>>>>> by builds
>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation:
>>>>> posix/cpu,posix/mem,filesystem/posix
>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
>>>>> 10.138.146.230:7101
>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
>>>>> --appc_store_dir="/tmp/mesos/store/appc"
>>>>> --attributes="region:us-east-1;<snip>" --authenticatee="<snip>"
>>>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>>>> --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources:
>>>>> ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@
>>>>> 10.138.146.230:7101) connected to ZooKeeper
>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations:
>>>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path
>>>>> '/titus/main/mesos' in ZooKeeper
>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader:
>>>>> (id='209')
>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get
>>>>> '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from
>>>>> '/mnt/data/mesos/meta'
>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources
>>>>> file '/mnt/data/mesos/meta/resources/resources.info'
>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master
>>>>> (UPID=master@10.230.95.110:7103) is detected
>>>>>
>>>>>
>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>
>>>>> Log file created at: 2016/02/05 17:21:37
>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>
>>>>>
>>>>> The .ERROR log file:
>>>>>
>>>>> Log file created at: 2016/02/05 17:21:37
>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>
>>>>> The .WARNING file had the same content.
>>>>>
>>>>
>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>
>>>>
>>>> -rgs
>>>>
>>>>
>>>
>>
>
