mesos-user mailing list archives

From Sharma Podila <spod...@netflix.com>
Subject Re: mesos agent not recovering after ZK init failure
Date Fri, 15 Jul 2016 19:22:27 GMT
Vinod,

MESOS-5854 <https://issues.apache.org/jira/browse/MESOS-5854> created. Feel
free to change the priority appropriately.

Yes, the workaround I mentioned for disk size is based on resource
specification, so that works for now.
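
For reference, a minimal sketch of that kind of resource pinning, assuming the
agent is launched by a small wrapper (the ZooKeeper hosts here are placeholders;
the resource values mirror the agent log quoted later in this thread):

    import subprocess

    # Pin the disk resource explicitly via --resources so the advertised total
    # does not depend on probing the filesystem at startup.
    resources = "ports(*):[7150-7200];mem(*):240135;cpus(*):32;disk(*):586104"

    subprocess.check_call([
        "mesos-slave",
        "--master=zk://zk1:2181,zk2:2181/titus/main/mesos",  # placeholder ZK hosts
        "--work_dir=/mnt/data/mesos",
        "--resources=" + resources,
    ])

With the total fixed this way, small drifts in what the filesystem reports no
longer change what the agent checkpoints.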


On Fri, Jul 15, 2016 at 11:48 AM, Andrew Leung <aleung@netflix.com> wrote:

> Hi Jie,
>
> Yes, that is how we are working around this issue. However, we wanted to
> see if others were hitting this issue as well. If others had a similar
> Mesos Slave on ZFS setup, it might be worth considering a disk space
> calculation approach that works more reliably with ZFS or at least calling
> out the need to specify the disk resource explicitly.
>
> Thanks for the help.
> Andrew
>
> On Jul 15, 2016, at 11:41 AM, Jie Yu <yujie.jay@gmail.com> wrote:
>
> Can you hard code your disk size using --resources flag?
>
>
> On Fri, Jul 15, 2016 at 11:31 AM, Sharma Podila <spodila@netflix.com>
> wrote:
>
>> We had this issue happen again and were able to debug further. The cause
>> for the agent not being able to restart is that one of the resources (disk)
>> changed its total size since the last restart. However, this error does not
>> show up in INFO/WARN/ERROR files. We saw it in stdout only when manually
>> restarting the agent. It would be good to have all messages going to
>> stdout/stderr show up in the logs. Is there a config setting for it that I
>> missed?
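
A minimal sketch, not the actual Mesos agent code, of the kind of check that
trips recovery here: with checkpointing enabled, the resources recorded from
the previous run have to match what the agent computes on restart, so even a
few bytes of drift in the disk total is enough to abort.

    # Illustrative values only; the disk figure mirrors the agent log quoted below.
    checkpointed = {"cpus": 32, "mem": 240135, "disk": 586104}  # recorded in the meta dir
    recomputed = {"cpus": 32, "mem": 240135, "disk": 586103}    # re-probed at startup

    if checkpointed != recomputed:
        raise SystemExit("checkpointed resources do not match current resources; "
                         "refusing to recover the previous run")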
>>
>> The disk size total is changing sometimes on our agents. It is off by a
>> few bytes (seeing ~10 bytes difference out of, say, 600 GB). We use ZFS on
>> our agents to manage the disk partition. From my colleague, Andrew (copied
>> here):
>>
>>> The current Mesos approach (i.e., `statvfs()` for total blocks and assume
>>> that never changes) won’t work reliably on ZFS
>>>
>>
>> Anyone else experience this? We can likely hack a workaround for this by
>> reporting the "whole GBs" of the disk so we are insensitive to small
>> changes in the total size. But, not sure if the changes can be larger due
>> to Andrew's point above.
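
A minimal sketch of that "whole GBs" idea, assuming the agent work dir sits on
the ZFS-backed partition at /mnt/data/mesos (the path used elsewhere in this
thread): probe the total the same statvfs-based way, then truncate to whole
gigabytes so a ~10-byte drift between restarts cannot change the advertised
value.

    import os

    def disk_total_bytes(path="/mnt/data/mesos"):
        # Same quantity the statvfs-based probe uses: total blocks times fragment size.
        st = os.statvfs(path)
        return st.f_blocks * st.f_frsize

    def disk_whole_gb(path="/mnt/data/mesos"):
        # Truncate to whole GB so small changes in f_blocks are absorbed.
        return disk_total_bytes(path) // (1024 ** 3)

    if __name__ == "__main__":
        # Mesos expresses disk in MB, so convert the rounded figure back.
        print("disk(*):%d" % (disk_whole_gb() * 1024))

Whether whole-GB granularity is coarse enough depends on how large the ZFS
drift can actually get, per Andrew's point.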
>>
>>
>> On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <spodila@netflix.com>
>> wrote:
>>
>>> Sure, will do.
>>>
>>>
>>> On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <bmahler@apache.org>
>>> wrote:
>>>
>>>> Very surprising... I don't have any ideas other than trying to replicate
>>>> the scenario in a test.
>>>>
>>>> Please do keep us posted if you encounter it again and gain more
>>>> information.
>>>>
>>>> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <spodila@netflix.com>
>>>> wrote:
>>>>
>>>>> MESOS-4795 created.
>>>>>
>>>>> I don't have the exit status. We haven't seen a repeat yet, will catch
>>>>> the exit status next time it happens.
>>>>>
>>>>> Yes, removing the metadata directory was the only way it was resolved.
>>>>> This happened on multiple hosts requiring the same resolution.
>>>>>
>>>>>
>>>>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmahler@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Feel free to create one. I don't have enough information to know what
>>>>>> the issue is without doing some further investigation, but if the
>>>>>> situation you described is accurate it seems like there are two strange bugs:
>>>>>>
>>>>>> -the silent exit (do you not have the exit status?), and
>>>>>> -the flapping from ZK errors that needed the meta data directory to
>>>>>> be removed to resolve (are you convinced the removal of the meta
>>>>>> directory is what solved it?)
>>>>>>
>>>>>> It would be good to track these issues in case they crop up again.
>>>>>>
>>>>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spodila@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ben,
>>>>>>>
>>>>>>> Let me know if there is a new issue created for this, I would like
>>>>>>> to add myself to watch it.
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spodila@netflix.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Ben,
>>>>>>>>
>>>>>>>> That is accurate, with one additional line:
>>>>>>>>
>>>>>>>> -Agent running fine with 0.24.1
>>>>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>>> -ZK issue resolved
>>>>>>>> -Most agents stop flapping and function correctly
>>>>>>>> -Some agents continue flapping, but silent exit after printing the
>>>>>>>> detector.cpp:481 log line.
>>>>>>>> -The agents that continue to flap were repaired with manual removal
>>>>>>>> of contents in mesos-slave's working dir
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <
>>>>>>>> bmahler@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hey Sharma,
>>>>>>>>>
>>>>>>>>> I didn't quite follow the timeline of events here or how the agent
>>>>>>>>> logs you posted fit into the timeline of events. Here's how I
>>>>>>>>> interpreted it:
>>>>>>>>>
>>>>>>>>> -Agent running fine with 0.24.1
>>>>>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>>>> -ZK issue resolved
>>>>>>>>> -Most agents stop flapping and function correctly
>>>>>>>>> -Some agents continue flapping, but silent exit after printing the
>>>>>>>>> detector.cpp:481 log line.
>>>>>>>>>
>>>>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>>>>
>>>>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spodila@netflix.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Maybe related, but maybe different, since a new process seems to
>>>>>>>>>> find the master leader and still aborts, never recovering with
>>>>>>>>>> restarts until work dir data is removed.
>>>>>>>>>> It is happening in 0.24.1.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodkone@apache.org
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>>>>> guess you are saying it is somehow related but not exactly the
>>>>>>>>>>> same issue?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>>>>>>>> rgs@itevenworks.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <spodila@netflix.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We had a few mesos agents stuck in an unrecoverable state after
>>>>>>>>>>>>> a transient ZK init error. Is this a known problem? I wasn't
>>>>>>>>>>>>> able to find an existing jira item for this. We are on 0.24.1
>>>>>>>>>>>>> at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most agents were fine, except a handful. This handful of agents
>>>>>>>>>>>>> had their mesos-slave process constantly restarting. The .INFO
>>>>>>>>>>>>> logfile had the contents below, before the process exited, with
>>>>>>>>>>>>> no error messages. The restarts were happening constantly due
>>>>>>>>>>>>> to an existing service keep-alive strategy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To fix it, we manually stopped the service, removed the data in
>>>>>>>>>>>>> the working dir, and then restarted it. The mesos-slave process
>>>>>>>>>>>>> was able to restart then. The manual intervention needed to
>>>>>>>>>>>>> resolve it is problematic.
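
For completeness, a minimal sketch of that manual fix, assuming a
systemd-managed agent (the unit name is an example; the meta directory path
matches the recovery log below):

    import shutil
    import subprocess

    # Stop the agent (or its keep-alive supervisor), wipe the checkpointed
    # state it refuses to recover from, then start it again.
    subprocess.check_call(["systemctl", "stop", "mesos-slave"])
    shutil.rmtree("/mnt/data/mesos/meta", ignore_errors=True)
    subprocess.check_call(["systemctl", "start", "mesos-slave"])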
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the contents of the various log files on the agent:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .INFO logfile for one of the restarts before the mesos-slave
>>>>>>>>>>>>> process exited with no other error messages:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>>>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=master@10.230.95.110:7103) is detected
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .ERROR log file:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .WARNING file had the same content.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -rgs
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
