From: Benjamin Mahler
Date: Mon, 7 Mar 2016 17:54:24 -0800
To: user@mesos.apache.org
Subject: Re: mesos agent not recovering after ZK init failure

Very surprising.. I don't have any ideas other than trying to replicate the scenario in a test.

Please do keep us posted if you encounter it again and gain more information.

On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <spodila@netflix.com> wrote:
MESOS-4795 created.

I don't have the exit status. We haven't seen a repeat yet, will catch the exit status next time it happens.

Yes, removing the metadata directory was the only way it was resolved. This happened on multiple hosts requiring the same resolution.
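
For catching the exit status next time: if the existing keep-alive mechanism is, or can be wrapped by, a simple fork/exec supervisor, the child's exit code or terminating signal can be recorded with waitpid(). This is only a minimal sketch under that assumption (a process manager such as systemd already records this in its status output):

// Minimal keep-alive wrapper sketch (assumption: the supervisor is a plain fork/exec
// loop). It records how mesos-slave exited so a "silent" death still leaves an exit
// code or signal behind.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>

int main() {
  for (;;) {
    pid_t pid = fork();
    if (pid == -1) {
      std::perror("fork");
      return 1;
    }
    if (pid == 0) {
      // Hypothetical agent invocation; real flags elided.
      execlp("mesos-slave", "mesos-slave", static_cast<char*>(nullptr));
      _exit(127);  // exec failed
    }

    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
      std::perror("waitpid");
      return 1;
    }
    if (WIFEXITED(status)) {
      std::fprintf(stderr, "mesos-slave exited with status %d\n", WEXITSTATUS(status));
    } else if (WIFSIGNALED(status)) {
      std::fprintf(stderr, "mesos-slave killed by signal %d\n", WTERMSIG(status));
    }

    sleep(5);  // back off before restarting
  }
}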


On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmahler@apache.org> wrote:
Feel free to create one. I don't have enough information to know what the issue is without doing some further investigation, but if the situation you described is accurate it seems like there are two strange bugs:

-the silent exit (do you not have the exit status?), and
-the flapping from ZK errors that needed the metadata directory to be removed to resolve (are you convinced the removal of the meta directory is what solved it?)

It would be good to track these issues in case they crop up again.

On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spodila@netflix.com> wrote:
Hi Ben,

Let me know if there is a new issue created for this; I would like to add myself to watch it.
Thanks.



On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spodila@netflix.com> wrote:
Hi Ben,
That is accurate, with one additional line:
-Agent running fine with 0.24.1
-Transient ZK issues, slave flapping with zookeeper_init failure
-ZK issue resolved
-Most agents stop flapping and function correctly
-Some agents continue flapping, but silent exit after printing the detector.cpp:481 log line.
-The agents that continued to flap were repaired with manual removal of contents in mesos-slave's working dir



On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmahler@apache.org> wrote:
Hey Sharma,

I didn't quite follow the timeline of events here or how the agent logs you posted fit into it. Here's how I interpreted it:

-Agent running fine with 0.24.1
-Transient ZK issues, slave flapping with zookeeper_init failure
-ZK issue resolved
-Most agents stop flapping and function correctly
-Some agents continue flapping, but silent exit after printing the detector.cpp:481 log line.

Is this accurate? What is the exit code from the silent exit?

On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spodila@netflix.com> wrote:
Maybe related, but maybe different, since a new process seems to find the master leader and still aborts, never recovering with restarts until work dir data is removed.
It is happening in 0.24.1.



On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodkone@apache.org> wrote:
MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess you are saying it is somehow related but not exactly the same issue?

On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <rgs@itevenworks.net> wrote:
On 9 February 2016 at 11:04, Sharma Podila <spodila@netflix.com> wrote:
We had a few Mesos agents stuck in an unrecoverable state after a transient ZK init error. Is this a known problem? I wasn't able to find an existing JIRA item for this. We are on 0.24.1 at this time.

Most agents were fine, except a handful. These agents had their mesos-slave process constantly restarting. The .INFO logfile had the contents below before the process exited, with no error messages. The restarts were happening constantly due to an existing service keep-alive strategy.

To fix it, we manually stopped the service, removed the data in the working dir, and then restarted it. The mesos-slave process was then able to restart. The manual intervention needed to resolve this is problematic.
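
The cleanup described above amounts to roughly the following sketch. It is an illustration only, not an official procedure: it assumes the mesos-slave service has already been stopped, that the agent work dir is /mnt/data/mesos (as in the logs that follow), and that clearing the checkpointed meta/ state is acceptable, which also means the agent will not recover any tasks that were running on it.

// Illustrative cleanup of the agent's checkpointed state (assumptions: service already
// stopped, work dir is /mnt/data/mesos as in the logs below). Removing meta/ forces the
// agent to start fresh instead of attempting recovery from the old checkpoint.
#include <filesystem>
#include <iostream>
#include <system_error>

int main() {
  namespace fs = std::filesystem;

  const fs::path meta_dir = "/mnt/data/mesos/meta";  // <work_dir>/meta

  std::error_code ec;
  const auto removed = fs::remove_all(meta_dir, ec);  // recursive delete
  if (ec) {
    std::cerr << "Failed to remove " << meta_dir << ": " << ec.message() << "\n";
    return 1;
  }
  std::cout << "Removed " << removed << " entries under " << meta_dir << "\n";
  return 0;
}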

Here are the contents of the various log files on the agent:

The .INFO logfile for one of the restarts before the mesos-slave process exited with no other error messages:

Log file created at: 2016/02/09 02:12:48
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>
I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=master@10.230.95.110:7103) is detected


The .FATAL log file when the original transient ZK error occurred:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]


The .ERROR log file:
<= br>
Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

The .WARNING file had the same content.
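
A note on that FATAL line: zookeeper_init() is the ZooKeeper C client call that creates the session handle. It returns NULL and sets errno on failure, which matches the ": No such file or directory [2]" suffix in the logged message. The failing call presumably follows a pattern along these lines (a sketch for illustration only, not the actual Mesos source; the server list and timeout are placeholders):

// Sketch of the zookeeper_init() failure path (illustrative; placeholder servers/timeout).
#include <zookeeper/zookeeper.h>

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Watcher for connection/session events; a real client reacts to ZOO_CONNECTED_STATE etc.
static void watcher(zhandle_t* zh, int type, int state, const char* path, void* context) {}

int main() {
  const char* servers = "zk1:2181,zk2:2181,zk3:2181";  // placeholder ensemble

  zhandle_t* zh = zookeeper_init(servers, watcher, 10000 /* session timeout ms */,
                                 nullptr /* client id */, nullptr /* context */, 0 /* flags */);
  if (zh == nullptr) {
    // zookeeper_init() reports failure via errno; the agents here logged errno 2
    // (ENOENT). What exactly was missing was never identified in this thread.
    std::fprintf(stderr, "Failed to create ZooKeeper, zookeeper_init: %s [%d]\n",
                 std::strerror(errno), errno);
    std::abort();
  }

  // ... use the handle ...
  zookeeper_close(zh);
  return 0;
}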

Maybe related: https://issues.apache.org/jira/browse/MESOS-1326

-rgs








