From: David McLaughlin
Date: Sat, 15 Jul 2017 09:21:21 -0700
Subject: Re: Aurora reconciliation and Master fail over
To: dev@aurora.apache.org
Cc: Renan DelValle

Yes, we've confirmed this internally too (Santhosh did the work here):

> When an agent becomes unreachable while the master is running, the master
> sends TASK_LOST events for each task on the agent.
> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>
> Marking an agent unreachable after failover does not cause TASK_LOST events.
> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>
> Once an agent re-registers after a master failover, the agent sends
> TASK_LOST events for tasks that it does not know about.
> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383

The separate code path for markUnreachableAfterFailover appears to have been
added by this commit:
https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c

And I think this totally breaks the promise of introducing the
PARTITION_AWARE stuff in a backwards-compatible way.
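To make the impact concrete, here is a rough sketch (a hypothetical framework
against the V0 Java scheduler API, not Aurora's actual scheduler
implementation) of the two callbacks involved. Only statusUpdate names a
TaskID, so it is the only callback a framework can reschedule from directly:

import java.util.List;

import org.apache.mesos.Protos.ExecutorID;
import org.apache.mesos.Protos.FrameworkID;
import org.apache.mesos.Protos.MasterInfo;
import org.apache.mesos.Protos.Offer;
import org.apache.mesos.Protos.OfferID;
import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

/** Hypothetical framework scheduler, not Aurora's implementation. */
public class LossHandlingScheduler implements Scheduler {

  @Override
  public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
    // The only callback that carries a TaskID. When an agent becomes
    // unreachable while the master stays up, the master delivers a TASK_LOST
    // here for every task on that agent, and the framework can reschedule.
    if (status.getState() == TaskState.TASK_LOST) {
      System.out.println("would reschedule " + status.getTaskId().getValue());
    }
  }

  @Override
  public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {
    // In the markUnreachableAfterFailover path this may be all the framework
    // gets. It identifies the agent only, so a framework that reschedules
    // from TASK_LOST cannot replace anything here without consulting its own
    // task store or waiting for reconciliation.
    System.out.println("agent lost: " + slaveId.getValue());
  }

  // Remaining Scheduler callbacks stubbed out for brevity.
  @Override public void registered(SchedulerDriver d, FrameworkID f, MasterInfo m) {}
  @Override public void reregistered(SchedulerDriver d, MasterInfo m) {}
  @Override public void resourceOffers(SchedulerDriver d, List<Offer> offers) {}
  @Override public void offerRescinded(SchedulerDriver d, OfferID o) {}
  @Override public void frameworkMessage(SchedulerDriver d, ExecutorID e, SlaveID s, byte[] data) {}
  @Override public void disconnected(SchedulerDriver d) {}
  @Override public void executorLost(SchedulerDriver d, ExecutorID e, SlaveID s, int status) {}
  @Override public void error(SchedulerDriver d, String message) {}
}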
So right now, yes, we rely on reconciliation to finally mark the tasks as
LOST and reschedule their replacements. I think the only reason we haven't
been more impacted by this at Twitter is that our Mesos master is remarkably
stable (compared to Aurora's daily failovers).

We have two paths forward here: push forward and embrace the new
partition-awareness features in Aurora, and/or push back on the above change
with the Mesos community and get a better story for non-partition-aware APIs
in the short term.

On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya
<meghdoot_b@yahoo.com.invalid> wrote:

> We can reproduce it easily; the steps are:
> 1. Shut down the leading Mesos master.
> 2. Shut down an agent at the same time.
> 3. Wait for 10 mins.
>
> What Renan and I saw in the logs was only agent lost sent, and no task
> lost. In the regular health-check-expiry scenario, both task lost and
> agent lost were sent.
>
> So yes, this is very concerning.
>
> Thx
>
> > On Jul 14, 2017, at 10:28 AM, David McLaughlin wrote:
> >
> > It would be interesting to see the logs. I think that will tell you if
> > the Mesos master is:
> >
> > a) Sending slaveLost
> > b) Trying to send TASK_LOST
> >
> > And then the Scheduler logs (and/or the metrics it exports) should tell
> > you whether those events were received. If this is reproducible, I'd
> > consider it a serious bug.
> >
> > On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <
> > meghdoot_b@yahoo.com.invalid> wrote:
> >
> >> So in this situation, why is Aurora not replacing the tasks, instead
> >> waiting for external recon to fix it?
> >>
> >> This is different from when the 75 sec (5*15) health check of the slave
> >> times out (no master failover); there, Aurora replaces the task on the
> >> task lost message.
> >>
> >> Are you hinting that we should ask the Mesos folks why, in the master
> >> failover re-registration timeout scenario, task lost is not sent even
> >> though slave lost is sent, when per the docs below task lost should
> >> have been sent?
> >>
> >> Because either Mesos is not sending the right status or Aurora is not
> >> handling it.
> >>
> >> Thx
> >>
> >>> On Jul 14, 2017, at 8:21 AM, David McLaughlin wrote:
> >>>
> >>> "1. When mesos sends slave lost after 10 mins in this situation, why
> >>> does aurora not act on it?"
> >>>
> >>> Because Mesos also sends TASK_LOST for every task running on the agent
> >>> whenever it calls slaveLost:
> >>>
> >>> "When it is time to remove an agent, the master removes the agent from
> >>> the list of registered agents in the master's durable state
> >>> <http://mesos.apache.org/documentation/latest/replicated-log-internals/>
> >>> (this will survive master failover). The master sends a slaveLost
> >>> callback to every registered scheduler driver; it also sends TASK_LOST
> >>> status updates for every task that was running on the removed agent."
> >>>
> >>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <
> >>> meghdoot_b@yahoo.com.invalid> wrote:
> >>>
> >>>> We were investigating slave re-registration behavior on master
> >>>> failover in Aurora 0.17 with Mesos 1.1.
> >>>>
> >>>> A few important points:
> >>>>
> >>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
> >>>> ("If an agent does not reregister with the new master within a timeout
> >>>> (controlled by the --agent_reregister_timeout configuration flag), the
> >>>> master marks the agent as failed and follows the same steps described
> >>>> above.
> >>>> However, there is one difference: by default, agents are allowed to
> >>>> reconnect following master failover, even after the
> >>>> agent_reregister_timeout has fired. This means that frameworks might
> >>>> see a TASK_LOST update for a task but then later discover that the
> >>>> task is running (because the agent where it was running was allowed
> >>>> to reconnect).")
> >>>>
> >>>> http://mesos.apache.org/documentation/latest/reconciliation/
> >>>> ("Implicit reconciliation (passing an empty list) should also be used
> >>>> periodically, as a defense against data loss in the framework. Unless
> >>>> a strict registry is in use on the master, it's possible for tasks to
> >>>> resurrect from a LOST state (without a strict registry the master does
> >>>> not enforce agent removal across failovers). When an unknown task is
> >>>> encountered, the scheduler should kill or recover the task.")
> >>>>
> >>>> https://issues.apache.org/jira/browse/MESOS-5951
> >>>> (removes the strict registry mode flag in 1.1 and reverts to the old
> >>>> non-strict registry behavior, where tasks and executors are not killed
> >>>> on agent re-registration timeout after master failover)
> >>>>
> >>>> So, what we find, if the slave does not come back after 10 mins:
> >>>> 1. The Mesos master sends slave lost but not task lost to Aurora.
> >>>> 2. Aurora does not replace the tasks.
> >>>> 3. Only when explicit recon starts does this get corrected, with
> >>>> Aurora spawning replacement tasks.
> >>>>
> >>>> If the slave restarts after 10 mins:
> >>>> 1. When implicit recon starts, this situation gets fixed: Aurora has
> >>>> marked the tasks as lost, Mesos reports them as running, and those
> >>>> tasks get killed and replaced.
> >>>>
> >>>> So, questions:
> >>>> 1. When Mesos sends slave lost after 10 mins in this situation, why
> >>>> does Aurora not act on it?
> >>>> 2. As per the recon docs' best practices, explicit recon should run
> >>>> followed by implicit recon on master failover. It looks like Aurora is
> >>>> not doing that; only the regular hourly recons run, with a 30 min
> >>>> spread between explicit and implicit. Should Aurora do recon on master
> >>>> failover?
> >>>>
> >>>> General questions:
> >>>> 1. What is the effect on Aurora if we run explicit recon every 15 mins
> >>>> instead of the default 1 hr? Does it slow down scheduling, does
> >>>> snapshot creation get delayed, etc.?
> >>>> 2. Any issue if the spread between explicit recon and implicit recon
> >>>> is brought down to 2 mins from 30 mins? Probably depends on 1.
> >>>>
> >>>> Thx
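For anyone following along, here is a minimal sketch of the two
reconciliation calls being discussed, using the V0 Java SchedulerDriver API
directly. This is illustrative only; the class and method names are made up
and it is not Aurora's actual reconciliation code:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

/** Illustrative helper only; not Aurora's reconciliation implementation. */
public final class ReconciliationSketch {

  /**
   * Explicit reconciliation: ask the master for the current state of the
   * specific tasks the framework believes exist. The master replies with a
   * statusUpdate per task; for a task on an agent the master has marked
   * unreachable, a non-partition-aware framework gets TASK_LOST back.
   */
  public static void explicitReconcile(SchedulerDriver driver, List<TaskID> believedActive) {
    Collection<TaskStatus> statuses = new ArrayList<>();
    for (TaskID id : believedActive) {
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(id)
          // state is a required proto field; the master ignores its value
          // in reconciliation requests.
          .setState(TaskState.TASK_RUNNING)
          .build());
    }
    driver.reconcileTasks(statuses);
  }

  /**
   * Implicit reconciliation: an empty collection asks the master to send an
   * update for every task it knows about, which lets the framework catch
   * tasks that "resurrect" after it has already replaced them.
   */
  public static void implicitReconcile(SchedulerDriver driver) {
    driver.reconcileTasks(Collections.<TaskStatus>emptyList());
  }
}

As I understand it, once the re-registration timeout has fired and the agent
is marked unreachable, an explicit pass like the one above should come back
with TASK_LOST for those tasks, which is what eventually unblocks
rescheduling; running one on master failover, as suggested in the thread,
would just move that forward instead of waiting for the hourly schedule.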