From: Renan DelValle
Date: Tue, 18 Jul 2017 10:44:59 -0700
Subject: Re: Aurora reconciliation and Master fail over
To: David McLaughlin
Cc: dev@aurora.apache.org

Yup, that looks like the way to go. Going to go ahead and file a ticket on
JIRA for this so that we don't forget.

Thanks for digging into this David.

-Renan

On Mon, Jul 17, 2017 at 3:00 PM, David McLaughlin wrote:

> Based on the thread in the Mesos dev list, it looks like they don't
> persist task information, so they don't have the task IDs to send when
> they detect the agent is lost during failover. So unless this is changed
> on the Mesos side, we need to act on the slaveLost message and mark all
> those tasks as LOST in Aurora.
>
> Or rely on reconciliation. To reconcile more often, you should keep in
> mind:
>
> 1) Implicit reconciliation sends one message to Mesos and Mesos replies
> with N status updates immediately, where N = the number of running tasks.
> This process is usually quick (on the order of seconds) because the
> updates are mostly no-ops. When you have a large number of running tasks
> (say 100k+), you may see some GC pressure due to the flood of status
> updates. If this operation overlaps with another particularly expensive
> operation (like a snapshot), it can cause a huge stop-the-world GC pause.
> But it does not otherwise interfere with any operation.
>
> 2) Explicit reconciliation is done in batches, where Aurora batches up
> all running tasks and sends one batch at a time, staggered by some delay.
> The benefit is less GC pressure, but the drawback is that if you have a
> lot of running tasks (again, 100k+), it will take over 10 minutes to
> complete. So you have to make sure your reconciliation interval is
> aligned with this (you can always increase the batch size to make this
> happen faster).
>
> Cheers,
> David
>
> On Sun, Jul 16, 2017 at 11:10 AM, Meghdoot bhattacharya <
> meghdoot_b@yahoo.com.invalid> wrote:
>
>> Got it. Thx!
>>
>> > On Jul 16, 2017, at 9:49 AM, Stephan Erb wrote:
>> >
>> > Reconciliation in Aurora is not a specific mode. It just runs
>> > concurrently with other background work such as snapshots or backups
>> > [1].
>> >
>> > Just be aware that we don't have metrics to track the runtime of
>> > explicit and implicit reconciliations. If you use settings that are
>> > overly aggressive, you might overload Aurora's queue of incoming Mesos
>> > status updates (for example).
>> >
>> > [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
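To make the two modes David describes above concrete, here is a rough sketch
written directly against the Mesos SchedulerDriver API (not Aurora's
TaskReconciler); the batch size and delay values are illustrative only:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

final class ReconciliationSketch {

  // Implicit: one empty request; the master immediately streams back one
  // status update per task it knows about.
  static void implicitReconcile(SchedulerDriver driver) {
    driver.reconcileTasks(Collections.<TaskStatus>emptyList());
  }

  // Explicit: ask about specific task IDs in batches, staggered by a delay
  // so the scheduler is not flooded with status updates all at once.
  static void explicitReconcile(SchedulerDriver driver, List<TaskID> knownTasks,
      int batchSize, long batchDelayMs) throws InterruptedException {
    for (int i = 0; i < knownTasks.size(); i += batchSize) {
      Collection<TaskStatus> batch = new ArrayList<>();
      for (TaskID id : knownTasks.subList(i, Math.min(i + batchSize, knownTasks.size()))) {
        batch.add(TaskStatus.newBuilder()
            .setTaskId(id)
            .setState(TaskState.TASK_RUNNING) // last known state; Mesos replies with the real one
            .build());
      }
      driver.reconcileTasks(batch);
      Thread.sleep(batchDelayMs); // spread the batches out to limit GC pressure
    }
  }
}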
>> >
>> >> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
>> >> Thx David for the follow up and confirmation.
>> >> We have started the thread on the Mesos dev DL.
>> >>
>> >> So to get clarification on the recon: what, in general, is the effect
>> >> while the recon runs? Are scheduling and activities like snapshots
>> >> paused while recon takes place? Trying to see whether to run
>> >> aggressive recon in the meantime.
>> >>
>> >> Thx
>> >>
>> >>> On Jul 15, 2017, at 9:33 AM, David McLaughlin wrote:
>> >>>
>> >>> I've left a comment on the initial RB detailing how the change broke
>> >>> backwards-compatibility. Given that the tasks are marked as lost as
>> >>> soon as the agent reregisters after slaveLost is sent anyway, there
>> >>> doesn't seem to be any reason not to send TASK_LOST too. I think this
>> >>> should be an easy fix.
>> >>>
>> >>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin wrote:
>> >>>
>> >>>> Yes, we've confirmed this internally too (Santhosh did the work
>> >>>> here):
>> >>>>
>> >>>> When an agent becomes unreachable while the master is running, it
>> >>>>> sends TASK_LOST events for each task on the agent.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
>> >>>>> Marking an agent unreachable after failover does not cause
>> >>>>> TASK_LOST events.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
>> >>>>> Once an agent re-registers, it sends TASK_LOST events: the agent
>> >>>>> sends TASK_LOST for tasks that it does not know about after a
>> >>>>> master failover.
>> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
>> >>>>
>> >>>> The separate code path for markUnreachableAfterFailover appears to
>> >>>> have been added by this commit:
>> >>>> https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
>> >>>>
>> >>>> And I think this totally breaks the promise of introducing the
>> >>>> PARTITION_AWARE stuff in a backwards-compatible way.
>> >>>>
>> >>>> So right now, yes, we rely on reconciliation to finally mark the
>> >>>> tasks as LOST and reschedule their replacements.
>> >>>>
>> >>>> I think the only reason we haven't been more impacted by this at
>> >>>> Twitter is that our Mesos master is remarkably stable (compared to
>> >>>> Aurora's daily failovers).
>> >>>>
>> >>>> We have two paths forward here: push forward and embrace the new
>> >>>> partition awareness features in Aurora, and/or push back on the
>> >>>> above change with the Mesos community and have a better story for
>> >>>> non-partition-aware APIs in the short term.
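For the ticket I'm filing: the fix discussed above (act on slaveLost instead
of waiting for reconciliation) would look roughly like this at the Mesos
scheduler callback level. Note that findTaskIdsOnAgent and markLost are
hypothetical placeholders standing in for Aurora's storage lookup and state
transition, not real Aurora APIs:

import java.util.List;

import org.apache.mesos.Protos.SlaveID;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

abstract class SlaveLostSketch implements Scheduler {

  @Override
  public void slaveLost(SchedulerDriver driver, SlaveID slaveId) {
    // After a failover-triggered removal, Mesos may send slaveLost without
    // the per-task TASK_LOST updates, so transition the tasks ourselves
    // instead of waiting for the next reconciliation round.
    for (String taskId : findTaskIdsOnAgent(slaveId.getValue())) {
      markLost(taskId, "Agent " + slaveId.getValue() + " was lost");
    }
  }

  // Hypothetical helpers standing in for Aurora internals.
  protected abstract List<String> findTaskIdsOnAgent(String agentId);
  protected abstract void markLost(String taskId, String reason);
}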
>> >>>>
>> >>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <
>> >>>> meghdoot_b@yahoo.com.invalid> wrote:
>> >>>>
>> >>>>> We can reproduce it easily; the steps are:
>> >>>>> 1. Shut down the leading Mesos master.
>> >>>>> 2. Shut down an agent at the same time.
>> >>>>> 3. Wait for 10 mins.
>> >>>>>
>> >>>>> What Renan and I saw in the logs was that only agent lost was sent,
>> >>>>> not task lost. In the regular health check expiry scenario, both
>> >>>>> task lost and agent lost were sent.
>> >>>>>
>> >>>>> So yes, this is very concerning.
>> >>>>>
>> >>>>> Thx
>> >>>>>
>> >>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin wrote:
>> >>>>>>
>> >>>>>> It would be interesting to see the logs. I think that will tell
>> >>>>>> you if the Mesos master is:
>> >>>>>>
>> >>>>>> a) Sending slaveLost
>> >>>>>> b) Trying to send TASK_LOST
>> >>>>>>
>> >>>>>> And then the Scheduler logs (and/or the metrics it exports) should
>> >>>>>> tell you whether those events were received. If this is
>> >>>>>> reproducible, I'd consider it a serious bug.
>> >>>>>>
>> >>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <
>> >>>>>> meghdoot_b@yahoo.com.invalid> wrote:
>> >>>>>>
>> >>>>>>> So in this situation, why is Aurora not replacing the tasks,
>> >>>>>>> waiting instead for external recon to fix it?
>> >>>>>>>
>> >>>>>>> This is different from the case where the 75 sec (5*15) health
>> >>>>>>> check of the slave times out (no master failover): there, Aurora
>> >>>>>>> replaces it on the task lost message.
>> >>>>>>>
>> >>>>>>> Are you hinting we should ask the Mesos folks why, in the master
>> >>>>>>> failover re-registration timeout scenario, task lost is not sent
>> >>>>>>> even though slave lost is sent, when per the docs below task lost
>> >>>>>>> should have been sent?
>> >>>>>>>
>> >>>>>>> Because either Mesos is not sending the right status or Aurora is
>> >>>>>>> not handling it.
>> >>>>>>>
>> >>>>>>> Thx
>> >>>>>>>
>> >>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin wrote:
>> >>>>>>>>
>> >>>>>>>> "1. When mesos sends slave lost after 10 mins in this situation,
>> >>>>>>>> why does aurora not act on it?"
>> >>>>>>>>
>> >>>>>>>> Because Mesos also sends TASK_LOST for every task running on the
>> >>>>>>>> agent whenever it calls slaveLost:
>> >>>>>>>>
>> >>>>>>>> "When it is time to remove an agent, the master removes the
>> >>>>>>>> agent from the list of registered agents in the master's durable
>> >>>>>>>> state (this will survive master failover). The master sends a
>> >>>>>>>> slaveLost callback to every registered scheduler driver; it also
>> >>>>>>>> sends TASK_LOST status updates for every task that was running
>> >>>>>>>> on the removed agent."
>> >>>>>>>>
>> >>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <
>> >>>>>>>> meghdoot_b@yahoo.com.invalid> wrote:
>> >>>>>>>>
>> >>>>>>>>> We were investigating slave re-registration behavior on master
>> >>>>>>>>> failover in Aurora 0.17 with Mesos 1.1.
>> >>>>>>>>>
>> >>>>>>>>> A few important points:
>> >>>>>>>>>
>> >>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
>> >>>>>>>>> (If an agent does not reregister with the new master within a
>> >>>>>>>>> timeout (controlled by the --agent_reregister_timeout
>> >>>>>>>>> configuration flag), the master marks the agent as failed and
>> >>>>>>>>> follows the same steps described above.
>> >>>>>>>>> However, there is one difference: by default, agents are
>> >>>>>>>>> allowed to reconnect following master failover, even after the
>> >>>>>>>>> agent_reregister_timeout has fired. This means that frameworks
>> >>>>>>>>> might see a TASK_LOST update for a task but then later discover
>> >>>>>>>>> that the task is running, because the agent where it was
>> >>>>>>>>> running was allowed to reconnect.)
>> >>>>>>>>>
>> >>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/
>> >>>>>>>>> (Implicit reconciliation (passing an empty list) should also be
>> >>>>>>>>> used periodically, as a defense against data loss in the
>> >>>>>>>>> framework. Unless a strict registry is in use on the master,
>> >>>>>>>>> it's possible for tasks to resurrect from a LOST state (without
>> >>>>>>>>> a strict registry the master does not enforce agent removal
>> >>>>>>>>> across failovers). When an unknown task is encountered, the
>> >>>>>>>>> scheduler should kill or recover the task.)
>> >>>>>>>>>
>> >>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951
>> >>>>>>>>> (Removes the strict registry mode flag in 1.1 and reverts to
>> >>>>>>>>> the old behavior of non-strict registry mode, where tasks and
>> >>>>>>>>> executors are not killed on agent re-registration timeout after
>> >>>>>>>>> a master failover.)
>> >>>>>>>>>
>> >>>>>>>>> So, what we find if the slave does not come back after 10 mins:
>> >>>>>>>>> 1. The Mesos master sends slave lost but not task lost to
>> >>>>>>>>> Aurora.
>> >>>>>>>>> 2. Aurora does not replace the tasks.
>> >>>>>>>>> 3. Only when explicit recon starts does this get corrected,
>> >>>>>>>>> with Aurora spawning replacement tasks.
>> >>>>>>>>>
>> >>>>>>>>> If the slave restarts after 10 mins:
>> >>>>>>>>> 1. When implicit recon starts, this situation gets fixed: in
>> >>>>>>>>> Aurora the tasks are marked as lost, Mesos reports them
>> >>>>>>>>> running, and those get killed and replaced.
>> >>>>>>>>>
>> >>>>>>>>> So, questions:
>> >>>>>>>>> 1. When Mesos sends slave lost after 10 mins in this situation,
>> >>>>>>>>> why does Aurora not act on it?
>> >>>>>>>>> 2. As per the recon docs' best practices, explicit recon should
>> >>>>>>>>> start, followed by implicit recon, on master failover. It looks
>> >>>>>>>>> like Aurora is not doing that; the regular hourly recons run
>> >>>>>>>>> with a 30 min spread between explicit and implicit. Should
>> >>>>>>>>> Aurora do recon on master failover?
>> >>>>>>>>>
>> >>>>>>>>> General questions:
>> >>>>>>>>> 1. What is the effect on Aurora if we run explicit recon every
>> >>>>>>>>> 15 mins instead of the default 1 hr? Does it slow down
>> >>>>>>>>> scheduling, does snapshot creation get delayed, etc.?
>> >>>>>>>>> 2. Any issue if the spread between explicit recon and implicit
>> >>>>>>>>> recon is brought down to 2 mins from 30 mins? Probably depends
>> >>>>>>>>> on 1.
>> >>>>>>>>>
>> >>>>>>>>> Thx
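On the two general questions above: the reconciliation cadence and spread
are scheduler command-line flags (names per the TaskReconciler settings
linked in Stephan's [1]; double-check against -help on your Aurora build,
and the values below are only an example of a more aggressive schedule):

  -reconciliation_initial_delay=1mins
  -reconciliation_explicit_interval=15mins
  -reconciliation_implicit_interval=15mins
  -reconciliation_schedule_spread=2mins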