From: Kevin Lam
Date: Wed, 12 Sep 2018 17:35:32 -0400
Subject: Re: Making Airflow Fault-Tolerant when running Airflow on Kubernetes
To: dev@airflow.incubator.apache.org

Hi Daniel,

Thanks for the reply! No, we haven't looked too deeply into it. Can you elaborate a bit on how that works? With the KubernetesExecutor, if a DAG is in flight and part of Airflow goes down, will it be able to recover? How do Airflow workers reconnect to Pods that were in flight?

On Wed, Sep 12, 2018 at 4:59 PM Daniel Imberman wrote:

> Hi Kevin,
>
> Have you looked into the KubernetesExecutor? We achieve fault tolerance
> using the Kubernetes resourceVersion to ensure that all state is
> reproducible.
>
> On Wed, Sep 12, 2018 at 1:08 PM Kevin Lam wrote:
>
> > Hi all,
> >
> > We currently run Airflow as a Deployment in a Kubernetes cluster. We also
> > use a variant of KubernetesOperator to run our DAGs.
> >
> > We are investigating how best to make Airflow fault-tolerant, in part
> > because we are investigating the use of preemptible VMs [1]. *Has there
> > been much discussion about how to deploy Airflow in a fault-tolerant way?
> > Are there any best practices? Ideally we'd like our Kubernetes-hosted
> > Airflow to support rolling updates for Docker image updates and also
> > recover from components (worker, scheduler, web) going down temporarily,
> > including when DAGs are in flight.*
> >
> > Any advice, ideas and/or feedback appreciated!
> >
> > [1] https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms