From: Kevin Lam
Date: Wed, 12 Sep 2018 16:08:33 -0400
Subject: Making Airflow Fault-Tolerant when running Airflow on Kubernetes
To: dev@airflow.incubator.apache.org

Hi all,

We currently run Airflow as a
Deployment in a Kubernetes cluster. We also use a variant of KubernetesOperator to run our DAGs. We are investigating how best to make Airflow fault-tolerant, in part because we are looking into using preemptible VMs [1].

*Has there been much discussion about how to deploy Airflow in a fault-tolerant way? Are there any best practices? Ideally, we'd like our Kubernetes-hosted Airflow to support rolling updates for Docker image updates and also to recover from components (worker, scheduler, web) going down temporarily, including when DAGs are in flight.*

Any advice, ideas, and/or feedback appreciated!

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms