Subject: Re: Automatic DAGs deployment
From: Gaetan Semet
To: dev@airflow.incubator.apache.org
Date: Tue, 07 Nov 2017 21:04:38 -0000

Hi,

I agree Airflow at least needs some kind of "best practices" guide or an
official handbook for deploying DAGs. I used to work a lot on Buildbot,
and these issues remind me a little of the good old times :)

Why not keep a copy of the DAG in memory or in the database once the
scheduler has started executing it? That way, no matter what happens on
disk, the current execution always uses the same version. It would mean
every executor and the scheduler keep a small cache holding at least
every DAG currently being executed. If the code is tracked in git, this
is even easier, since git already gives us that database.

So when a DAG run starts:
- each potential executor and the scheduler receive an order to "keep
  this reference in cache"; if DAGs are in git, just keep the sha1 of
  the file
- each time a step is performed, read the file on disk as usual, and if
  it has changed (different sha1 with git, different hashes without),
  retrieve the pinned content from the cache instead (this is the small
  tricky part of this proposal; see the sketch below)
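To make that concrete, here is a minimal sketch of the pinning idea.
Nothing here is an existing Airflow API; CACHE_DIR, pin_dag_version and
resolve_dag_file are hypothetical names:

    import hashlib
    import shutil
    from pathlib import Path

    # Hypothetical cache location; nothing like this exists in Airflow today.
    CACHE_DIR = Path("/var/lib/airflow/dag_cache")

    def file_sha1(dag_file: Path) -> str:
        """Plain content hash of the DAG file (a git blob sha1 would
        serve the same purpose of identifying a version)."""
        return hashlib.sha1(dag_file.read_bytes()).hexdigest()

    def pin_dag_version(dag_file: Path) -> str:
        """Called when a DAG run starts: snapshot the file into the
        cache, keyed by its content hash, and return that hash."""
        sha1 = file_sha1(dag_file)
        snapshot = CACHE_DIR / f"{dag_file.stem}-{sha1}.py"
        if not snapshot.exists():
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            shutil.copy2(dag_file, snapshot)
        return sha1

    def resolve_dag_file(dag_file: Path, pinned_sha1: str) -> Path:
        """Called before each task: use the on-disk file only if it
        still matches the pinned hash, otherwise fall back to the
        snapshot taken at run start."""
        if dag_file.exists() and file_sha1(dag_file) == pinned_sha1:
            return dag_file
        return CACHE_DIR / f"{dag_file.stem}-{pinned_sha1}.py"

The DAG run record could carry the pinned sha1, so a worker picking up
a task on another machine resolves the same snapshot (assuming the
cache lives on shared storage or in the metadata DB).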
What do you think about this?

Gaetan

On 2017-11-07 15:30, Grant Nicholas wrote:
> +1
>
> Long term, it would be awesome if Airflow supported upgrades of
> in-flight DAGs with a hashing/versioning setup.
>
> But as a first step, it would be good to document how we want people
> to upgrade DAGs (or at least add a disclaimer describing the pitfalls).
>
> On Nov 6, 2017 3:08 PM, "Daniel Imberman" wrote:
>
>> +1 for this conversation.
>>
>> I know that most production Airflow instances basically just have a
>> policy of "don't update the DAG files while a job is running."
>>
>> One thing that is difficult with this, however, is that for the
>> CeleryExecutor and the KubernetesExecutor we don't really have any
>> power over the DAG refreshes. If you're storing your DAGs in S3 or
>> NFS, we can't stop or trigger a refresh of the DAGs. I'd be
>> interested to see what others have done for this and whether there's
>> anything we can do to standardize it.
>>
>> On Mon, Nov 6, 2017 at 12:34 PM Gaetan Semet wrote:
>>
>>> Hello,
>>>
>>> I am working with Airflow to see how we can use it in my company,
>>> and I volunteer to help if you need a hand on some parts. I used to
>>> work a lot with Python and Twisted, but real, distributed scheduling
>>> is a new sport for me.
>>>
>>> I see that deploying DAGs regularly is not as easy as one might
>>> imagine. I started playing with git-sync, and apparently it is not
>>> recommended in production, since it can lead to an incoherent state
>>> if the scheduler is refreshed in the middle of an execution. But
>>> DAGs live on and can be updated by users, so I think Airflow needs a
>>> way to allow automatic refresh of the DAGs without having to stop
>>> the scheduler.
>>>
>>> Does anyone already work on this, or is there a set of JIRA tickets
>>> covering this issue so I can start on it?
>>>
>>> Best Regards,
>>> Gaetan Semet
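The incoherent state mentioned above comes from git-sync rewriting
files in place while they are being read. One common mitigation,
sketched below under assumed paths (/opt/airflow/releases for the
versioned checkouts, a dags symlink as the dags_folder), is to sync
each revision into its own directory and switch a symlink atomically,
so the scheduler and executors always see a complete tree:

    import os
    import shutil
    from pathlib import Path

    # Assumed layout: one immutable directory per synced revision.
    DEPLOY_ROOT = Path("/opt/airflow/releases")
    # dags_folder points here; must already be a symlink (or absent).
    DAGS_LINK = Path("/opt/airflow/dags")

    def deploy_revision(source: Path, sha1: str) -> None:
        """Copy a synced checkout into a versioned directory, then
        swap the dags symlink to it in one atomic step."""
        target = DEPLOY_ROOT / sha1
        if not target.exists():
            shutil.copytree(source, target)
        tmp = Path(str(DAGS_LINK) + ".tmp")
        if tmp.is_symlink():
            tmp.unlink()
        tmp.symlink_to(target, target_is_directory=True)
        # rename(2) is atomic on POSIX: readers see the old tree or
        # the new one, never a mixture of both.
        os.replace(tmp, DAGS_LINK)

After deploy_revision(checkout, sha1), anything re-reading the dags
folder picks up the new revision in one step; combined with the
pinning sketch above, in-flight runs would keep their old version.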