Subject: Re: Automatic DAGs deployment
From: Gaetan Semet
To: dev@airflow.incubator.apache.org
Date: Tue, 07 Nov 2017 21:04:38 -0000

Hi,

I agree Airflow at least needs some kind of "best practices" guide or an
official handbook for deploying DAGs. I used to work a lot on Buildbot,
and these issues remind me a little of the good old times :)

Why not keep a copy of the DAG in memory or in the database once the
scheduler has started executing it? That way, no matter what happens on
disk, the current execution always uses the same version. It would mean
every executor and the scheduler keep a small cache holding at least
every DAG currently being executed. If the code is tracked in git, this
is even easier, since git already gives us that database.

So when a DAG run starts:
- each potential executor and the scheduler receive an order to "keep
  this reference in cache"; if DAGs are in git, just keep the sha1 of
  the file
- each time a step is performed, read the file on disk as usual, and if
  it has changed (different sha1 with git, different hashes without),
  retrieve the pinned content from the cache instead (this is the small
  tricky part of this proposal; see the sketch below)
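To make that concrete, here is a minimal sketch of the pinning idea.
Nothing here is an existing Airflow API; CACHE_DIR, pin_dag_version and
resolve_dag_file are hypothetical names:

    import hashlib
    import shutil
    from pathlib import Path

    # Hypothetical cache location; nothing like this exists in Airflow today.
    CACHE_DIR = Path("/var/lib/airflow/dag_cache")

    def file_sha1(dag_file: Path) -> str:
        """Plain content hash of the DAG file (a git blob sha1 would
        serve the same purpose of identifying a version)."""
        return hashlib.sha1(dag_file.read_bytes()).hexdigest()

    def pin_dag_version(dag_file: Path) -> str:
        """Called when a DAG run starts: snapshot the file into the
        cache, keyed by its content hash, and return that hash."""
        sha1 = file_sha1(dag_file)
        snapshot = CACHE_DIR / f"{dag_file.stem}-{sha1}.py"
        if not snapshot.exists():
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            shutil.copy2(dag_file, snapshot)
        return sha1

    def resolve_dag_file(dag_file: Path, pinned_sha1: str) -> Path:
        """Called before each task: use the on-disk file only if it
        still matches the pinned hash, otherwise fall back to the
        snapshot taken at run start."""
        if dag_file.exists() and file_sha1(dag_file) == pinned_sha1:
            return dag_file
        return CACHE_DIR / f"{dag_file.stem}-{pinned_sha1}.py"

The DAG run record could carry the pinned sha1, so a worker picking up
a task on another machine resolves the same snapshot (assuming the
cache lives on shared storage or in the metadata DB).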
What do you think about this?

Gaetan

On 2017-11-07 15:30, Grant Nicholas wrote:
> +1
>
> Long term, it would be awesome if Airflow supported upgrades of
> in-flight DAGs with a hashing/versioning setup.
>
> But as a first step, it would be good to document how we want people
> to upgrade DAGs (or at least add a disclaimer describing the pitfalls).
>
> On Nov 6, 2017 3:08 PM, "Daniel Imberman" wrote:
>
>> +1 for this conversation.
>>
>> I know that most production Airflow instances basically just have a
>> policy of "don't update the DAG files while a job is running."
>>
>> One thing that is difficult with this, however, is that for the
>> CeleryExecutor and the KubernetesExecutor we don't really have any
>> power over the DAG refreshes. If you're storing your DAGs in S3 or
>> NFS, we can't stop or trigger a refresh of the DAGs. I'd be
>> interested to see what others have done for this and whether there's
>> anything we can do to standardize it.
>>
>> On Mon, Nov 6, 2017 at 12:34 PM Gaetan Semet wrote:
>>
>>> Hello,
>>>
>>> I am working with Airflow to see how we can use it in my company,
>>> and I volunteer to help if you need a hand on some parts. I used to
>>> work a lot with Python and Twisted, but real, distributed scheduling
>>> is a new sport for me.
>>>
>>> I see that deploying DAGs regularly is not as easy as one might
>>> imagine. I started playing with git-sync, and apparently it is not
>>> recommended in production, since it can lead to an incoherent state
>>> if the scheduler is refreshed in the middle of an execution. But
>>> DAGs live on and can be updated by users, so I think Airflow needs a
>>> way to allow automatic refresh of the DAGs without having to stop
>>> the scheduler.
>>>
>>> Does anyone already work on this, or is there a set of JIRA tickets
>>> covering this issue so I can start on it?
>>>
>>> Best Regards,
>>> Gaetan Semet
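The incoherent state mentioned above comes from git-sync rewriting
files in place while they are being read. One common mitigation,
sketched below under assumed paths (/opt/airflow/releases for the
versioned checkouts, a dags symlink as the dags_folder), is to sync
each revision into its own directory and switch a symlink atomically,
so the scheduler and executors always see a complete tree:

    import os
    import shutil
    from pathlib import Path

    # Assumed layout: one immutable directory per synced revision.
    DEPLOY_ROOT = Path("/opt/airflow/releases")
    # dags_folder points here; must already be a symlink (or absent).
    DAGS_LINK = Path("/opt/airflow/dags")

    def deploy_revision(source: Path, sha1: str) -> None:
        """Copy a synced checkout into a versioned directory, then
        swap the dags symlink to it in one atomic step."""
        target = DEPLOY_ROOT / sha1
        if not target.exists():
            shutil.copytree(source, target)
        tmp = Path(str(DAGS_LINK) + ".tmp")
        if tmp.is_symlink():
            tmp.unlink()
        tmp.symlink_to(target, target_is_directory=True)
        # rename(2) is atomic on POSIX: readers see the old tree or
        # the new one, never a mixture of both.
        os.replace(tmp, DAGS_LINK)

After deploy_revision(checkout, sha1), anything re-reading the dags
folder picks up the new revision in one step; combined with the
pinning sketch above, in-flight runs would keep their old version.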