airflow-dev mailing list archives

From Ben Tallman <btall...@gmail.com>
Subject Re: AIP-12 Persist DAG into DB
Date Fri, 01 Feb 2019 17:42:42 GMT
In my experience, there are two major wins to chase here. Neither is
simple, nor is this the first discussion around them. In the past there was
an attempt to use pickling to handle these challenges.

The first is that with dynamic Dags (they are evaluated as Python code,
after all), each DagRun of a Dag can be different, either slightly or
completely. This is a very powerful concept, but it currently breaks in
practice: the Dag itself is re-evaluated every time it is used, and
therefore its definition needs to stay stable for the duration of a
DagRun. I believe it would be a huge win if the DagRun itself were stable
from the time it starts until its completion, across the whole cluster,
and then even into the history of runs in the webserver.
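
To illustrate the idea of a DagRun that stays stable from start to
completion, here is a minimal sketch in plain Python. It is not Airflow
code: SnapshotStore, serialize_dag and the task dictionary layout are
hypothetical names invented for this example, and the in-memory dicts
stand in for tables in the metadata database. The point is only that the
Dag structure is serialized once when a run starts, and every later
lookup for that run reads the pinned snapshot instead of re-evaluating
the Dag file.

```python
import hashlib
import json


def serialize_dag(dag_id, tasks):
    """Serialize a Dag structure to a deterministic JSON string.

    `tasks` maps task_id -> {"operator": str, "upstream": [task_id, ...]}.
    This layout is an assumption made for the sketch.
    """
    payload = {
        "dag_id": dag_id,
        "tasks": {
            task_id: {
                "operator": spec["operator"],
                "upstream": sorted(spec["upstream"]),
            }
            for task_id, spec in tasks.items()
        },
    }
    # sort_keys makes the output (and therefore the hash) deterministic.
    return json.dumps(payload, sort_keys=True)


def fingerprint(serialized):
    """Content hash used as the version identifier of a snapshot."""
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


class SnapshotStore:
    """In-memory stand-in for snapshot tables in a metadata database."""

    def __init__(self):
        self._snapshots = {}        # version -> serialized Dag
        self._run_to_version = {}   # (dag_id, run_id) -> version

    def start_run(self, dag_id, run_id, tasks):
        """Pin a run to the Dag as it looks right now; return the version."""
        serialized = serialize_dag(dag_id, tasks)
        version = fingerprint(serialized)
        self._snapshots.setdefault(version, serialized)
        # A run is pinned only once; later calls do not move it.
        return self._run_to_version.setdefault((dag_id, run_id), version)

    def dag_for_run(self, dag_id, run_id):
        """Return the Dag structure exactly as it was when the run started."""
        version = self._run_to_version[(dag_id, run_id)]
        return json.loads(self._snapshots[version])
```

With something like this, a scheduler, worker or webserver asking for the
Dag of a given run would all see the same structure, even if the Dag file
has changed since the run began.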

The second win to chase is a bit different, and deals with the history of
DagRuns. Specifically, what happens to history (logs, results, etc.) when a
Dag is re-run, either because an error has been corrected, or because the
user has changed the Dag and decides to backfill? In that case, I believe
that being able to see the history of a Dag's runs for a particular
schedule is hugely valuable, both for retaining history (chain of
custody/audit-like reasons) and for seeing change over time and
tracking statistics.
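
As a rough sketch of what keeping that history could look like, the
snippet below records every attempt for a given schedule slot in an
append-only list instead of overwriting the previous one. Again this is
not Airflow code: RunAttempt, RunHistory and their fields (dag_version,
state, log_path) are hypothetical names chosen for the example, and the
dict stands in for a database table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunAttempt:
    """One attempt at running a Dag for a given schedule slot."""
    attempt: int        # 1-based attempt number within the slot
    dag_version: str    # identifier of the Dag definition that was used
    state: str          # e.g. "failed" or "success"
    log_path: str       # where the logs of this attempt are kept


class RunHistory:
    """Append-only history keyed by (dag_id, execution_date)."""

    def __init__(self):
        self._attempts = defaultdict(list)

    def record(self, dag_id, execution_date, dag_version, state, log_path):
        """Append a new attempt; earlier attempts are never modified."""
        key = (dag_id, execution_date)
        attempt = RunAttempt(
            attempt=len(self._attempts[key]) + 1,
            dag_version=dag_version,
            state=state,
            log_path=log_path,
        )
        self._attempts[key].append(attempt)
        return attempt

    def history(self, dag_id, execution_date):
        """All attempts for the slot, oldest first."""
        return list(self._attempts.get((dag_id, execution_date), []))

    def latest(self, dag_id, execution_date):
        """The most recent attempt, or None if the slot has never run."""
        attempts = self._attempts.get((dag_id, execution_date))
        return attempts[-1] if attempts else None
```

A re-run or backfill would then call record() again for the same slot,
so the failed attempt, its logs and the Dag version it used remain
available for audit and for statistics over time.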

Just my few cents.

Thanks,
Ben

--
Ben Tallman - 503.680.5709


On Thu, Jan 31, 2019 at 10:12 PM Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> Right, it's been discussed extensively in the past and the main thing
> needed to get to a "stateless web server" (or at least a DagBag-free web
> server) is to drop the template rendering in the UI. Also we might need
> little workarounds (we'd have to dig in to check) around deleting task
> instances or force-running tasks, nothing major I think.
>
> Also the scheduler (think of it as a "supervisor", as this specific
> workload has nothing to do with scheduling), would need to serialize the
> DAGs periodically, likely to the database, so that the web server can get
> freshly serialized metadata from the database during the scope of web
> requests.
>
> Max
>
> On Thu, Jan 31, 2019 at 9:28 AM Dan Davydov <ddavydov@twitter.com.invalid>
> wrote:
>
> > Agreed on complexities (I think deprecating Jinja templates for webserver
> > rendering is one thing), but I'm not sure I understand on the falling down
> > on code changes part, mind providing an example?
> >
> > On Thu, Jan 31, 2019 at 12:22 PM Ash Berlin-Taylor <ash@apache.org> wrote:
> >
> > > That sounds like a good idea at first, but falls down with possible code
> > > changes in operators between one task and the next.
> > >
> > > (I would like this, but there are definite complexities)
> > >
> > > -ash
> > >
> > >
> > > On 31 January 2019 16:56:54 GMT, Dan Davydov <ddavydov@twitter.com.INVALID>
> > > wrote:
> > > >I feel the right higher-level solution to this problem (which is "Adding
> > > >Consistency to Airflow") is DAG serialization, that is all DAGs should be
> > > >represented as e.g. JSON (similar to the current SimpleDAGBag object used
> > > >by the Scheduler). This solves the webserver issue, and also adds
> > > >consistency between Scheduler/Workers (all DAGruns can be ensured to run at
> > > >the same version of a DAG instead of whatever happens to live on the worker
> > > >at the time).
> > > >
> > > >On Thu, Jan 31, 2019 at 9:44 AM Peter van ‘t Hof <
> > > >petervanthof@godatadriven.com> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > > >> As most of you know, Airflow has an issue when loading new DAGs, where
> > > > >> the webserver sometimes sees them and sometimes does not.
> > > > >> Because of this, we wrote this AIP to solve the issue:
> > > >>
> > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-12+Persist+DAG+into+DB
> > > >>
> > > >> Any feedback is welcome.
> > > >>
> > > >> Gr,
> > > >> Peter van 't Hof
> > > >> Big Data Engineer
> > > >>
> > > >> GoDataDriven
> > > >> Wibautstraat 202
> > > >> 1091 GS Amsterdam
> > > >> https://godatadriven.com
> > > >>
> > > >>
> > >
> >
>
