airflow-dev mailing list archives

From Dan Davydov <ddavy...@twitter.com.INVALID>
Subject Re: AIP-12 Persist DAG into DB
Date Fri, 01 Feb 2019 18:05:42 GMT
@Max
What I've been thinking about recently is creating an abstraction for the
serialization process. In general I think it makes sense, e.g. for dynamic
DAGs, to have a service that periodically serializes DAGs and uploads them
to e.g. a database via some new Airflow DAG Uploader Service. There should
be proper support for authentication models for this DB/wrapping service.
You would also potentially get the ability for users to submit ad-hoc DAGs
to the production server this way (instead of needing a custom devel
instance).
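[Editor's note: to make the uploader idea concrete, here is a minimal sketch in plain Python. All names (`serialize_dag`, `upload_dag`, the `serialized_dag` table) are hypothetical illustrations, not Airflow's actual models or schema; a real service would serialize full task/operator state, not just IDs and edges.]

```python
import json
import sqlite3

# Hypothetical stand-ins: a DAG is flattened to its id, task ids, and
# dependency edges. Airflow's real DAG objects carry far more state.
def serialize_dag(dag_id, tasks, deps):
    """Flatten a DAG into a JSON document: task list plus edges."""
    return json.dumps({
        "dag_id": dag_id,
        "tasks": sorted(tasks),
        "dependencies": sorted(deps),  # [upstream, downstream] pairs
    })

def upload_dag(conn, dag_id, blob):
    """Upsert the serialized DAG into a hypothetical 'serialized_dag' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS serialized_dag "
        "(dag_id TEXT PRIMARY KEY, data TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag VALUES (?, ?)",
        (dag_id, blob),
    )
    conn.commit()

# The uploader service would run this loop periodically for each DAG file;
# the webserver would then read only from the table, never the DAG files.
conn = sqlite3.connect(":memory:")
blob = serialize_dag("example_dag", ["extract", "load"], [["extract", "load"]])
upload_dag(conn, "example_dag", blob)
row = conn.execute(
    "SELECT data FROM serialized_dag WHERE dag_id = ?", ("example_dag",)
).fetchone()
print(json.loads(row[0])["tasks"])  # → ['extract', 'load']
```

An upsert keyed on `dag_id` means re-serializing is idempotent, and the authentication question Dan raises would sit in front of the `upload_dag` path.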

On Fri, Feb 1, 2019 at 12:43 PM Ben Tallman <btallman@gmail.com> wrote:

> In my experience, there are two major wins to chase here. Neither are
> simple, nor is this the first discussion around them. In the past there was
> an attempt to use Pickling to handle these challenges.
>
> The first is that with dynamic dags (they are evaluated as python code
> after all), it is possible that each DagRun of a Dag is different, either
> slightly or completely. This is a very powerful concept, but currently
> basically breaks, as the Dag itself is re-evaluated every time it is used,
> and therefore needs to be quite stable during a DagRun. I believe it would
> be a huge win if the DagRun itself were stable from the time it starts
> until its completion, across the whole cluster, and then even into the
> history of runs in the webserver.
>
> The second win to chase is a bit different, and deals with the history of
> DagRuns. Specifically, what happens to history (logs, results, etc) when a
> Dag is re-run, either because of an error that has been corrected, or
> because the user has changed the Dag and decides to backfill. In that case,
> I believe that being able to see the history of a Dag's run in a particular
> schedule is hugely valuable, both for retaining history (chain of
> custody/audit like reasons), as well as seeing change over time and
> tracking statistics.
>
> Just my few cents.
>
> Thanks,
> Ben
>
> --
> Ben Tallman - 503.680.5709
>
>
> On Thu, Jan 31, 2019 at 10:12 PM Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > Right, it's been discussed extensively in the past and the main thing
> > needed to get to a "stateless web server" (or at least a DagBag-free web
> > server) is to drop the template rendering in the UI. We might also need
> > small workarounds (we'd have to dig in to check) around deleting task
> > instances or force-running tasks; nothing major, I think.
> >
> > Also the scheduler (think of it as a "supervisor", as this specific
> > workload has nothing to do with scheduling), would need to serialize the
> > DAGs periodically, likely to the database, so that the web server can get
> > freshly serialized metadata from the database during the scope of web
> > requests.
> >
> > Max
> >
> > On Thu, Jan 31, 2019 at 9:28 AM Dan Davydov <ddavydov@twitter.com.invalid
> >
> > wrote:
> >
> > > Agreed on complexities (I think deprecating Jinja templates for
> > > webserver rendering is one thing), but I'm not sure I understand the
> > > part about falling down on code changes; mind providing an example?
> > >
> > > On Thu, Jan 31, 2019 at 12:22 PM Ash Berlin-Taylor <ash@apache.org>
> > wrote:
> > >
> > > > That sounds like a good idea at first, but falls down with possible
> > code
> > > > changes in operators between one task and the next.
> > > >
> > > > (I would like this, but there are definite complexities)
> > > >
> > > > -ash
> > > >
> > > >
> > > > On 31 January 2019 16:56:54 GMT, Dan Davydov
> > > <ddavydov@twitter.com.INVALID>
> > > > wrote:
> > > > >I feel the right higher-level solution to this problem (which is
> > > > >"Adding Consistency to Airflow") is DAG serialization: all DAGs
> > > > >should be represented as e.g. JSON (similar to the current
> > > > >SimpleDAGBag object used by the Scheduler). This solves the
> > > > >webserver issue, and also adds consistency between the Scheduler
> > > > >and Workers (all DagRuns can be ensured to run at the same version
> > > > >of a DAG instead of whatever happens to live on the worker at the
> > > > >time).
> > > > >
> > > > >On Thu, Jan 31, 2019 at 9:44 AM Peter van ‘t Hof <
> > > > >petervanthof@godatadriven.com> wrote:
> > > > >
> > > > >> Hi All,
> > > > >>
> > > > >> As most of you know, Airflow has an issue when loading new DAGs:
> > > > >> the webserver sometimes sees them and sometimes not.
> > > > >> Because of this, we wrote this AIP to solve the issue:
> > > > >>
> > > > >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-12+Persist+DAG+into+DB
> > > > >>
> > > > >> Any feedback is welcome.
> > > > >>
> > > > >> Gr,
> > > > >> Peter van 't Hof
> > > > >> Big Data Engineer
> > > > >>
> > > > >> GoDataDriven
> > > > >> Wibautstraat 202
> > > > >> 1091 GS Amsterdam
> > > > >> https://godatadriven.com
> > > > >>
