airflow-dev mailing list archives

From: siddharth anand <san...@apache.org>
Subject: Re: Airflow 2.0
Date: Mon, 21 Nov 2016 22:25:57 GMT
Sergei,
These are some great ideas -- I would classify at least half of them as
pain points.

Folks!
I suggest people (on the dev list) keep feeding this thread at least for
the next 2 days. I can then float a survey based on these ideas and give
the community a chance to vote so we can prioritize the wish list.

-s

On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin <llevar@gmail.com> wrote:

> I've been running Airflow on 1500 cores in the context of scientific
> workflows for the past year and a half. Features that would be important to
> me for 2.0:
>
> - Add an FK referencing dag_run to the task_instance table on Postgres so
> that task_instances can be uniquely attributed to dag runs.
> - Ensure scheduler can be run continuously without needing restarts. Right
> now it gets into some ill-defined bad state, forcing me to restart it
> every 20 minutes.
> - Ensure scheduler can handle tens of thousands of active workflows. Right
> now this results in extremely long scheduling times and inconsistent
> scheduling even at 2 thousand active workflows.
> - Add more flexible task scheduling prioritization. The default
> prioritization is the opposite of the behaviour I want. I would prefer that
> downstream tasks always have higher priority than upstream tasks, so that
> entire workflows tend to complete sooner rather than having tasks scheduled
> from other workflows. Having a few scheduling prioritization strategies
> would be beneficial here (see the sketch after this list).
> - Provide better support for manually-triggered DAGs on the UI i.e. by
> showing them as queued.
> - Provide some resource management capabilities via something like slots
> that can be defined on workers and occupied by tasks. Using celery's
> concurrency parameter at the airflow server level is too coarse-grained as
> it forces all workers to be the same, and does not allow proper resource
> management when different workflow tasks have different resource
> requirements thus hurting utilization (a worker could run 8 parallel tasks
> with small memory footprint, but only 1 task with large memory footprint
> for instance).
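
A minimal sketch of the downstream-first idea above, purely illustrative and
not Airflow's actual scheduler code: a pluggable ranking function over queued
task instances (the TaskInstance tuple and its depth field are invented for
the example).

    # Illustrative sketch of a "downstream-first" prioritization strategy.
    # Tasks deeper in their DAG sort first, nudging in-flight workflows toward
    # completion before tasks from other workflows are picked up.
    from collections import namedtuple

    TaskInstance = namedtuple("TaskInstance",
                              ["dag_id", "task_id", "depth", "queued_at"])

    def downstream_first(queued):
        """Rank deeper (more downstream) tasks first; break ties by queue time."""
        return sorted(queued, key=lambda ti: (-ti.depth, ti.queued_at))

    queued = [
        TaskInstance("dag_a", "extract", depth=0, queued_at=1),
        TaskInstance("dag_b", "load", depth=2, queued_at=2),
        TaskInstance("dag_a", "transform", depth=1, queued_at=3),
    ]
    print([ti.task_id for ti in downstream_first(queued)])
    # -> ['load', 'transform', 'extract']
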
>
> With best regards,
>
> Sergei.
>
>
> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <ext-pavlo.ryabchuk@here.com>
> wrote:
>
> > -1. We rely heavily on data profiling as a pipeline health monitoring
> > tool.
> >
> > -----Original Message-----
> > From: Chris Riccomini [mailto:criccomini@apache.org]
> > Sent: Saturday, November 19, 2016 1:57 AM
> > To: dev@airflow.incubator.apache.org
> > Subject: Re: Airflow 2.0
> >
> > > RIP out the charting application and the data profiler
> >
> > Yes please! +1
> >
> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> > maximebeauchemin@gmail.com> wrote:
> > > Another point that may be controversial for Airflow 2.0: RIP out the
> > > charting application and the data profiler. Even though it's nice to
> > > have it there, it's just out of scope and has major security
> > > issues/implications.
> > >
> > > I'm not sure how popular it actually is. We may need to run a survey
> > > at some point around these kinds of questions.
> > >
> > > Max
> > >
> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> > > maximebeauchemin@gmail.com> wrote:
> > >
> > >> Using FAB's Model, we get pretty much all of that (REST API,
> > >> auth/perms,
> > >> CRUD) for free:
> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> > >>
> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
> > >> for Superset/Caravel.
> > >>
> > >> All that's needed is to derive from FAB's model class instead of
> > >> SQLAlchemy's model class (which FAB's model wraps and adds
> > >> functionality to, and is 100% compatible with, AFAICT).
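
A minimal sketch of what deriving from FAB's Model buys, assuming a toy model
(the class, table, and column names below are invented; only the FAB imports
and the ModelView mechanics are real):

    # Illustrative only, not actual Airflow code. Registering the ModelView
    # yields list/show/add/edit/delete pages, CRUD REST endpoints, and
    # auto-created can_list/can_show/can_add/... permissions.
    from flask_appbuilder import Model, ModelView
    from flask_appbuilder.models.sqla.interface import SQLAInterface
    from sqlalchemy import Column, Integer, String

    class JobRecord(Model):
        __tablename__ = "job_record"          # hypothetical table
        id = Column(Integer, primary_key=True)
        dag_id = Column(String(250), nullable=False)
        state = Column(String(50))

    class JobRecordView(ModelView):
        datamodel = SQLAInterface(JobRecord)  # wraps the SQLAlchemy model
        list_columns = ["dag_id", "state"]

    # elsewhere, on the app's AppBuilder instance:
    # appbuilder.add_view(JobRecordView, "Job Records", category="Browse")
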
> > >>
> > >> Max
> > >>
> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> > >> <criccomini@apache.org>
> > >> wrote:
> > >>
> > >>> > It may be doable to run this as a different package
> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually rip
> > >>> > out the old UI off of the main package.
> > >>>
> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
> > >>> can build the new UI in parallel, and then delete the old one later.
> > >>> I really think that a REST interface should be a pre-req to any
> > >>> large/new UI changes, though. Getting unified so that everything is
> > >>> driven through REST will be a big win.
> > >>>
> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> > >>> <maximebeauchemin@gmail.com> wrote:
> > >>> > A multi-tenant UI with composable roles on top of granular
> > >>> > permissions.
> > >>> >
> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
> > >>> > easy-ish win (since they're both Flask). FAB provides a good
> > >>> > authentication and permission model that ships out-of-the-box with
> > >>> > a REST API. It suffices to define FAB models (derivatives of
> > >>> > SQLAlchemy's model) and you get a set of perms for the model
> > >>> > (can_show, can_list, can_add, can_change, can_delete, ...) and a
> > >>> > set of CRUD REST endpoints. It would also allow us to rip the
> > >>> > authentication backend code out of Airflow and rely on FAB for
> > >>> > that. Also, every single view gets permissions auto-created for it,
> > >>> > and there are easy ways to define row-level type filters based on
> > >>> > user permissions.
> > >>> >
> > >>> > It may be doable to run this as a different package
> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually rip
> > >>> > out the old UI off of the main package.
> > >>> >
> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
> > >>> >
> > >>> > I'd love to carve some time and lead this.
> > >>> >
> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini <criccomini@apache.org> wrote:
> > >>> >
> > >>> >> Full-fledged REST API (that the UI also uses) would be great in 2.0.
> > >>> >>
> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <kegs@b23.io> wrote:
> > >>> >> > Hi All,
> > >>> >> >
> > >>> >> > We have been using Airflow heavily for the last couple of months
> > >>> >> > and it’s been great so far. Here are a few things we’d like to see
> > >>> >> > prioritized in 2.0.
> > >>> >> >
> > >>> >> > 1) Role based access to DAGs:
> > >>> >> > We would like to see better role based access through the UI.
> > >>> >> > There’s a related ticket out there but it hasn’t seen any action in
> > >>> >> > a few months:
> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
> > >>> >> >
> > >>> >> > We use a templating system to create/deploy DAGs dynamically based
> > >>> >> > on some directory/file structure. This allows analysts to quickly
> > >>> >> > deploy and schedule their ETL code without having to interact with
> > >>> >> > the Airflow installation directly. It would be great if those same
> > >>> >> > analysts could access their own DAGs in the UI so that they can
> > >>> >> > clear DAG runs, mark success, etc. while keeping them away from our
> > >>> >> > core ETL and other people's/organizations' DAGs. Some of this can be
> > >>> >> > accomplished with ‘filter by owner’ but it doesn’t address the use
> > >>> >> > case where a DAG can be maintained by multiple users in the same
> > >>> >> > organization when they have separate Airflow user accounts.
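
A rough sketch of the kind of templated DAG generation described above,
assuming a hypothetical directory of per-owner YAML configs (the paths, keys,
and config shape are invented for illustration):

    # Placed in the DAGs folder; generates one DAG per config file so analysts
    # can ship ETL by dropping a YAML file rather than editing Airflow code.
    import glob
    from datetime import datetime

    import yaml
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    for path in glob.glob("/etc/airflow/dag_configs/*.yaml"):
        with open(path) as f:
            cfg = yaml.safe_load(f)
        dag = DAG(
            dag_id=cfg["dag_id"],
            default_args={"owner": cfg["owner"],
                          "start_date": datetime(2016, 1, 1)},
            schedule_interval=cfg.get("schedule", "@daily"),
        )
        BashOperator(task_id="run_etl", bash_command=cfg["command"], dag=dag)
        globals()[cfg["dag_id"]] = dag  # expose so the DagBag picks it up
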
> > >>> >> >
> > >>> >> > 2) An option to turn off backfill:
> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
> > >>> >> > For cases where a DAG does an insert overwrite on a table every
> > >>> >> > day. This might be a realistic option for the current version but
> > >>> >> > I just wanted to call attention to this feature request.
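
Purely for illustration, one shape such a switch could take at the DAG level;
the catchup flag shown here comes from later Airflow releases and stands in
for the behaviour AIRFLOW-558 asks for, not for anything in the current one:

    # Hypothetical-for-now sketch: skip backfilling past schedule intervals
    # for a DAG that just does a daily insert-overwrite of a table.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(dag_id="daily_insert_overwrite",
              start_date=datetime(2016, 1, 1),
              schedule_interval="@daily",
              catchup=False)  # only the most recent interval gets scheduled

    BashOperator(task_id="overwrite_table",
                 bash_command="echo 'INSERT OVERWRITE ...'",
                 dag=dag)
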
> > >>> >> >
> > >>> >> > Best,
> > >>> >> > David
> > >>> >> >
> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
> > >>> >> >
> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
> > >>> >> >
> > >>> >> > I wanted to share some ideas around what I would like to do in
> > >>> >> > Airflow 2.0 and would love to hear what others are thinking. I'll
> > >>> >> > compile the ideas that are shared in this thread in a Wiki once the
> > >>> >> > conversation fades.
> > >>> >> >
> > >>> >> > -------------------------------------------
> > >>> >> >
> > >>> >> > First idea, to get the conversation started:
> > >>> >> >
> > >>> >> > *Breaking down the package*
> > >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver
> > >>> >> > airflow-operators-googlecloud ...`
> > >>> >> >
> > >>> >> > It seems to me like we're getting to a point where having
> > >>> >> > different repositories and different packages would make things
> > >>> >> > much easier in all sorts of ways. For instance the web server is a
> > >>> >> > lot less sensitive than the scheduler, and changes to operators
> > >>> >> > should/could be deployed at will, independently from the main
> > >>> >> > package. People in their environment could upgrade only certain
> > >>> >> > packages when needed. Travis builds would be more targeted, and
> > >>> >> > take less time, ...
> > >>> >> >
> > >>> >> > Also, the whole current "extras_require" approach to optional
> > >>> >> > dependencies (in setup.py) is kind of getting out of hand.
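
A bare-bones sketch of what one of the split sub-packages' setup.py could look
like; the package names and dependency list are invented for illustration:

    # Hypothetical setup.py for a stand-alone operator package that depends
    # only on a shared core, so it can be released on its own cadence.
    from setuptools import setup, find_packages

    setup(
        name="airflow-operators-googlecloud",
        version="0.1.0",
        packages=find_packages(),       # e.g. airflow_operators_googlecloud/
        install_requires=[
            "airflow-common",           # hypothetical shared core package
            "google-api-python-client",
        ],
    )
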
> > >>> >> >
> > >>> >> > Of course `pip install airflow` would bring in a collection of
> > >>> >> > sub-packages similar in functionality to what it does now, perhaps
> > >>> >> > without the many operators you probably don't need in your
> > >>> >> > environment.
> > >>> >> >
> > >>> >> > The release process is the main pain-point and the biggest risk
> > >>> >> > for the project, and I feel like this is a solid solution to
> > >>> >> > address it.
> > >>> >> >
> > >>> >> > Max
> > >>> >> >
> > >>> >>
> > >>>
> > >>
> > >>
> >
> --
>
> Sergei
>
