airflow-dev mailing list archives

From Chris Riccomini <criccom...@apache.org>
Subject Re: Airflow 2.0
Date Mon, 21 Nov 2016 17:57:53 GMT
> Ensure scheduler can be run continuously without needing restarts

+1

On Mon, Nov 21, 2016 at 5:25 AM, David Batista <dba@hellofresh.com> wrote:
> A small request, which might be handy.
>
> Having the possibility to select multiple tasks and mark them as
> Success/Clear/etc.
>
> Allow the UI to select individual tasks (i.e., inside the Tree View) and
> then have a button to mark them as Success/Clear/etc.
>
> On 21 November 2016 at 14:22, Sergei Iakhnin <llevar@gmail.com> wrote:
>
>> I've been running Airflow on 1500 cores in the context of scientific
>> workflows for the past year and a half. Features that would be important to
>> me for 2.0:
>>
>> - Add FK to dag_run to the task_instance table on Postgres so that
>> task_instances can be uniquely attributed to dag runs.
>> - Ensure scheduler can be run continuously without needing restarts. Right
>> now it gets into some ill-determined bad state forcing me to restart it
>> every 20 minutes.
>> - Ensure scheduler can handle tens of thousands of active workflows. Right
>> now this results in extremely long scheduling times and inconsistent
>> scheduling even at 2 thousand active workflows.
>> - Add more flexible task scheduling prioritization. The default
>> prioritization is the opposite of the behaviour I want. I would prefer that
>> downstream tasks always have higher priority than upstream tasks to cause
>> entire workflows to tend to complete sooner, rather than scheduling tasks
>> from other workflows. Having a few scheduling prioritization strategies
>> would be beneficial here.
>> - Provide better support for manually-triggered DAGs on the UI i.e. by
>> showing them as queued.
>> - Provide some resource management capabilities via something like slots
>> that can be defined on workers and occupied by tasks. Using celery's
>> concurrency parameter at the airflow server level is too coarse-grained as
>> it forces all workers to be the same, and does not allow proper resource
>> management when different workflow tasks have different resource
>> requirements thus hurting utilization (a worker could run 8 parallel tasks
>> with small memory footprint, but only 1 task with large memory footprint
>> for instance).
>>
>> With best regards,
>>
>> Sergei.
>>
>>
>> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
>> ext-pavlo.ryabchuk@here.com>
>> wrote:
>>
>> > -1. We rely heavily on data profiling as a pipeline health monitoring
>> > tool.
>> >
>> > -----Original Message-----
>> > From: Chris Riccomini [mailto:criccomini@apache.org]
>> > Sent: Saturday, November 19, 2016 1:57 AM
>> > To: dev@airflow.incubator.apache.org
>> > Subject: Re: Airflow 2.0
>> >
>> > > RIP out the charting application and the data profiler
>> >
>> > Yes please! +1
>> >
>> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
>> > maximebeauchemin@gmail.com> wrote:
>> > > Another point that may be controversial for Airflow 2.0: RIP out the
>> > > charting application and the data profiler. Even though it's nice to
>> > > have it there, it's just out of scope and has major security
>> > > issues/implications.
>> > >
>> > > I'm not sure how popular it actually is. We may need to run a survey
>> > > at some point around these kinds of questions.
>> > >
>> > > Max
>> > >
>> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
>> > > maximebeauchemin@gmail.com> wrote:
>> > >
>> > >> Using FAB's Model, we get pretty much all of that (REST API,
>> > >> auth/perms, CRUD) for free:
>> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
>> > >>
>> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
>> > >> for Superset/Caravel.
>> > >>
>> > >> All that's needed is to derive from FAB's model class instead of
>> > >> SQLAlchemy's model class (which FAB's model wraps and adds
>> > >> functionality to and is 100% compatible AFAICT).
>> > >>
>> > >> Max
>> > >>
>> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
>> > >> <criccomini@apache.org>
>> > >> wrote:
>> > >>
>> > >>> > It may be doable to run this as a different package
>> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
>> > >>> > rip out the old UI off of the main package.
>> > >>>
>> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
>> > >>> can build the new UI in parallel, and then delete the old one later.
>> > >>> I really think that a REST interface should be a pre-req to any
>> > >>> large/new UI changes, though. Getting unified so that everything is
>> > >>> driven through REST will be a big win.
>> > >>>
>> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
>> > >>> <maximebeauchemin@gmail.com> wrote:
>> > >>> > A multi-tenant UI with composable roles on top of granular
>> > >>> > permissions.
>> > >>> >
>> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
>> > >>> > easy-ish win (since they're both Flask). FAB provides a good
>> > >>> > authentication and permission model that ships out-of-the-box with
>> > >>> > a REST API. It suffices to define FAB models (derivatives of
>> > >>> > SQLAlchemy's model) and you get a set of perms for the model
>> > >>> > (can_show, can_list, can_add, can_change, can_delete, ...) and a
>> > >>> > set of CRUD REST endpoints. It would also allow us to rip the
>> > >>> > authentication backend code out of Airflow and rely on FAB for
>> > >>> > that. Also, every single view gets permissions auto-created for it,
>> > >>> > and there are easy ways to define row-level type filters based on
>> > >>> > user permissions.
>> > >>> >
>> > >>> > It may be doable to run this as a different package
>> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
>> > >>> > rip out the old UI off of the main package.
>> > >>> >
>> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
>> > >>> >
>> > >>> > I'd love to carve some time and lead this.
>> > >>> >
>> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
>> > >>> > <criccomini@apache.org> wrote:
>> > >>> >
>> > >>> >> Full-fledged REST API (that the UI also uses) would be great in
>> > >>> >> 2.0.
>> > >>> >>
>> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <kegs@b23.io> wrote:
>> > >>> >> > Hi All,
>> > >>> >> >
>> > >>> >> > We have been using Airflow heavily for the last couple months
>> > >>> >> > and it’s been great so far. Here are a few things we’d like to
>> > >>> >> > see prioritized in 2.0.
>> > >>> >> >
>> > >>> >> > 1) Role based access to DAGs:
>> > >>> >> > We would like to see better role based access through the UI.
>> > >>> >> > There’s a related ticket out there but it hasn’t seen any action
>> > >>> >> > in a few months
>> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
>> > >>> >> >
>> > >>> >> > We use a templating system to create/deploy DAGs dynamically
>> > >>> >> > based on some directory/file structure. This allows analysts to
>> > >>> >> > quickly deploy and schedule their ETL code without having to
>> > >>> >> > interact with the Airflow installation directly. It would be
>> > >>> >> > great if those same analysts could access their own DAGs in the
>> > >>> >> > UI so that they can clear DAG runs, mark success, etc. while
>> > >>> >> > keeping them away from our core ETL and other
>> > >>> >> > people's/organization's DAGs. Some of this can be accomplished
>> > >>> >> > with ‘filter by owner’ but it doesn’t address the use case where
>> > >>> >> > a DAG can be maintained by multiple users in the same
>> > >>> >> > organization when they have separate Airflow user accounts.
>> > >>> >> >
>> > >>> >> > 2) An option to turn off backfill:
>> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
>> > >>> >> > For cases where a DAG does an insert overwrite on a table every
>> > >>> >> > day. This might be a realistic option for the current version
>> > >>> >> > but I just wanted to call attention to this feature request.
>> > >>> >> >
>> > >>> >> > Best,
>> > >>> >> > David
>> > >>> >> >
>> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin
>> > >>> >> > <maximebeauchemin@gmail.com> wrote:
>> > >>> >> >
>> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
>> > >>> >> >
>> > >>> >> > I wanted to share some ideas around what I would like to do in
>> > >>> >> > Airflow 2.0 and would love to hear what others are thinking.
>> > >>> >> > I'll compile the ideas that are shared in this thread in a Wiki
>> > >>> >> > once the conversation fades.
>> > >>> >> > -------------------------------------------
>> > >>> >> >
>> > >>> >> > First idea, to get the conversation started:
>> > >>> >> >
>> > >>> >> > *Breaking down the package*
>> > >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver
>> > >>> >> > airflow-operators-googlecloud ...`
>> > >>> >> >
>> > >>> >> > It seems to me like we're getting to a point where having
>> > >>> >> > different repositories and different packages would make things
>> > >>> >> > much easier in all sorts of ways. For instance the web server is
>> > >>> >> > a lot less sensitive than the scheduler, and changes to
>> > >>> >> > operators should/could be deployed at will, independently from
>> > >>> >> > the main package. People in their environment could upgrade
>> > >>> >> > only certain packages when needed. Travis builds would be more
>> > >>> >> > targeted, and take less time, ...
>> > >>> >> >
>> > >>> >> > Also, the whole current "extras_require" approach to optional
>> > >>> >> > dependencies (in setup.py) is kind of getting out of hand.
>> > >>> >> >
>> > >>> >> > Of course `pip install airflow` would bring in a collection of
>> > >>> >> > sub-packages similar in functionality to what it does now,
>> > >>> >> > perhaps without so many operators you probably don't need in
>> > >>> >> > your environment.
>> > >>> >> >
>> > >>> >> > The release process is the main pain-point and the biggest risk
>> > >>> >> > for the project, and I feel like this is a solid solution to
>> > >>> >> > address it.
>> > >>> >> >
>> > >>> >> > Max
>> > >>> >> >
>> > >>> >>
>> > >>>
>> > >>
>> > >>
>> >
>> --
>>
>> Sergei
>>
>
>
>
> --
> *David Batista* *Data Engineer**, HelloFresh Global*
> Saarbrücker Str. 37a | 10405 Berlin
> dba@hellofresh.com <email@hellofresh.com>
>

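Note on the prioritization request quoted above (Sergei's point that downstream tasks should outrank upstream ones): the following is a minimal sketch of what "a few scheduling prioritization strategies" could look like, not Airflow's actual implementation. It assumes every task has a base weight of 1 and that the DAG is given as a plain dict mapping each task to its direct downstream tasks. The names `priority_weights`, `reverse_edges`, `reachable`, and the example task ids are hypothetical. Under the "downstream" strategy a task's weight grows with the number of tasks that depend on it, which favours starting new workflows, as the thread describes the default. Under the "upstream" strategy the weight grows with the number of tasks already behind it, which favours finishing workflows that are in flight.

```python
from typing import Dict, List, Set

# Hypothetical toy representation: task id -> ids of its direct downstream tasks.
Dag = Dict[str, List[str]]


def all_tasks(dag: Dag) -> Set[str]:
    """Collect every task id, including leaves that only appear as children."""
    tasks = set(dag)
    for children in dag.values():
        tasks.update(children)
    return tasks


def reverse_edges(dag: Dag) -> Dag:
    """Return the same graph with every edge flipped (task -> direct upstream)."""
    reversed_dag: Dag = {task: [] for task in all_tasks(dag)}
    for task, children in dag.items():
        for child in children:
            reversed_dag[child].append(task)
    return reversed_dag


def reachable(graph: Dag, start: str) -> Set[str]:
    """Return all tasks reachable from `start`, excluding `start` itself."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def priority_weights(dag: Dag, strategy: str = "downstream") -> Dict[str, int]:
    """Assign each task a weight of 1 plus the number of related tasks.

    "downstream": count transitive downstream tasks (early tasks win).
    "upstream":   count transitive upstream tasks (late tasks win).
    """
    if strategy == "downstream":
        graph = dag
    elif strategy == "upstream":
        graph = reverse_edges(dag)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {task: 1 + len(reachable(graph, task)) for task in sorted(all_tasks(dag))}


# Assumed example workflow; a scheduler would pick the highest weight first.
example_dag: Dag = {
    "extract": ["clean", "validate"],
    "clean": ["load"],
    "validate": ["load"],
    "load": [],
}
downstream_first = priority_weights(example_dag, "downstream")
upstream_first = priority_weights(example_dag, "upstream")
```

With the example workflow, "extract" gets the highest weight under the "downstream" strategy and "load" gets the highest weight under "upstream", so a scheduler sorting queued tasks by weight would prefer to complete a started workflow in the second case.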