airflow-dev mailing list archives

From Gerard Toonstra <gtoons...@gmail.com>
Subject Re: Airflow 2.0
Date Mon, 21 Nov 2016 21:55:18 GMT
+1 on driving everything through a REST API, including the UI. This unifies
access to the scheduler and increases stability.

Consider running a very small web server (node.js + socket.io) that lets
Airflow push scheduler events, as they happen, to anything connected to it
through socket.io, including browsers. That way the scheduler can forward
task state changes to the UI, and explicit refreshes are no longer needed.
This can be optional: if the node.js server is not running, nothing breaks,
because the standard REST API still returns the latest state.
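
To make the push-plus-fallback idea concrete, here is a minimal sketch in
plain Python rather than node.js. It is not Airflow code: TaskStateStore,
subscribe, set_state and get_state are names made up for illustration, and a
real setup would forward the events over socket.io instead of calling
in-process callbacks. It only shows that pushing is best-effort while the
latest state stays readable for a normal REST-style poll.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical key shape for a task: (dag_id, task_id).
TaskKey = Tuple[str, str]
Listener = Callable[[TaskKey, str], None]


class TaskStateStore:
    """Keeps the latest state per task and optionally pushes changes out."""

    def __init__(self) -> None:
        self._states: Dict[TaskKey, str] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        # A listener stands in for a push channel such as a socket.io bridge.
        self._listeners.append(listener)

    def set_state(self, key: TaskKey, state: str) -> None:
        # Record the state first so polling always sees the latest value.
        self._states[key] = state
        for listener in self._listeners:
            try:
                listener(key, state)
            except Exception:
                # Pushing is best-effort: a broken or missing push channel
                # must never affect the scheduler itself.
                pass

    def get_state(self, key: TaskKey) -> Optional[str]:
        # What a plain REST endpoint would return when polled.
        return self._states.get(key)


store = TaskStateStore()
received = []
store.subscribe(lambda key, state: received.append((key, state)))

store.set_state(("example_dag", "extract"), "running")
store.set_state(("example_dag", "extract"), "success")

print(received)                                     # pushed events, in order
print(store.get_state(("example_dag", "extract")))  # polled latest state
```

With no listeners subscribed, set_state still records the state, which is the
"if the node.js server is not there" case.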



On Mon, Nov 21, 2016 at 6:57 PM, Chris Riccomini <criccomini@apache.org>
wrote:

> > Ensure scheduler can be run continuously without needing restarts
>
> +1
>
> On Mon, Nov 21, 2016 at 5:25 AM, David Batista <dba@hellofresh.com> wrote:
> > A small request, which might be handy.
> >
> > Having the possibility to select multiple tasks and mark them as
> > Success/Clear/etc.
> >
> > Allow the UI to select individual tasks (i.e., inside the Tree View) and
> > then have a button to mark them as Success/Clear/etc.
> >
> > On 21 November 2016 at 14:22, Sergei Iakhnin <llevar@gmail.com> wrote:
> >
> >> I've been running Airflow on 1500 cores in the context of scientific
> >> workflows for the past year and a half. Features that would be
> >> important to me for 2.0:
> >>
> >> - Add FK to dag_run to the task_instance table on Postgres so that
> >> task_instances can be uniquely attributed to dag runs.
> >> - Ensure scheduler can be run continuously without needing restarts.
> >> Right now it gets into some ill-determined bad state forcing me to
> >> restart it every 20 minutes.
> >> - Ensure scheduler can handle tens of thousands of active workflows.
> >> Right now this results in extremely long scheduling times and
> >> inconsistent scheduling even at 2 thousand active workflows.
> >> - Add more flexible task scheduling prioritization. The default
> >> prioritization is the opposite of the behaviour I want. I would prefer
> >> that downstream tasks always have higher priority than upstream tasks
> >> to cause entire workflows to tend to complete sooner, rather than
> >> scheduling tasks from other workflows. Having a few scheduling
> >> prioritization strategies would be beneficial here.
> >> - Provide better support for manually-triggered DAGs on the UI i.e. by
> >> showing them as queued.
> >> - Provide some resource management capabilities via something like slots
> >> that can be defined on workers and occupied by tasks. Using celery's
> >> concurrency parameter at the airflow server level is too coarse-grained
> >> as it forces all workers to be the same, and does not allow proper
> >> resource management when different workflow tasks have different
> >> resource requirements thus hurting utilization (a worker could run 8
> >> parallel tasks with small memory footprint, but only 1 task with large
> >> memory footprint for instance).
> >>
> >> With best regards,
> >>
> >> Sergei.
> >>
> >>
> >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
> >> ext-pavlo.ryabchuk@here.com>
> >> wrote:
> >>
> >> > -1. We rely heavily on data profiling as a pipeline health
> >> > monitoring tool.
> >> >
> >> > -----Original Message-----
> >> > From: Chris Riccomini [mailto:criccomini@apache.org]
> >> > Sent: Saturday, November 19, 2016 1:57 AM
> >> > To: dev@airflow.incubator.apache.org
> >> > Subject: Re: Airflow 2.0
> >> >
> >> > > RIP out the charting application and the data profiler
> >> >
> >> > Yes please! +1
> >> >
> >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> >> > maximebeauchemin@gmail.com> wrote:
> >> > > Another point that may be controversial for Airflow 2.0: RIP out the
> >> > > charting application and the data profiler. Even though it's nice to
> >> > > have it there, it's just out of scope and has major security
> >> > > issues/implications.
> >> > >
> >> > > I'm not sure how popular it actually is. We may need to run a survey
> >> > > at some point around these kinds of questions.
> >> > >
> >> > > Max
> >> > >
> >> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> >> > > maximebeauchemin@gmail.com> wrote:
> >> > >
> >> > >> Using FAB's Model, we get pretty much all of that (REST API,
> >> > >> auth/perms, CRUD) for free:
> >> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> >> > >>
> >> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
> >> > >> for Superset/Caravel.
> >> > >>
> >> > >> All that's needed is to derive FAB's model class instead of
> >> > >> SqlAlchemy's model class (which FAB's model wraps and adds
> >> > >> functionality to and is 100% compatible AFAICT).
> >> > >>
> >> > >> Max
> >> > >>
> >> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> >> > >> <criccomini@apache.org>
> >> > >> wrote:
> >> > >>
> >> > >>> > It may be doable to run this as a different package
> >> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
> >> > >>> > rip out the old UI off of the main package.
> >> > >>>
> >> > >>> This is the same strategy that I was thinking of for AIRFLOW-85.
> >> > >>> You can build the new UI in parallel, and then delete the old one
> >> > >>> later. I really think that a REST interface should be a pre-req to
> >> > >>> any large/new UI changes, though. Getting unified so that
> >> > >>> everything is driven through REST will be a big win.
> >> > >>>
> >> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> >> > >>> <maximebeauchemin@gmail.com> wrote:
> >> > >>> > A multi-tenant UI with composable roles on top of granular
> >> > >>> > permissions.
> >> > >>> >
> >> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
> >> > >>> > easy-ish win (since they're both Flask). FAB provides a good
> >> > >>> > authentication and permission model that ships out-of-the-box
> >> > >>> > with a REST api. It suffices to define FAB models (derivative of
> >> > >>> > SQLAlchemy's model) and you get a set of perms for the model
> >> > >>> > (can_show, can_list, can_add, can_change, can_delete, ...) and a
> >> > >>> > set of CRUD REST endpoints. It would also allow us to rip the
> >> > >>> > authentication backend code out of Airflow and rely on FAB for
> >> > >>> > that. Also every single view gets permissions auto-created for
> >> > >>> > it, and there are easy ways to define row-level type filters
> >> > >>> > based on user permissions.
> >> > >>> >
> >> > >>> > It may be doable to run this as a different package
> >> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
> >> > >>> > rip out the old UI off of the main package.
> >> > >>> >
> >> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
> >> > >>> >
> >> > >>> > I'd love to carve some time and lead this.
> >> > >>> >
> >> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
> >> > >>> > <criccomini@apache.org> wrote:
> >> > >>> >
> >> > >>> >> Full-fledged REST API (that the UI also uses) would be great in
> >> > >>> >> 2.0.
> >> > >>> >>
> >> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <kegs@b23.io>
> >> > >>> >> wrote:
> >> > >>> >> > Hi All,
> >> > >>> >> >
> >> > >>> >> > We have been using Airflow heavily for the last couple months
> >> > >>> >> > and it’s been great so far. Here are a few things we’d like to
> >> > >>> >> > see prioritized in 2.0.
> >> > >>> >> >
> >> > >>> >> > 1) Role based access to DAGs:
> >> > >>> >> > We would like to see better role based access through the UI.
> >> > >>> >> > There’s a related ticket out there but it hasn’t seen any
> >> > >>> >> > action in a few months:
> >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
> >> > >>> >> >
> >> > >>> >> > We use a templating system to create/deploy DAGs dynamically
> >> > >>> >> > based on some directory/file structure. This allows analysts to
> >> > >>> >> > quickly deploy and schedule their ETL code without having to
> >> > >>> >> > interact with the Airflow installation directly. It would be
> >> > >>> >> > great if those same analysts could access their own DAGs in the
> >> > >>> >> > UI so that they can clear DAG runs, mark success, etc. while
> >> > >>> >> > keeping them away from our core ETL and other
> >> > >>> >> > people's/organization's DAGs. Some of this can be accomplished
> >> > >>> >> > with ‘filter by owner’ but it doesn’t address the use case
> >> > >>> >> > where a DAG can be maintained by multiple users in the same
> >> > >>> >> > organization when they have separate Airflow user accounts.
> >> > >>> >> >
> >> > >>> >> > 2) An option to turn off backfill:
> >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
> >> > >>> >> > For cases where a DAG does an insert overwrite on a table every
> >> > >>> >> > day. This might be a realistic option for the current version
> >> > >>> >> > but I just wanted to call attention to this feature request.
> >> > >>> >> >
> >> > >>> >> > Best,
> >> > >>> >> > David
> >> > >>> >> >
> >> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin
> >> > >>> >> > <maximebeauchemin@gmail.com> wrote:
> >> > >>> >> >
> >> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
> >> > >>> >> >
> >> > >>> >> > I wanted to share some ideas around what I would like to do in
> >> > >>> >> > Airflow 2.0 and would love to hear what others are thinking.
> >> > >>> >> > I'll compile the ideas that are shared in this thread in a Wiki
> >> > >>> >> > once the conversation fades.
> >> > >>> >> >
> >> > >>> >> > -------------------------------------------
> >> > >>> >> >
> >> > >>> >> > First idea, to get the conversation started:
> >> > >>> >> >
> >> > >>> >> > *Breaking down the package*
> >> > >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver
> >> > >>> >> > airflow-operators-googlecloud ...`
> >> > >>> >> >
> >> > >>> >> > It seems to me like we're getting to a point where having
> >> > >>> >> > different repositories and different packages would make things
> >> > >>> >> > much easier in all sorts of ways. For instance the web server
> >> > >>> >> > is a lot less sensitive than the scheduler, and changes to
> >> > >>> >> > operators should/could be deployed at will, independently from
> >> > >>> >> > the main package. People in their environment could upgrade
> >> > >>> >> > only certain packages when needed. Travis builds would be more
> >> > >>> >> > targeted, and take less time, ...
> >> > >>> >> >
> >> > >>> >> > Also, the whole current "extras_require" approach to optional
> >> > >>> >> > dependencies (in setup.py) is kind of getting out of hand.
> >> > >>> >> >
> >> > >>> >> > Of course `pip install airflow` would bring in a collection of
> >> > >>> >> > sub-packages similar in functionality to what it does now,
> >> > >>> >> > perhaps without so many operators you probably don't need in
> >> > >>> >> > your environment.
> >> > >>> >> >
> >> > >>> >> > The release process is the main pain-point and the biggest risk
> >> > >>> >> > for the project, and I feel like this is a solid solution to
> >> > >>> >> > address it.
> >> > >>> >> >
> >> > >>> >> > Max
> >> > >>> >> >
> >> > >>> >>
> >> > >>>
> >> > >>
> >> > >>
> >> >
> >> --
> >>
> >> Sergei
> >>
> >
> >
> >
> > --
> > *David Batista* *Data Engineer**, HelloFresh Global*
> > Saarbrücker Str. 37a | 10405 Berlin
> > dba@hellofresh.com <email@hellofresh.com>
>
