airflow-dev mailing list archives

From Gurer Kiratli <gurer.kira...@airbnb.com.INVALID>
Subject Re: Airflow 2.0
Date Mon, 12 Dec 2016 16:04:48 GMT
Hi folks,

Here is the list
<https://cwiki.apache.org/confluence/display/AIRFLOW/2017+Roadmap+Items> of
possible roadmap items for 2017. I think that clubbing deliverables into
1.9 or 2.0 is orthogonal to our high level 2017 planning so I went with
this approach.

Please take a look at the wiki by the end of the week and let me know if
anything is missing or needs further clarification. I will send out a
survey next week to get a sense of priorities across the community.

Let me know if you have any questions.

Cheers,

Gurer

On Tue, Dec 6, 2016 at 11:15 PM, Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> I spoke with Gurer yesterday and he's going to summarize and send a survey.
> It should be out this week.
>
> Max
>
> On Tue, Dec 6, 2016 at 7:24 PM, siddharth anand <sanand@apache.org> wrote:
>
> > Max,
> > Do you have time to summarize this thread? Perhaps publish it on the Wiki!
> > -s
> >
> > On Thu, Dec 1, 2016 at 12:27 PM, Van Klaveren, Brian N. <
> > bvan@slac.stanford.edu> wrote:
> >
> > > With the announcement of AWS Batch (https://aws.amazon.com/batch/), and
> > > my own selfish needs, I think it'd be really great to generally support
> > > batch systems like AWS Batch, Slurm, and Torque as executors, potentially
> > > with an extension of the BashOperator, but I think it might actually be
> > > flexible enough to not need a dedicated BatchOperator.
> > >
> > > Brian
> > >
> > >
> > > On Nov 24, 2016, at 7:40 AM, Maycock, Luke <luke.maycock@affiliate.oliverwyman.com> wrote:
> > >
> > > Add FK to dag_run to the task_instance table on Postgres so that
> > > task_instances can be uniquely attributed to dag runs.
> > >
> > >
> > > + 1
> > >
> > >
> > > Also, I believe xcoms would need to be addressed in the same way at the
> > > same time - I have added a comment to that effect on
> > > https://issues.apache.org/jira/browse/AIRFLOW-642
> > >
> > >
> > > I believe this would be implemented for all supported back-ends, not just
> > > PostgreSQL.
> > >
> > >
> > > Cheers,
> > > Luke Maycock
> > > OLIVER WYMAN
> > > luke.maycock@affiliate.oliverwyman.com
> > > www.oliverwyman.com
> > >
> > >
> > >
> > > ________________________________
> > > From: Arunprasad Venkatraman <arpras@uber.com>
> > > Sent: 21 November 2016 18:16
> > > To: dev@airflow.incubator.apache.org
> > > Subject: Re: Airflow 2.0
> > >
> > > Add FK to dag_run to the task_instance table on Postgres so that
> > > task_instances can be uniquely attributed to dag runs.
> > > Ensure scheduler can be run continuously without needing restarts.
> > > Ensure scheduler can handle tens of thousands of active workflows
> > >
> > > +1
> > >
> > > We are planning to run around 40,000 tasks a day using Airflow, and some of
> > > them are critical for giving quick feedback to developers. Using the
> > > execution date to uniquely identify tasks does not work for us, since we
> > > mainly trigger DAGs (instead of running them on a schedule), and we have
> > > collided at the 1-second granularity on several occasions. Having a task
> > > UUID, or associating dag_run with task_instance as Sergei suggested, would
> > > help mitigate this issue for us and would make it easy for us to update task
> > > results too. We would be happy to start working on this if it makes sense.
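
A minimal sketch of the dag_run association discussed above, assuming a simplified, hypothetical schema (the table and column names and the `build_feedback` DAG are illustrative, not Airflow's actual metadata schema). It uses Python's built-in sqlite3 rather than Postgres or MySQL purely so that it runs standalone. Two manually triggered runs share the same execution date, yet their task instances stay distinguishable through the foreign key:

```python
import sqlite3

# Hypothetical, simplified schema; not Airflow's real metadata tables.
SCHEMA = """
CREATE TABLE dag_run (
    id INTEGER PRIMARY KEY,
    dag_id TEXT NOT NULL,
    execution_date TEXT NOT NULL
);
CREATE TABLE task_instance (
    id INTEGER PRIMARY KEY,
    dag_run_id INTEGER NOT NULL REFERENCES dag_run(id),
    task_id TEXT NOT NULL,
    state TEXT,
    UNIQUE (dag_run_id, task_id)
);
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    conn.executescript(SCHEMA)

    # Two triggered runs that land on the same second.
    same_second = "2016-11-21T18:16:00"
    run_ids = []
    for _ in range(2):
        cur = conn.execute(
            "INSERT INTO dag_run (dag_id, execution_date) VALUES (?, ?)",
            ("build_feedback", same_second),
        )
        run_ids.append(cur.lastrowid)

    # Each run gets its own instance of the same task.
    for run_id in run_ids:
        conn.execute(
            "INSERT INTO task_instance (dag_run_id, task_id, state) "
            "VALUES (?, ?, ?)",
            (run_id, "compile", "queued"),
        )

    # Update the result of the first run only; the second is untouched.
    conn.execute(
        "UPDATE task_instance SET state = 'success' "
        "WHERE dag_run_id = ? AND task_id = ?",
        (run_ids[0], "compile"),
    )
    rows = conn.execute(
        "SELECT dag_run_id, state FROM task_instance ORDER BY dag_run_id"
    ).fetchall()
    conn.close()
    return run_ids, rows
```

Keying task instances on the run's surrogate id instead of (dag_id, execution_date) is what removes the collision; whether a UUID or an integer id is preferable is left open here, as it is in the thread.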
> > >
> > > Also, we are wondering whether any work has been done in the community to
> > > support multiple schedulers (or alternatives to MySQL/Postgres), because a
> > > single scheduler does not scale well for us: we sometimes see slowdowns of
> > > up to a couple of minutes when there are several pending tasks.
> > >
> > > Thanks
> > >
> > >
> > >
> > > On Mon, Nov 21, 2016 at 9:57 AM, Chris Riccomini <criccomini@apache.org>
> > > wrote:
> > >
> > > Ensure scheduler can be run continuously without needing restarts
> > >
> > > +1
> > >
> > > On Mon, Nov 21, 2016 at 5:25 AM, David Batista <dba@hellofresh.com> wrote:
> > > A small request, which might be handy.
> > >
> > > Having the possibility to select multiple tasks and mark them as
> > > Success/Clear/etc.
> > >
> > > Allow the UI to select individual tasks (i.e., inside the Tree View)
> and
> > > then have a button to mark them as Success/Clear/etc.
> > >
> > > On 21 November 2016 at 14:22, Sergei Iakhnin <llevar@gmail.com> wrote:
> > >
> > > I've been running Airflow on 1500 cores in the context of scientific
> > > workflows for the past year and a half. Features that would be
> > > important to me for 2.0:
> > >
> > > - Add FK to dag_run to the task_instance table on Postgres so that
> > > task_instances can be uniquely attributed to dag runs.
> > > - Ensure scheduler can be run continuously without needing restarts. Right
> > > now it gets into some ill-determined bad state forcing me to restart it
> > > every 20 minutes.
> > > - Ensure scheduler can handle tens of thousands of active workflows. Right
> > > now this results in extremely long scheduling times and inconsistent
> > > scheduling even at 2 thousand active workflows.
> > > - Add more flexible task scheduling prioritization. The default
> > > prioritization is the opposite of the behaviour I want. I would prefer that
> > > downstream tasks always have higher priority than upstream tasks to cause
> > > entire workflows to tend to complete sooner, rather than scheduling tasks
> > > from other workflows. Having a few scheduling prioritization strategies
> > > would be beneficial here.
> > > - Provide better support for manually-triggered DAGs on the UI i.e. by
> > > showing them as queued.
> > > - Provide some resource management capabilities via something like slots
> > > that can be defined on workers and occupied by tasks. Using celery's
> > > concurrency parameter at the airflow server level is too coarse-grained as
> > > it forces all workers to be the same, and does not allow proper resource
> > > management when different workflow tasks have different resource
> > > requirements thus hurting utilization (a worker could run 8 parallel tasks
> > > with small memory footprint, but only 1 task with large memory footprint
> > > for instance).
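
The slot idea in the last item can be sketched as follows. This is an illustrative model only, not an existing Airflow or Celery API: the `Worker` class, the `assign` function, and the choice of greedy first-fit placement are all assumptions made for the example. Each worker advertises a number of slots, each task declares how many it occupies, and a task is placed only where enough slots remain:

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that advertises a fixed number of slots."""
    name: str
    total_slots: int
    used_slots: int = 0
    running: list = field(default_factory=list)

    def try_assign(self, task_id, slots_needed):
        # Refuse the task if it would exceed this worker's capacity.
        if self.used_slots + slots_needed > self.total_slots:
            return False
        self.used_slots += slots_needed
        self.running.append(task_id)
        return True


def assign(workers, tasks):
    """Greedy first-fit placement.

    `tasks` is a list of (task_id, slots_needed) pairs. Returns a dict
    mapping each task_id to a worker name, or None if nothing had room.
    """
    placement = {}
    for task_id, slots_needed in tasks:
        placement[task_id] = None
        for worker in workers:
            if worker.try_assign(task_id, slots_needed):
                placement[task_id] = worker.name
                break
    return placement
```

With an 8-slot worker, eight 1-slot tasks fit side by side, whereas a single 8-slot task fills the worker on its own, which mirrors the small versus large memory footprint example above. Workers with different `total_slots` values can coexist, which is the flexibility a single global concurrency setting lacks.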
> > >
> > > With best regards,
> > >
> > > Sergei.
> > >
> > >
> > > On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <ext-pavlo.ryabchuk@here.com>
> > > wrote:
> > >
> > > -1. We rely heavily on data profiling as a pipeline health monitoring tool.
> > >
> > > -----Original Message-----
> > > From: Chris Riccomini [mailto:criccomini@apache.org]
> > > Sent: Saturday, November 19, 2016 1:57 AM
> > > To: dev@airflow.incubator.apache.org
> > > Subject: Re: Airflow 2.0
> > >
> > > RIP out the charting application and the data profiler
> > >
> > > Yes please! +1
> > >
> > > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
> > > Another point that may be controversial for Airflow 2.0: RIP out the
> > > charting application and the data profiler. Even though it's nice to
> > > have it there, it's just out of scope and has major security
> > > issues/implications.
> > >
> > > I'm not sure how popular it actually is. We may need to run a survey
> > > at some point around these kinds of questions.
> > >
> > > Max
> > >
> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
> > >
> > > Using FAB's Model, we get pretty much all of that (REST API, auth/perms,
> > > CRUD) for free:
> > > http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> > >
> > > I'm pretty intimate with FAB since I use it (and contributed to it)
> > > for Superset/Caravel.
> > >
> > > All that's needed is to derive FAB's model class instead of
> > > SqlAlchemy's model class (which FAB's model wraps and adds
> > > functionality to and is 100% compatible AFAICT).
> > >
> > > Max
> > >
> > > On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini <criccomini@apache.org>
> > > wrote:
> > >
> > > It may be doable to run this as a different package `airflow-webserver`, an
> > > alternate UI at first, and to eventually rip out the old UI off of the main
> > > package.
> > >
> > > This is the same strategy that I was thinking of for AIRFLOW-85. You
> > > can build the new UI in parallel, and then delete the old one later.
> > > I really think that a REST interface should be a pre-req to any
> > > large/new UI changes, though. Getting unified so that everything is
> > > driven through REST will be a big win.
> > >
> > > On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
> > > A multi-tenant UI with composable roles on top of granular permissions.
> > >
> > > Migrating from Flask-Admin to Flask App Builder would be an
> > > easy-ish win (since they're both Flask). FAB provides a good
> > > authentication and permission model that ships out-of-the-box with
> > > a REST API. It suffices to define FAB models (derivatives of
> > > SQLAlchemy's model) and you get a set of perms for the model (can_show,
> > > can_list, can_add, can_change, can_delete, ...) and a set of CRUD REST
> > > endpoints. It would also allow us to rip the authentication backend
> > > code out of Airflow and rely on FAB for that.
> > > Also, every single view gets permissions auto-created for it, and there
> > > are easy ways to define row-level type filters based on user permissions.
> > >
> > > It may be doable to run this as a different package `airflow-webserver`, an
> > > alternate UI at first, and to eventually rip out the old UI off of the main
> > > package.
> > >
> > > https://flask-appbuilder.readthedocs.io/en/latest/
> > >
> > > I'd love to carve some time and lead this.
> > >
> > > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini <criccomini@apache.org>
> > > wrote:
> > >
> > > Full-fledged REST API (that the UI also uses) would be great in 2.0.
> > >
> > > On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <kegs@b23.io> wrote:
> > > Hi All,
> > >
> > > We have been using Airflow heavily for the last couple of months and it’s
> > > been great so far. Here are a few things we’d like to see prioritized in
> > > 2.0.
> > >
> > > 1) Role based access to DAGs:
> > > We would like to see better role based access through the UI. There’s a
> > > related ticket out there but it hasn’t seen any action in a few months:
> > > https://issues.apache.org/jira/browse/AIRFLOW-85
> > >
> > > We use a templating system to create/deploy DAGs dynamically based on
> > > some directory/file structure. This allows analysts to quickly deploy and
> > > schedule their ETL code without having to interact with the
> > > Airflow installation directly. It would be great if those same
> > > analysts could access their own DAGs in the UI so that they
> > > can clear DAG runs, mark success, etc. while keeping them away from our
> > > core ETL and other people's/organization's DAGs. Some of this can be
> > > accomplished with ‘filter by owner’ but it doesn’t address the use case
> > > where a DAG can be maintained by multiple users in the same organization
> > > when they have separate Airflow user accounts.
> > >
> > > 2) An option to turn off backfill:
> > > https://issues.apache.org/jira/browse/AIRFLOW-558
> > > For cases where a DAG does an insert overwrite on a table every day.
> > > This might be a realistic option for the current version but I just wanted
> > > to call attention to this feature request.
> > >
> > > Best,
> > > David
> > >
> > > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin <maximebeauchemin@gmail.com>
> > > wrote:
> > >
> > > *This is a brainstorm email thread about Airflow 2.0!*
> > >
> > > I wanted to share some ideas around what I would like to do in Airflow 2.0
> > > and would love to hear what others are thinking. I'll compile the ideas
> > > that are shared in this thread in a Wiki once the conversation fades.
> > >
> > > -------------------------------------------
> > >
> > > First idea, to get the conversation started:
> > >
> > > *Breaking down the package*
> > > `pip install airflow-common airflow-scheduler airflow-webserver airflow-operators-googlecloud ...`
> > >
> > > It seems to me like we're getting to a point where having
> > > different repositories and different packages would make things
> > > much easier in all sorts of ways. For instance the web server is a lot
> > > less sensitive than the scheduler, and changes to operators should/could
> > > be deployed at will, independently from the main package. People in their
> > > environment could upgrade only certain packages when needed. Travis builds
> > > would be more targeted, and take less time, ...
> > >
> > > Also, the whole current "extras_require" approach to optional
> > > dependencies (in setup.py) is kind of getting out of hand.
> > >
> > > Of course `pip install airflow` would bring in a collection of
> > > sub-packages similar in functionality to what it does now, perhaps without
> > > so many operators you probably don't need in your environment.
> > >
> > > The release process is the main pain-point and the biggest risk for the
> > > project, and I feel like this is a solid solution to address it.
> > >
> > > Max
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > >
> > > Sergei
> > >
> > >
> > >
> > >
> > > --
> > > David Batista, Data Engineer, HelloFresh Global
> > > Saarbrücker Str. 37a | 10405 Berlin
> > > dba@hellofresh.com
> > >
> > >
> > >
> >
>
