airflow-dev mailing list archives

From siddharth anand <san...@apache.org>
Subject Re: Airflow 2.0
Date Tue, 22 Nov 2016 02:47:50 GMT
1) The restart should not be needed, but if folks are reporting it, I'm
curious what the problem might be. If you are running on master, then you
may not be aware of the min_file_process_interval setting.

[scheduler]
min_file_process_interval = 0
max_threads = 4
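
To double-check which values a config file actually carries, here is a
minimal sketch that parses the fragment above. It uses only Python's standard
configparser module, not Airflow's own configuration code, and the function
name read_scheduler_settings is a hypothetical helper made up for
illustration. The fragment is inlined as a string here; in practice you would
point the parser at your own airflow.cfg.

```python
import configparser

# The [scheduler] fragment from this message, inlined so the sketch is
# self-contained (an assumption; normally this lives in airflow.cfg).
CFG_FRAGMENT = """
[scheduler]
min_file_process_interval = 0
max_threads = 4
"""


def read_scheduler_settings(cfg_text):
    """Hypothetical helper: return the two scheduler settings as integers."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "min_file_process_interval": parser.getint(
            "scheduler", "min_file_process_interval"
        ),
        "max_threads": parser.getint("scheduler", "max_threads"),
    }


settings = read_scheduler_settings(CFG_FRAGMENT)
print(settings)
```

If either key is missing from the section, getint raises
configparser.NoOptionError, which is a quick way to spot a config file that
predates the setting.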

2) Yes.. security is not there. It's often something added to a maturing
project a little late in its growth - after feature completeness,
performance, etc... For example, Azkaban grew at LinkedIn to be widely
adopted for a few years before Azkaban2 came around and introduced security
features. If it's important to you, then vote. It may not be there on your
timeframe, but it will surely be something we land in 2017. Also, if you run
in the cloud, there are some options that can make your installation more
secure.

Great feedback. I know Max kicked this thread off in order to figure out
how to get his team to consider the community's needs when picking what to
fix. This information is in fact helpful to us all.

-s

On Mon, Nov 21, 2016 at 6:13 PM, Boris Tyukin <boris@boristyukin.com> wrote:

> I am still deciding between Airflow and oozie for our brand new Hadoop
> project but here are a few things that I did not like during my limited
> testing:
>
> 1) pain with scheduler/webserver restarts - things magically begin working
> after restart or disappear (like DAG tasks that are no longer part of DAG)
> 2) no security - a big deal for enterprise-like companies like the one I
> work for (a large healthcare organization).
> 3) backfill concept is a bit weird to me. I think Gerard put it pretty well
> - backfills should be run for the entire missing window, not day by day.
> Logging for backfills should be consistent with normal DAG Runs.
> 4) confusion around execution time and start time - i wish UI would clearly
> distinguish them. Execution time only covers interval to a previous DAG run -
> I wish it would go back to the LAST successful DAG run. That way I can rely on
> it to use it as a watermark for incremental processes.
> 5) UTC confusion - not all companies have the luxury to run all the systems
> on UTC.
>
>
> On Mon, Nov 21, 2016 at 5:26 PM, siddharth anand <sanand@apache.org>
> wrote:
>
> > Also, a survey will be a little less noisy and easier to summarize than +1s
> > in this email thread.
> > -s (Sid)
> >
> > On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand <sanand@apache.org>
> > wrote:
> >
> > > Sergei,
> > > These are some great ideas -- I would classify at least half of them as
> > > pain points.
> > >
> > > Folks!
> > > > I suggest people (on the dev list) keep feeding this thread at least for
> > > > the next 2 days. I can then float a survey based on these ideas and give
> > > > the community a chance to vote so we can prioritize the wish list.
> > >
> > > -s
> > >
> > > > On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin <llevar@gmail.com>
> > > > wrote:
> > >
> > > >> I've been running Airflow on 1500 cores in the context of scientific
> > > >> workflows for the past year and a half. Features that would be important
> > > >> to me for 2.0:
> > > >>
> > > >> - Add FK to dag_run to the task_instance table on Postgres so that
> > > >> task_instances can be uniquely attributed to dag runs.
> > > >> - Ensure scheduler can be run continuously without needing restarts. Right
> > > >> now it gets into some ill-determined bad state forcing me to restart it
> > > >> every 20 minutes.
> > > >> - Ensure scheduler can handle tens of thousands of active workflows. Right
> > > >> now this results in extremely long scheduling times and inconsistent
> > > >> scheduling even at 2 thousand active workflows.
> > > >> - Add more flexible task scheduling prioritization. The default
> > > >> prioritization is the opposite of the behaviour I want. I would prefer that
> > > >> downstream tasks always have higher priority than upstream tasks to cause
> > > >> entire workflows to tend to complete sooner, rather than scheduling tasks
> > > >> from other workflows. Having a few scheduling prioritization strategies
> > > >> would be beneficial here.
> > > >> - Provide better support for manually-triggered DAGs on the UI i.e. by
> > > >> showing them as queued.
> > > >> - Provide some resource management capabilities via something like slots
> > > >> that can be defined on workers and occupied by tasks. Using celery's
> > > >> concurrency parameter at the airflow server level is too coarse-grained as
> > > >> it forces all workers to be the same, and does not allow proper resource
> > > >> management when different workflow tasks have different resource
> > > >> requirements thus hurting utilization (a worker could run 8 parallel tasks
> > > >> with small memory footprint, but only 1 task with large memory footprint
> > > >> for instance).
> > >>
> > >> With best regards,
> > >>
> > >> Sergei.
> > >>
> > >>
> > >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
> > >> ext-pavlo.ryabchuk@here.com>
> > >> wrote:
> > >>
> > > >> > -1. We rely heavily on data profiling as a pipeline health monitoring
> > > >> > tool
> > >> >
> > >> > -----Original Message-----
> > >> > From: Chris Riccomini [mailto:criccomini@apache.org]
> > >> > Sent: Saturday, November 19, 2016 1:57 AM
> > >> > To: dev@airflow.incubator.apache.org
> > >> > Subject: Re: Airflow 2.0
> > >> >
> > >> > > RIP out the charting application and the data profiler
> > >> >
> > >> > Yes please! +1
> > >> >
> > >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> > >> > maximebeauchemin@gmail.com> wrote:
> > > >> > > Another point that may be controversial for Airflow 2.0: RIP out the
> > > >> > > charting application and the data profiler. Even though it's nice to
> > > >> > > have it there, it's just out of scope and has major security
> > > >> > > issues/implications.
> > > >> > >
> > > >> > > I'm not sure how popular it actually is. We may need to run a survey
> > > >> > > at some point around this kind of question.
> > >> > >
> > >> > > Max
> > >> > >
> > >> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> > >> > > maximebeauchemin@gmail.com> wrote:
> > >> > >
> > > >> > >> Using FAB's Model, we get pretty much all of that (REST API,
> > > >> > >> auth/perms, CRUD) for free:
> > > >> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> > >> > >>
> > > >> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
> > > >> > >> for Superset/Caravel.
> > > >> > >>
> > > >> > >> All that's needed is to derive FAB's model class instead of
> > > >> > >> SqlAlchemy's model class (which FAB's model wraps and adds
> > > >> > >> functionality to and is 100% compatible AFAICT).
> > >> > >>
> > >> > >> Max
> > >> > >>
> > >> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> > >> > >> <criccomini@apache.org>
> > >> > >> wrote:
> > >> > >>
> > > >> > >>> > It may be doable to run this as a different package
> > > >> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually rip
> > > >> > >>> > out the old UI off of the main package.
> > > >> > >>>
> > > >> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
> > > >> > >>> can build the new UI in parallel, and then delete the old one later.
> > > >> > >>> I really think that a REST interface should be a pre-req to any
> > > >> > >>> large/new UI changes, though. Getting unified so that everything is
> > > >> > >>> driven through REST will be a big win.
> > >> > >>>
> > >> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> > >> > >>> <maximebeauchemin@gmail.com> wrote:
> > > >> > >>> > A multi-tenant UI with composable roles on top of granular
> > > >> > >>> > permissions.
> > > >> > >>> >
> > > >> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
> > > >> > >>> > easy-ish win (since they're both Flask). FAB provides a good
> > > >> > >>> > authentication and permission model that ships out-of-the-box with
> > > >> > >>> > a REST api. Suffice to define FAB models (derivative of
> > > >> > >>> > SQLAlchemy's model) and you get a set of perms for the model
> > > >> > >>> > (can_show, can_list, can_add, can_change, can_delete, ...) and a set
> > > >> > >>> > of CRUD REST endpoints. It would also allow us to rip out the
> > > >> > >>> > authentication backend code out of Airflow and rely on FAB for that.
> > > >> > >>> > Also every single view gets permissions auto-created for it, and
> > > >> > >>> > there are easy ways to define row-level type filters based on user
> > > >> > >>> > permissions.
> > > >> > >>> >
> > > >> > >>> > It may be doable to run this as a different package
> > > >> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually rip
> > > >> > >>> > out the old UI off of the main package.
> > > >> > >>> >
> > > >> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
> > >> > >>> >
> > >> > >>> > I'd love to carve some time and lead this.
> > >> > >>> >
> > > >> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
> > > >> > >>> > <criccomini@apache.org> wrote:
> > > >> > >>> >
> > > >> > >>> >> Full-fledged REST API (that the UI also uses) would be great in 2.0.
> > >> > >>> >>
> > > >> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <kegs@b23.io> wrote:
> > > >> > >>> >> > Hi All,
> > > >> > >>> >> >
> > > >> > >>> >> > We have been using Airflow heavily for the last couple months and
> > > >> > >>> >> > it’s been great so far. Here are a few things we’d like to see
> > > >> > >>> >> > prioritized in 2.0.
> > >> > >>> >> >
> > > >> > >>> >> > 1) Role based access to DAGs:
> > > >> > >>> >> > We would like to see better role based access through the UI.
> > > >> > >>> >> > There’s a related ticket out there but it hasn’t seen any action in
> > > >> > >>> >> > a few months
> > > >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
> > >> > >>> >> >
> > > >> > >>> >> > We use a templating system to create/deploy DAGs dynamically based
> > > >> > >>> >> > on some directory/file structure. This allows analysts to quickly
> > > >> > >>> >> > deploy and schedule their ETL code without having to interact with
> > > >> > >>> >> > the Airflow installation directly. It would be great if those same
> > > >> > >>> >> > analysts could access their own DAGs in the UI so that they can
> > > >> > >>> >> > clear DAG runs, mark success, etc. while keeping them away from our
> > > >> > >>> >> > core ETL and other people's/organization's DAGs. Some of this can
> > > >> > >>> >> > be accomplished with ‘filter by owner’ but it doesn’t address the
> > > >> > >>> >> > use case where a DAG can be maintained by multiple users in the
> > > >> > >>> >> > same organization when they have separate Airflow user accounts.
> > >> > >>> >> >
> > > >> > >>> >> > 2) An option to turn off backfill:
> > > >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
> > > >> > >>> >> > For cases where a DAG does an insert overwrite on a table every
> > > >> > >>> >> > day. This might be a realistic option for the current version but I
> > > >> > >>> >> > just wanted to call attention to this feature request.
> > >> > >>> >> >
> > >> > >>> >> > Best,
> > >> > >>> >> > David
> > >> > >>> >> >
> > > >> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin
> > > >> > >>> >> > <maximebeauchemin@gmail.com> wrote:
> > >> > >>> >> >
> > > >> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
> > > >> > >>> >> >
> > > >> > >>> >> > I wanted to share some ideas around what I would like to do in
> > > >> > >>> >> > Airflow 2.0 and would love to hear what others are thinking. I'll
> > > >> > >>> >> > compile the ideas that are shared in this thread in a Wiki once the
> > > >> > >>> >> > conversation fades.
> > >> > >>> >> >
> > >> > >>> >> > -------------------------------------------
> > >> > >>> >> >
> > >> > >>> >> > First idea, to get the conversation started:
> > >> > >>> >> >
> > >> > >>> >> > *Breaking down the package*
> > > >> > >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver
> > > >> > >>> >> > airflow-operators-googlecloud ...`
> > >> > >>> >> >
> > > >> > >>> >> > It seems to me like we're getting to a point where having
> > > >> > >>> >> > different repositories and different packages would make things
> > > >> > >>> >> > much easier in all sorts of ways. For instance the web server is a
> > > >> > >>> >> > lot less sensitive than the scheduler, and changes to operators
> > > >> > >>> >> > should/could be deployed at will, independently from the main
> > > >> > >>> >> > package. People in their environment could upgrade only certain
> > > >> > >>> >> > packages when needed. Travis builds would be more targeted, and
> > > >> > >>> >> > take less time, ...
> > >> > >>> >> >
> > > >> > >>> >> > Also, the whole current "extras_require" approach to optional
> > > >> > >>> >> > dependencies (in setup.py) is kind of getting out-of-hand.
> > >> > >>> >> >
> > > >> > >>> >> > Of course `pip install airflow` would bring in a collection of
> > > >> > >>> >> > sub-packages similar in functionality to what it does now, perhaps
> > > >> > >>> >> > without so many operators you probably don't need in your
> > > >> > >>> >> > environment.
> > >> > >>> >> >
> > > >> > >>> >> > The release process is the main pain-point and the biggest risk
> > > >> > >>> >> > for the project, and I feel like this is a solid solution to
> > > >> > >>> >> > address it.
> > >> > >>> >> > Max
> > >> > >>> >> >
> > >> > >>> >>
> > >> > >>>
> > >> > >>
> > >> > >>
> > >> >
> > >> --
> > >>
> > >> Sergei
> > >>
> > >
> > >
> >
>
