airflow-dev mailing list archives

From Kaxil Naik <kaxiln...@gmail.com>
Subject Re: Setting to add choice of schedule at end or schedule at start of interval
Date Thu, 26 Sep 2019 10:40:08 GMT
I definitely agree. If we don't change it in 2.0, it is going to be hard to
change in any 2.x version.

On Thu, Sep 26, 2019 at 10:51 AM James Meickle
<jmeickle@quantopian.com.invalid> wrote:

> I am *strongly* in favor of using the 2.0 update to break compat here. This
> is a very confusing feature for most new users of Airflow, but changing it
> will also break a _lot_ of DAGs. I feel like if we don't change this in 2.0
> we probably won't in any 2.x either, which would be a shame.
>
> On Wed, Sep 25, 2019 at 8:33 PM Kaxil Naik <kaxilnaik@gmail.com> wrote:
>
> > I agree with Dan that we should change the default to execute at the
> > start of the interval.
> >
> > How about adding this for 2.0?
> >
> > Don't want to keep delaying this if we have a consensus already.
> >
> > Regards,
> > Kaxil
> >
> >
> > On Fri, Aug 23, 2019, 15:39 Dan Davydov <ddavydov@twitter.com.invalid>
> > wrote:
> >
> > > What are people's feelings on changing the default execution to the
> > > schedule interval start, and communicating this to existing users in the
> > > Updating notes so that they can preserve the old behavior? It could
> > > cause headaches for users who don't read the notes, but I think it might
> > > make sense to bite the bullet at some point for more intuitive behavior
> > > overall for new users.
> > >
> > > On Fri, Aug 23, 2019 at 10:29 AM Dan Davydov <ddavydov@twitter.com>
> > > wrote:
> > >
> > > > I am for this change, since I feel like in general the start of the
> > > > interval is more intuitive (I have been working on Airflow for 3 years
> > > > and this still trips me up). That being said, I'm not sure how I feel
> > > > about allowing customization at DAG level instead of cluster level, as
> > > > it makes it harder for ops to make assumptions about DAGs on the
> > > > cluster, though maybe this isn't a huge deal given there are tools
> > > > available that show you why tasks aren't running.
> > > >
> > > > I agree with Bole that we should communicate recommended migration
> > > > strategies if they can't be done automatically.
> > > >
> > > > I don't think I'm a fan of arbitrary customization of the interval via
> > > > a callback; my feeling is this would not provide significant value and
> > > > could be an ops nightmare.
> > > >
> > > > On Fri, Aug 23, 2019 at 9:11 AM Jarek Potiuk <Jarek.Potiuk@polidea.com>
> > > > wrote:
> > > >
> > > > >> DST: I recall problems with DST, especially when the hour goes back
> > > > >> and the daily schedule time technically occurs twice on the same day
> > > > >> or does not occur at all. We have some code that arbitrarily chooses
> > > > >> the first occurrence in the latter case (there was a problem that it
> > > > >> worked differently on Python 3.6 vs 3.5 (!)). But the case when we
> > > > >> move forward is also an interesting one. I am not 100% sure it will
> > > > >> work correctly after changing the scheduling mechanisms, but it's
> > > > >> rather easy to test and there is no harm in adding it.
> > > > >> There is DST-specific logic implemented in our next/previous run
> > > > >> calculation and I imagine it could go wrong.
> > > >>
> > > > >> The tests I am talking about:
> > > > >>
> > > > >> DagTest.test_following_previous_schedule_daily_dag_CEST_to_CET
> > > > >> DagTest.test_following_previous_schedule_daily_dag_CET_to_CEST
> > > >>
> > > > >> Re: arbitrary customisation/converting DAGs. I think there is no need
> > > > >> to convert existing DAGs - the default behaviour remains as it is, as
> > > > >> far as I understand. And this flag is much simpler to understand and
> > > > >> reason about than an arbitrary function, and it corresponds to real
> > > > >> business cases:
> > > > >>
> > > > >> 1) schedule_at_interval_end = True -> wait for the data to be ready
> > > > >>    for the interval (current/default behaviour, related to processing
> > > > >>    batches of data)
> > > > >> 2) schedule_at_interval_end = False -> CRON-like behaviour where we
> > > > >>    simply run an arbitrary operation at regular intervals (more
> > > > >>    intuitive for people who are used to CRON-like jobs)
> > > > >>
> > > > >> You can always build your schedule differently if you need something
> > > > >> "in-between" IMHO.
> > > >>
> > > >> J.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Fri, Aug 23, 2019 at 8:44 AM James Meickle
> > > >> <jmeickle@quantopian.com.invalid> wrote:
> > > >>
> > > > >> > This is a change to one of Airflow's core concepts, and it would
> > > > >> > require a lot of work for existing DAGs to cut over to it. Given
> > > > >> > that, my personal preference would be to allow arbitrary
> > > > >> > customization rather than just a bit toggle, such as allowing
> > > > >> > passing in a mapping function: given an interval's start date and
> > > > >> > end date, when should it be executed?
> > > >> >
> > > > >> > On Fri, Aug 23, 2019 at 8:24 AM Jarek Potiuk
> > > > >> > <Jarek.Potiuk@polidea.com> wrote:
> > > > >> >
> > > > >> > > Happy for it as well. There are a number of cases where scheduling
> > > > >> > > at start makes more sense, and as we see, Airflow is now used in
> > > > >> > > multiple cases where there is no need to process data from an
> > > > >> > > interval and wait until that data is ready.
> > > > >> > > But indeed some more tests would be great - especially for edge
> > > > >> > > cases. Changing mid-air is one, but I think there should also be a
> > > > >> > > test for Daylight Saving Time changes.
> > > > >> > > There are some tests for DST, so they just need to be extended to
> > > > >> > > cover those two different cases.
> > > >> > >
> > > >> > >
> > > >> > > J.
> > > >> > >
> > > > >> > > On Fri, Aug 23, 2019 at 7:37 AM Kaxil Naik <kaxilnaik@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Happy for this feature to be merged
> > > >> > > >
> > > > >> > > > On Fri, Aug 23, 2019, 11:49 Ash Berlin-Taylor <ash@apache.org>
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > This has come up a few times before. Someone has now opened a PR
> > > > >> > > > > that makes this a global+per-DAG setting:
> > > > >> > > > > https://github.com/apache/airflow/pull/5787 and it also includes
> > > > >> > > > > docs that I think do a good job of illustrating the two modes.
> > > > >> > > > >
> > > > >> > > > > Does anyone object to this being merged? If no one says anything
> > > > >> > > > > by midday on Tuesday I will take that as assent and will merge it.
> > > > >> > > > >
> > > > >> > > > > The docs from the PR are included below.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > > Ash
> > > >> > > > >
> > > >> > > > > Scheduled Time vs Execution Time
> > > >> > > > > ''''''''''''''''''''''''''''''''
> > > >> > > > >
> > > > >> > > > > A DAG with a ``schedule_interval`` will execute once per
> > > > >> > > > > interval. By default, the execution of a DAG will occur at the
> > > > >> > > > > **end** of the schedule interval.
> > > > >> > > > >
> > > > >> > > > > A few examples:
> > > > >> > > > >
> > > > >> > > > > - A DAG with ``schedule_interval='@hourly'``: The DAG run that
> > > > >> > > > >   processes 2019-08-16 17:00 will start running just after
> > > > >> > > > >   2019-08-16 17:59:59, i.e. once that hour is over.
> > > > >> > > > > - A DAG with ``schedule_interval='@daily'``: The DAG run that
> > > > >> > > > >   processes 2019-08-16 will start running shortly after
> > > > >> > > > >   2019-08-17 00:00.
> > > > >> > > > >
> > > > >> > > > > The reasoning behind this execution vs scheduling behaviour is
> > > > >> > > > > that data for the interval to be processed won't be fully
> > > > >> > > > > available until the interval has elapsed.
> > > > >> > > > >
> > > > >> > > > > In cases where you wish the DAG to be executed at the **start**
> > > > >> > > > > of the interval, specify ``schedule_at_interval_end=False``,
> > > > >> > > > > either in ``airflow.cfg``, or on a per-DAG basis.
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > >
> > > >> > > Jarek Potiuk
> > > >> > > Polidea <https://www.polidea.com/> | Principal Software
> Engineer
> > > >> > >
> > > >> > > M: +48 660 796 129 <+48660796129>
> > > >> > > [image: Polidea] <https://www.polidea.com/>
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> Jarek Potiuk
> > > >> Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >>
> > > >> M: +48 660 796 129 <+48660796129>
> > > >> [image: Polidea] <https://www.polidea.com/>
> > > >>
> > > >
> > >
> >
>
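
For reference, the two modes described in the quoted docs, and the
mapping-function alternative raised earlier in the thread, can be sketched in
plain Python. This is only an illustration under stated assumptions, not
Airflow code: `earliest_start`, `at_interval_end`, `at_interval_start` and
`RunTimeFn` are hypothetical names, and only the `schedule_at_interval_end`
flag name comes from the proposal.

```python
from datetime import datetime, timedelta
from typing import Callable

# Hypothetical type for the "mapping function" idea: given an interval's
# start and end, return the moment the run may begin.
RunTimeFn = Callable[[datetime, datetime], datetime]


def at_interval_end(start: datetime, end: datetime) -> datetime:
    """Batch-style: wait until the interval has fully elapsed."""
    return end


def at_interval_start(start: datetime, end: datetime) -> datetime:
    """Cron-style: run as soon as the interval opens."""
    return start


def earliest_start(
    interval_start: datetime,
    interval: timedelta,
    schedule_at_interval_end: bool = True,
) -> datetime:
    """Earliest run time for one interval, selected by the boolean flag."""
    pick: RunTimeFn = (
        at_interval_end if schedule_at_interval_end else at_interval_start
    )
    return pick(interval_start, interval_start + interval)


# The two examples from the docs (default: schedule at interval end).
hourly = earliest_start(datetime(2019, 8, 16, 17, 0), timedelta(hours=1))
daily = earliest_start(datetime(2019, 8, 16), timedelta(days=1))
# The same daily interval with the flag turned off.
cron_like = earliest_start(
    datetime(2019, 8, 16), timedelta(days=1), schedule_at_interval_end=False
)

print(hourly)     # 2019-08-16 18:00:00
print(daily)      # 2019-08-17 00:00:00
print(cron_like)  # 2019-08-16 00:00:00
```

Seen this way, the boolean flag is the special case of the mapping function
that only ever returns one of the two interval boundaries, which is the
trade-off discussed in the thread between a simple toggle and arbitrary
customization.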

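The DST edge cases mentioned in the thread (a daily schedule time that occurs
twice, or not at all, on a transition day) can be reproduced with the standard
library alone. The sketch below is not the logic Airflow uses; it assumes
Python 3.9+ with a time zone database available, picks `Europe/Amsterdam` as
an arbitrary CET/CEST zone, and `classify_wall_time` and `elapsed_hours` are
hypothetical helper names.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs a tz database installed

AMSTERDAM = ZoneInfo("Europe/Amsterdam")  # assumed zone that observes CET/CEST


def classify_wall_time(naive: datetime, tz: ZoneInfo) -> str:
    """Classify a naive local time as 'normal', 'ambiguous' or 'missing'."""
    first = naive.replace(tzinfo=tz, fold=0).utcoffset()
    second = naive.replace(tzinfo=tz, fold=1).utcoffset()
    if first == second:
        return "normal"
    # Clocks going back: fold=0 is the earlier reading, with the larger offset.
    # Clocks going forward: fold=0 uses the smaller, pre-transition offset.
    return "ambiguous" if first > second else "missing"


def elapsed_hours(a: datetime, b: datetime) -> float:
    """Real elapsed hours between two aware datetimes.

    Both are converted to UTC first, because subtracting two datetimes that
    share the same tzinfo compares wall-clock values and ignores DST.
    """
    delta = b.astimezone(timezone.utc) - a.astimezone(timezone.utc)
    return delta.total_seconds() / 3600


autumn = classify_wall_time(datetime(2019, 10, 27, 2, 30), AMSTERDAM)
spring = classify_wall_time(datetime(2019, 3, 31, 2, 30), AMSTERDAM)
usual = classify_wall_time(datetime(2019, 8, 16, 2, 30), AMSTERDAM)

# Length of the "daily" interval that contains the CEST -> CET transition.
day_length = elapsed_hours(
    datetime(2019, 10, 27, tzinfo=AMSTERDAM),
    datetime(2019, 10, 28, tzinfo=AMSTERDAM),
)

print(autumn)      # ambiguous
print(spring)      # missing
print(usual)       # normal
print(day_length)  # 25.0
```

A test for either scheduling mode would need to pin down which of the two
02:30 readings is used in autumn and what happens to the missing 02:30 in
spring, for both the interval start and the interval end.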