airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Coder <jcode...@gmail.com>
Subject Re: Setting to add choice of schedule at end or schedule at start of interval
Date Thu, 26 Sep 2019 10:54:11 GMT
I think that will lead to a very large number of questions about why it worked before and now
it doesn’t when doing a clean install. 
And additionally, if developing in a new install and deploying to an old install, you would
get different behavior. Adding to more confusion. 

James Coder

> On Sep 26, 2019, at 6:41 AM, Ash Berlin-Taylor <ash@apache.org> wrote:
> 
> I'm wondering if there is some way we can do this so that new installs will pick up the
new default, but anyone that carries over an Airflow.cfg from an old install will keep their
existing behaviour.
> 
> And then also is that a good/helpful idea or will that be more confusing than not?
> 
> -a
> 
>> On 26 Sep 2019, at 11:40, Kaxil Naik <kaxilnaik@gmail.com> wrote:
>> 
>> I definitely agree. If we don't update it in 2.0 it is going to be hard to
>> change that in any 2.x versions
>> 
>> On Thu, Sep 26, 2019 at 10:51 AM James Meickle
>> <jmeickle@quantopian.com.invalid> wrote:
>> 
>>> I am *strongly* in favor of using the 2.0 update to break compat here,
>>> because this is a very confusing feature to most new users of Airflow, but
>>> also will break a _lot_ of DAGs. I feel like if we don't change this in 2.0
>>> we probably won't for any 2.x either, which would be a shame.
>>> 
>>>> On Wed, Sep 25, 2019 at 8:33 PM Kaxil Naik <kaxilnaik@gmail.com> wrote:
>>>> 
>>>> I agree with Dan to change the default execution at start of the
>>> interval.
>>>> 
>>>> How about adding this for 2.0 ??
>>>> 
>>>> Don't want to keep delaying this if we have a consensus already.
>>>> 
>>>> Regards,
>>>> Kaxil
>>>> 
>>>> 
>>>> On Fri, Aug 23, 2019, 15:39 Dan Davydov <ddavydov@twitter.com.invalid>
>>>> wrote:
>>>> 
>>>>> What are people's feelings on changing the default execution to
>>> schedule
>>>>> interval start and communicating this to existing users in the Updating
>>>>> notes so that they can preserve the old behavior? Could potentially
>>> cause
>>>>> headaches for users who don't read the notes but I think it might make
>>>>> sense to bite the bullet at some point for more intuitive behavior
>>>> overall
>>>>> for new users.
>>>>> 
>>>>> On Fri, Aug 23, 2019 at 10:29 AM Dan Davydov <ddavydov@twitter.com>
>>>> wrote:
>>>>> 
>>>>>> I am for this change, since I feel like in general the start of the
>>>>>> interval is more intuitive (I have been working on Airflow for 3
>>> years
>>>>> and
>>>>>> this still trips me up). That being said I'm not sure how I feel
>>> about
>>>>>> allowing customization at DAG level instead of cluster level as it
>>>> makes
>>>>> it
>>>>>> harder to make assumptions about DAGs on the cluster for ops, though
>>>>> maybe
>>>>>> this isn't a huge deal given there are tools available that show
you
>>>> why
>>>>>> tasks aren't running.
>>>>>> 
>>>>>> I agree with Bole that we should communicate recommended migration
>>>>>> strategies if they can't be done automatically.
>>>>>> 
>>>>>> I don't think I'm a fan for arbitrary customization of the interval
>>>> via a
>>>>>> callback, my feeling is this would not provide significant value
and
>>>>> could
>>>>>> be an ops nightmare.
>>>>>> 
>>>>>> On Fri, Aug 23, 2019 at 9:11 AM Jarek Potiuk <
>>> Jarek.Potiuk@polidea.com
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> DST: I recall problems with DST especially when the hour goes
back
>>> and
>>>>> the
>>>>>>> daily schedule time technically occurs twice the same day or
does
>>> not
>>>>>>> occur
>>>>>>> at all. We have some code that chooses arbitrary the first occurence
>>>> in
>>>>>>> the
>>>>>>> latter case (there was a problem that it worked differently python
>>> 3.6
>>>>> vs
>>>>>>> 3.5 (!). But also the case when we move forward is an interesting
>>>> one. I
>>>>>>> am
>>>>>>> not 100% it will work correctly after changing the scheduling
>>>> mechanisms
>>>>>>> but it's rather easy to test and there is no harm adding it.
>>>>>>> There is a DST-specific logic implemented in our next/previous
run
>>>>>>> calculation and I imagine it could get wrong.
>>>>>>> 
>>>>>>> The tests I am talking about:
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> DagTest.test_following_previous_schedule_daily_dag_CEST_to_CET/DagTest.test_following_previous_schedule_daily_dag_CET_to_CEST.
>>>>>>> 
>>>>>>> Re: arbitrary customisation/converting DAGs. I think there is
no
>>> need
>>>> to
>>>>>>> convert existing dags - the default behaviour remains as it is
as
>>> far
>>>>> as I
>>>>>>> understand. And this flag is much simpler to understand and reason
>>>> about
>>>>>>> than arbitrary function and it corresponds to real business cases:
>>>>>>> 
>>>>>>> 1) schedule_at_interval_end = True -> wait for the data to
be ready
>>>> for
>>>>>>> the
>>>>>>> interval (current/default behaviour related to processing batches
of
>>>>> data)
>>>>>>> 2) schedule_at_interval_end = False -> CRON-like behaviour
where we
>>>>> simply
>>>>>>> run arbitrary operation in regular intervals (more intuitive
for
>>>> people
>>>>>>> who
>>>>>>> are used to CRON-like jobs)
>>>>>>> 
>>>>>>> You can always build your schedule differently if you need something
>>>>>>> "in-between" IMHO.
>>>>>>> 
>>>>>>> J.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Aug 23, 2019 at 8:44 AM James Meickle
>>>>>>> <jmeickle@quantopian.com.invalid> wrote:
>>>>>>> 
>>>>>>>> This is a change to one of Airflow's core concepts, and it
would
>>>>>>> require a
>>>>>>>> lot of work for existing DAGs to cut over to it. Given that,
my
>>>>> personal
>>>>>>>> preference would be to allow arbitrary customization rather
than
>>>> just
>>>>> a
>>>>>>> bit
>>>>>>>> toggle. Such as allowing passing in a mapping function: given
an
>>>>>>> interval's
>>>>>>>> start date and end date, when should it be executed?
>>>>>>>> 
>>>>>>>> On Fri, Aug 23, 2019 at 8:24 AM Jarek Potiuk <
>>>>> Jarek.Potiuk@polidea.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Happy for it as well. There are a number of cases where
>>> scheduling
>>>>> at
>>>>>>>> start
>>>>>>>>> makes more sense and as we see Airflow is used now in
multiple
>>>> cases
>>>>>>>> where
>>>>>>>>> there is no need to process data from an interval and
wait until
>>>>> that
>>>>>>>> data
>>>>>>>>> is ready.
>>>>>>>>> But indeed some more tests would be great - especially
for edge
>>>>> cases.
>>>>>>>>> Changig mid-air is one but I think there should be test
about
>>>>> Daylight
>>>>>>>>> Saving Time changing.
>>>>>>>>> There are some tests for DST so they just need to be
extended to
>>>>> cover
>>>>>>>>> those two different cases.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> J.
>>>>>>>>> 
>>>>>>>>> On Fri, Aug 23, 2019 at 7:37 AM Kaxil Naik <kaxilnaik@gmail.com
>>>> 
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Happy for this feature to merged
>>>>>>>>>> 
>>>>>>>>>> On Fri, Aug 23, 2019, 11:49 Ash Berlin-Taylor <ash@apache.org
>>>> 
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> This has come up a few times before, someone
has now opened
>>> a
>>>> PR
>>>>>>> that
>>>>>>>>>>> makes this a global+per-dag setting:
>>>>>>>>>>> https://github.com/apache/airflow/pull/5787 and
it also
>>>>> includes
>>>>>>>> docs
>>>>>>>>>>> that I think does a good job of illustrating
the two modes.
>>>>>>>>>>> 
>>>>>>>>>>> Does anyone object to this being merged? If no
one says
>>>> anything
>>>>>>> by
>>>>>>>>>> midday
>>>>>>>>>>> on Tuesday I will take that as assent and will
merge it.
>>>>>>>>>>> 
>>>>>>>>>>> The docs from the PR included below.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ash
>>>>>>>>>>> 
>>>>>>>>>>> Scheduled Time vs Execution Time
>>>>>>>>>>> ''''''''''''''''''''''''''''''''
>>>>>>>>>>> 
>>>>>>>>>>> A DAG with a ``schedule_interval`` will execute
once per
>>>>>>> interval. By
>>>>>>>>>>> default, the execution of a DAG will occur at
the **end** of
>>>> the
>>>>>>>>>>> schedule interval.
>>>>>>>>>>> 
>>>>>>>>>>> A few examples:
>>>>>>>>>>> 
>>>>>>>>>>> - A DAG with ``schedule_interval='@hourly'``:
The DAG run
>>> that
>>>>>>>>> processes
>>>>>>>>>>> 2019-08-16 17:00 will start running just after
2019-08-16
>>>>>>> 17:59:59,
>>>>>>>>>>> i.e. once that hour is over.
>>>>>>>>>>> - A DAG with ``schedule_interval='@daily'``:
The DAG run
>>> that
>>>>>>>> processes
>>>>>>>>>>> 2019-08-16 will start running shortly after 2019-08-17
>>> 00:00.
>>>>>>>>>>> 
>>>>>>>>>>> The reasoning behind this execution vs scheduling
behaviour
>>> is
>>>>>>> that
>>>>>>>>>>> data for the interval to be processed won't be
fully
>>> available
>>>>>>> until
>>>>>>>>>>> the interval has elapsed.
>>>>>>>>>>> 
>>>>>>>>>>> In cases where you wish the DAG to be executed
at the
>>>> **start**
>>>>> of
>>>>>>>> the
>>>>>>>>>>> interval, specify ``schedule_at_interval_end=False``,
either
>>>> in
>>>>>>>>>>> ``airflow.cfg``, or on a per-DAG basis.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>> Jarek Potiuk
>>>>>>>>> Polidea <https://www.polidea.com/> | Principal
Software
>>> Engineer
>>>>>>>>> 
>>>>>>>>> M: +48 660 796 129 <+48660796129>
>>>>>>>>> [image: Polidea] <https://www.polidea.com/>
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Jarek Potiuk
>>>>>>> Polidea <https://www.polidea.com/> | Principal Software
Engineer
>>>>>>> 
>>>>>>> M: +48 660 796 129 <+48660796129>
>>>>>>> [image: Polidea] <https://www.polidea.com/>
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
> 

Mime
View raw message