airflow-dev mailing list archives

From Gerard Toonstra <gtoons...@gmail.com>
Subject Re: SLA semantics
Date Wed, 26 Jun 2019 19:45:10 GMT
That's not my experience of how SLAs work at the moment. I've observed
them to currently work as follows:

1. An SLA is configured as the "time delta" after some dag execution
schedule.
2. The SLA is configured at task level, so any tasks that are still running
or have yet to run after the "time delta" are aggregated together in one
"SLA email".
3. The email is sent only once, at the time the SLA is missed in the "dag run".
4. The email is sent by the scheduler, not some worker.

What I did notice:

* If the scheduler cannot contact an email server, it will delay the
scheduler loop.
* As the emails do not get sent, it will try again next time the dag
configured with an SLA gets parsed, thus again impacting the scheduler loop.
* If sending the SLA emails fails for a while and later succeeds, you get
one huge email with everything combined.

What we decided is not to rely on Airflow SLAs, but to enforce and detect
SLAs externally, based on the success/fail metadata that we receive from
Airflow.

The rationale is:
* we want better insight into when workflows (dags) complete anyway, so we
wanted dag completion data available outside the airflow db,
* we want to avoid any negative impact on the main scheduler loop due to
mailing system availability.
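
To make the external approach a bit more concrete, here is a minimal sketch of
what such a check could look like. It is not our actual implementation and it
uses no Airflow API: `DagRunRecord`, `find_sla_misses` and the field names are
hypothetical, and it assumes the success/fail metadata has already been
exported somewhere outside the airflow db.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DagRunRecord:
    """Hypothetical record built from metadata exported by Airflow."""
    dag_id: str
    scheduled_for: datetime          # when the run was due to start
    finished_at: Optional[datetime]  # None if the run has not finished
    succeeded: bool


def find_sla_misses(records: List[DagRunRecord],
                    sla: timedelta,
                    now: datetime) -> List[DagRunRecord]:
    """Return runs that did not succeed within `sla` of their schedule."""
    misses = []
    for record in records:
        deadline = record.scheduled_for + sla
        on_time = (
            record.succeeded
            and record.finished_at is not None
            and record.finished_at <= deadline
        )
        # Only report a miss once the deadline has actually passed.
        if not on_time and now > deadline:
            misses.append(record)
    return misses


if __name__ == "__main__":
    start = datetime(2019, 6, 25)
    records = [
        DagRunRecord("on_time_dag", start, start + timedelta(minutes=3), True),
        DagRunRecord("late_dag", start, start + timedelta(minutes=9), True),
        DagRunRecord("stuck_dag", start, None, False),
    ]
    late = find_sla_misses(records, timedelta(minutes=5),
                           now=start + timedelta(minutes=10))
    print([r.dag_id for r in late])
```

Because a check like this runs outside the scheduler, an unreachable mail
server only delays the alert and not the scheduler loop.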


On Wed, Jun 26, 2019 at 9:18 PM Andrew Stahlman <astahlman@lyft.com.invalid>
wrote:

> Hi all,
>
> I'm looking to get some clarity on the intended behavior for
> SLAs. This has come up several times in the past, but as far as I can
> tell there hasn't been a definitive answer. As pointed out in
> https://issues.apache.org/jira/browse/AIRFLOW-249 (open for several
> years now):
>
>     the SLA logic is only being fired after following_schedule + sla
>     has elapsed, in other words one has to wait for the next TI before
>     having a chance of getting any email. Also the email reports
>     dag.following_schedule time (I guess because it is close of
>     TI.start_date), but unfortunately that doesn't match what the task
>     instances shows nor the log filename
>
> Example: Consider a TI from a @daily DAG with execution date of Monday
> at 00:00. It will start executing soon after Tuesday 00:00. If I set
> the SLA to 5 minutes, I would expect an SlaMiss to be created at
> Tuesday 00:05, but it's actually not created until *Wednesday* 00:05.
>
> I find this behavior very surprising, and it seems I'm not the only
> one (see [1], [2]). Can someone confirm whether this is really the
> desired behavior?
>
> I think removing a single line [3] from the manage_slas implementation
> would bring the behavior in line with what I expected - namely, that
> an SlaMiss will be created based on:
>
>     execution_date + schedule_interval + sla
>
> ...as opposed to the current behavior of:
>
>     execution_date + (2 * schedule_interval) + sla
>
> I'd be happy to open a PR for that if we reach consensus on the
> desired behavior.
>
> Thanks,
> Andrew
>
> [1] https://stackoverflow.com/questions/44071519/how-to-set-a-sla-in-airflow?rq=1
> [2] https://issues.apache.org/jira/browse/AIRFLOW-2781
> [3] https://github.com/apache/incubator-airflow/blob/6afb12f0e5c18e8634daa0119d6e5797aa770b80/airflow/jobs/scheduler_job.py#L425
>
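
The two formulas in the quoted message can be checked with plain datetime
arithmetic. The sketch below only reproduces the @daily example from that
message (execution date Monday 00:00, SLA of 5 minutes); the function names
are made up for illustration and it does not call any Airflow code.

```python
from datetime import datetime, timedelta


def expected_sla_deadline(execution_date, schedule_interval, sla):
    """Deadline the quoted message expects: execution_date + schedule_interval + sla."""
    return execution_date + schedule_interval + sla


def observed_sla_deadline(execution_date, schedule_interval, sla):
    """Deadline the quoted message reports: execution_date + (2 * schedule_interval) + sla."""
    return execution_date + 2 * schedule_interval + sla


execution_date = datetime(2019, 6, 24)   # a Monday at 00:00
schedule_interval = timedelta(days=1)    # @daily
sla = timedelta(minutes=5)

expected = expected_sla_deadline(execution_date, schedule_interval, sla)
observed = observed_sla_deadline(execution_date, schedule_interval, sla)

print(expected.strftime("%A %H:%M"))  # Tuesday 00:05
print(observed.strftime("%A %H:%M"))  # Wednesday 00:05
```

The gap between the two is exactly one schedule_interval, which is why the
miss in the example shows up a full day later than expected.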
