From: Gerard Toonstra
Date: Wed, 26 Jun 2019 21:45:10 +0200
Subject: Re: SLA semantics
To: dev@airflow.apache.org

That's not my experience of how SLAs
work at the moment. I've observed this to currently work as:

1. An SLA is configured as the "time delta" after some DAG execution
   schedule.
2. The SLA is configured at task level, so any tasks that are still
   running or still need to run after the "time delta" will be
   aggregated together in one "SLA email".
3. The email is sent only once, at the time the SLA is missed in the
   "dag run".
4. The email is sent by the scheduler, not some worker.

What I did notice:

* If the scheduler cannot contact an email server, it will delay the
  scheduler loop.
* As the emails do not get sent, it will try again the next time the
  DAG configured with an SLA gets parsed, thus again impacting the
  scheduler loop.
* If the SLA emails fail for a while and later succeed, you get a huge
  email with everything combined.

What we decided is not to rely on Airflow SLAs, but to enforce and
detect SLAs externally, based on success/fail metadata that we receive
from Airflow. The rationale is:

* we want better insight into when workflows (DAGs) are completed
  anyway, so we wanted DAG completion data available outside the
  Airflow DB;
* we want to avoid any negative impact on the main scheduler loop due
  to mailing system availability.

On Wed, Jun 26, 2019 at 9:18 PM Andrew Stahlman wrote:

> Hi all,
>
> I'm looking to get some clarity on the intended behavior for
> SLAs. This has come up several times in the past, but as far as I can
> tell there hasn't been a definitive answer. As pointed out in
> https://issues.apache.org/jira/browse/AIRFLOW-249 (open for several
> years now):
>
>     the SLA logic is only being fired after following_schedule + sla
>     has elapsed, in other words one has to wait for the next TI before
>     having a chance of getting any email.
>     Also the email reports
>     dag.following_schedule time (I guess because it is close of
>     TI.start_date), but unfortunately that doesn't match what the task
>     instances shows nor the log filename
>
> Example: Consider a TI from a @daily DAG with execution date of Monday
> at 00:00. It will start executing soon after Tuesday 00:00. If I set
> the SLA to 5 minutes, I would expect an SlaMiss to be created at
> Tuesday 00:05, but it's actually not created until *Wednesday* 00:05.
>
> I find this behavior very surprising, and it seems I'm not the only
> one (see [1], [2]). Can someone confirm whether this is really the
> desired behavior?
>
> I think removing a single line [3] from the manage_slas implementation
> would bring the behavior in line with what I expected - namely, that
> an SlaMiss will be created based on:
>
>     execution_date + schedule_interval + sla
>
> ...as opposed to the current behavior of:
>
>     execution_date + (2 * schedule_interval) + sla
>
> I'd be happy to open a PR for that if we reach consensus on the
> desired behavior.
>
> Thanks,
> Andrew
>
> [1] https://stackoverflow.com/questions/44071519/how-to-set-a-sla-in-airflow?rq=1
> [2] https://issues.apache.org/jira/browse/AIRFLOW-2781
> [3] https://github.com/apache/incubator-airflow/blob/6afb12f0e5c18e8634daa0119d6e5797aa770b80/airflow/jobs/scheduler_job.py#L425
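
For illustration, the two SlaMiss formulas from the quoted message can be
worked through with plain Python datetime arithmetic. This is only a sketch
of the arithmetic: sla_miss_times is a hypothetical helper, not part of
Airflow, and the concrete date is an arbitrary Monday chosen to match the
@daily example with a 5-minute SLA.

```python
from datetime import datetime, timedelta


def sla_miss_times(execution_date, schedule_interval, sla):
    """Hypothetical helper (not an Airflow API).

    Returns a pair of datetimes:
      expected - execution_date + schedule_interval + sla
      observed - execution_date + (2 * schedule_interval) + sla
    """
    expected = execution_date + schedule_interval + sla
    observed = execution_date + 2 * schedule_interval + sla
    return expected, observed


# Assumed example values: Monday 2019-06-24 00:00, @daily, 5-minute SLA.
execution_date = datetime(2019, 6, 24, 0, 0)
expected, observed = sla_miss_times(
    execution_date,
    schedule_interval=timedelta(days=1),
    sla=timedelta(minutes=5),
)

print("expected SlaMiss:", expected.isoformat())  # 2019-06-25T00:05:00 (Tuesday)
print("observed SlaMiss:", observed.isoformat())  # 2019-06-26T00:05:00 (Wednesday)
```

The gap between the two results is exactly one schedule_interval, which is
the extra day described in the example.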