Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 2AD1A2009FB for ; Fri, 6 May 2016 23:02:32 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 29611160A0C; Fri, 6 May 2016 21:02:32 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 71FC01608F8 for ; Fri, 6 May 2016 23:02:31 +0200 (CEST) Received: (qmail 22892 invoked by uid 500); 6 May 2016 21:02:30 -0000 Mailing-List: contact dev-help@airflow.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@airflow.incubator.apache.org Delivered-To: mailing list dev@airflow.incubator.apache.org Received: (qmail 22881 invoked by uid 99); 6 May 2016 21:02:30 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 06 May 2016 21:02:30 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 33C75C2ADD for ; Fri, 6 May 2016 21:02:30 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -4.099 X-Spam-Level: X-Spam-Status: No, score=-4.099 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=2, KAM_LAZY_DOMAIN_SECURITY=1, RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RP_MATCHES_RCVD=-2.079] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id JdvumygFrfSF for ; Fri, 6 May 2016 21:02:28 +0000 (UTC) Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with SMTP id 7DEC45F295 for ; Fri, 6 May 2016 21:02:27 +0000 (UTC) Received: (qmail 22770 invoked by uid 99); 6 May 2016 21:02:26 -0000 Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 06 May 2016 21:02:26 +0000 Received: from mail-io0-f182.google.com (mail-io0-f182.google.com [209.85.223.182]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id 62F741A010F for ; Fri, 6 May 2016 21:02:26 +0000 (UTC) Received: by mail-io0-f182.google.com with SMTP id f89so125580030ioi.0 for ; Fri, 06 May 2016 14:02:26 -0700 (PDT) X-Gm-Message-State: AOPr4FUecR4108bsg9MUV09iPtaLmuWHB+bUHGWiS/Pn9EMJP0kxs8ZJwKVP8gtst4z0hqFAoVpFbW1Nd/SZMQ== MIME-Version: 1.0 X-Received: by 10.107.8.19 with SMTP id 19mr27224579ioi.60.1462568545730; Fri, 06 May 2016 14:02:25 -0700 (PDT) Received: by 10.64.227.50 with HTTP; Fri, 6 May 2016 14:02:25 -0700 (PDT) In-Reply-To: References: <2DD52EF7-8A30-4B80-940E-9FB119C7F868@gmail.com> Date: Fri, 6 May 2016 14:02:25 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: 1.7.1 release blocker: [AIRFLOW-56] Airflow's scheduler can "lose" queued tasks From: Chris Riccomini To: dev@airflow.incubator.apache.org Content-Type: multipart/alternative; boundary=001a113eae10a0d0f9053232c6ed archived-at: Fri, 06 May 2016 21:02:32 -0000 --001a113eae10a0d0f9053232c6ed Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Same :) On Fri, May 6, 2016 at 1:57 PM, Arthur Wiedmer wrote: > Good, because that's the version we use :) > > On Fri, May 6, 2016 at 1:55 PM, Bolke de Bruin wrote: > > > It seems it does: https://dev.mysql.com/doc/refman/5.6/en/select.html > > > > Bolke > > > > Sent from my iPhone > > > > On 6 mei 2016, at 22:52, Chris Riccomini wrote: > > > > >> Postgres, Mysql >=3D 5.7) > > > > > > Curious--does this work in MySQL 5.6? > > > > > >> On Fri, May 6, 2016 at 1:36 PM, Bolke de Bruin > > wrote: > > >> > > >> Dear All, > > >> > > >> One of the remaining issues to resolve is Airflow-56: the fact that > > >> airflow=E2=80=99s scheduler can =E2=80=9Close=E2=80=9D queued tasks.= One of the recent changes > > was > > >> the hold queue status in the executor instead using the database. > While > > >> this goes against the =E2=80=9Cdatabase knows the state=E2=80=9D, it= was done (afaik) > to > > >> prevent a race condition when a scheduler might actually set the tas= k > > twice > > >> for execution as the task might not have updated its state in the > > database > > >> yet. However, if the executor loses its queue for some reason the > tasks > > >> that have a state of queued in the database will never get picked up= . > > >> Airflow-56 contains a DAG that will exhibit this behavior on clean > > install > > >> of 1.7.1rc3 with mysql and local executor. > > >> > > >> Jeremiah has worked on > > >> https://github.com/apache/incubator-airflow/pull/1378, that basicall= y > > >> reverts the behavior to using the database (pre 1.7.0), but contains > > some > > >> additional fixes. In addition I have asked him to include the > following: > > >> > > >> @provide_session > > >> def refresh_from_db(self, session=3DNone, lock_for_update=3DFalse): > > >> """ > > >> Refreshes the task instance from the database based on the primar= y > > key > > >> """ > > >> TI =3D TaskInstance > > >> > > >> if lock_for_update: > > >> ti =3D session.query(TI).filter( > > >> TI.dag_id =3D=3D self.dag_id, > > >> TI.task_id =3D=3D self.task_id, > > >> TI.execution_date =3D=3D self.execution_date, > > >> ).with_for_update().first() > > >> else: > > >> ti =3D session.query(TI).filter( > > >> TI.dag_id =3D=3D self.dag_id, > > >> TI.task_id =3D=3D self.task_id, > > >> TI.execution_date =3D=3D self.execution_date, > > >> ).first() > > >> > > >> And to use =E2=80=9Clock_for_update=3DTrue=E2=80=9D in TaskInstance.= run. This locks the > > >> record for update (with a FOR UPDATE) in the database (Postgres, Mys= ql > > >=3D > > >> 5.7) and ensures a =E2=80=9Cselect=E2=80=9D needs to wait if it incl= udes this record > in > > its > > >> results. As soon as a =E2=80=9Csession.commit()=E2=80=9D occurs the = lock is gone. The > > >> window of for the race condition to occur is this much shorter now > (the > > >> time between a new scheduler run and the time it takes for the pytho= n > > >> interpreter to start =E2=80=9Cairflow run=E2=80=9D) though not entir= ely closed. > > >> > > >> Due to the lock their *might* be a small performance hit, but it > should > > be > > >> very small: only queries that include the aforementioned locked reco= rd > > will > > >> be delayed for a few milliseconds. However, we might overlook > something > > so > > >> I kindly request to review this PR, so we can make it part of 1.7.1. > > >> > > >> Thanks > > >> Bolke > > >> > > >> > > >> > > > --001a113eae10a0d0f9053232c6ed--