From: George Leslie-Waksman
Date: Fri, 04 Aug 2017 17:24:41 +0000
Subject: Re: Completed tasks not being marked as completed
To: dev@airflow.incubator.apache.org

Pretty sure (not 100%) what is happening is:

1. Scheduler bugs result in task getting scheduled twice
2. Worker 1 grabs task
3. Worker 2 grabs task
4. Worker 1 starts task
5. Worker 2 starts task
6. Worker 2 sees that Worker 1 has started and plans to abort
7. Worker 1 finishes and marks task as done
8. Worker 2 finishes aborting and marks task as not done
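A minimal sketch of why step 8 can overwrite step 7 when the state check and the state write are not atomic (hypothetical code, not Airflow's implementation; the table and state names are only illustrative):

    # Hypothetical, simplified model of the sequence above; NOT Airflow's
    # actual code. Two workers read the same task_instance row, both decide
    # to act, and the aborting worker's terminal write lands last.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT)")
    db.execute("INSERT INTO task_instance VALUES ('my_task', 'queued')")

    def read_state(task_id):
        return db.execute(
            "SELECT state FROM task_instance WHERE task_id = ?", (task_id,)
        ).fetchone()[0]

    def write_state(task_id, state):
        db.execute(
            "UPDATE task_instance SET state = ? WHERE task_id = ?", (state, task_id)
        )

    # Steps 2-5: both workers check before either has written 'running',
    # so both believe they own the task.
    assert read_state("my_task") == "queued"   # Worker 1 checks
    assert read_state("my_task") == "queued"   # Worker 2 checks
    write_state("my_task", "running")          # both set running

    # Step 7: Worker 1 finishes and records success.
    write_state("my_task", "success")

    # Step 8: Worker 2 finishes aborting and writes a non-success state last,
    # overwriting Worker 1's result ("marks task as not done").
    write_state("my_task", "up_for_retry")

    print(read_state("my_task"))  # prints 'up_for_retry', not 'success'

Max's point below about using database transactions as the task instance sets itself to RUNNING is the usual guard against this: if the second worker had to take a row lock (for example SELECT ... FOR UPDATE) before deciding, it would see the first worker's state instead of racing it.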
On Fri, Jul 28, 2017 at 3:50 PM Marc Weil wrote:

> Hey Max,
>
> Thanks for the suggestions. I believe it was a retry (I'm using remote logging so I can only check after the task completes), but the UI never reported it as such. The latest_heartbeat column is in the jobs table, and interestingly I do see some running jobs that haven't heartbeated for ~22 minutes. They are LocalTaskJob instances with CeleryExecutor properly listed as the executor class. I can't really correlate these to a specific task instance, however, as there doesn't appear to be any key written to the jobs table (the dag_id column is all null, and there's no task_id column or anything).
>
> Any ideas on what could be making these tasks stop heartbeating regularly? That could explain why eventually (after an overnight period of time) everything is marked as finished in the Airflow UI: eventually these tasks do heartbeat again, but quite long after they are finished running.
>
> Thanks again!
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>
> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
>
> > Are you sure there hasn't been a retry at that point? [One of] the expected behaviors is the one I described, where if a task finishes without reporting its success [or failure], it will stay marked as RUNNING but will fail to emit a heartbeat (which is a timestamp updated in the task_instance table as last_heartbeat or something). The scheduler monitors for RUNNING tasks without a heartbeat and eventually will handle the failure (send emails, call on_failure_callback, ...).
> >
> > Looking for the heartbeat in the DB might give you some clues as to what is going on.
> >
> > Also, there have been versions where we'd occasionally see double triggering and double firing, which can be confusing. Then you can have different processes reporting their status, and debugging those issues can be problematic. I think there's good prevention against that now, using database transactions as the task instance sets itself as RUNNING. I'm not sure if 1.8.0 is 100% clean in that regard.
> >
> > Max
> >
> > On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil wrote:
> >
> > > It happens mostly when the scheduler is catching up. More specifically, when I load a brand new DAG with a start date in the past. Usually I have it set to run 5 DAG runs at the same time, and up to 16 tasks at the same time.
> > >
> > > What I've also noticed is that the tasks will sit completed in reality but uncompleted in the Airflow DB for many hours, but if I just leave them all sitting there overnight they all tend to be marked complete the next morning. Perhaps this points to some sort of Celery timeout or connection retry interval?
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
> > >
> > > On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
> > >
> > > > By the time "INFO - Task exited with return code 0" gets logged, the task should have been marked as successful by the subprocess. I have no specific intuition as to what the issue may be.
> > > >
> > > > I'm guessing at that point the job stops emitting heartbeats and eventually the scheduler will handle it as a failure?
> > > >
> > > > How often does that happen?
> > > >
> > > > Max
> > > >
> > > > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil wrote:
> > > >
> > > > > From what I can tell, it only affects CeleryExecutor. I've never seen this behavior with LocalExecutor before.
> > > > >
> > > > > Max, do you know anything about this type of failure mode?
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
> > > > >
> > > > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <thejonas@gmail.com> wrote:
> > > > >
> > > > > > We have the exact same problem. In our case, it's a bash operator starting a docker container. The container and the process it ran exit, but the 'docker run' command is still showing up in the process table, waiting for an event.
> > > > > >
> > > > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > > > >
> > > > > > _jonas
> > > > > >
> > > > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Has anyone seen the behavior when using CeleryExecutor where workers will finish their tasks ("INFO - Task exited with return code 0" shows in the logs) but are never marked as complete in the Airflow DB or UI? Effectively this causes tasks to hang even though they are complete, and the DAG will not continue.
> > > > > > >
> > > > > > > This is happening on 1.8.0. Anyone else seen this or perhaps have a workaround?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > --
> > > > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
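Max's suggestion to look for the heartbeat in the DB can be followed directly against the metadata database. A rough sketch, assuming a SQLAlchemy connection and the job table layout Marc describes (a state column plus latest_heartbeat); the connection string, table name, and column names are assumptions, not verified against 1.8.0, so adjust them to your schema:

    # Rough sketch: list jobs still marked as running whose heartbeat looks
    # stale. All names below are assumptions based on the thread; verify them
    # against your own Airflow metadata database before relying on this.
    from datetime import datetime, timedelta

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # assumed DSN

    STALE_AFTER = timedelta(minutes=5)  # threshold well above the heartbeat interval

    with engine.connect() as conn:
        rows = conn.execute(
            text(
                "SELECT id, job_type, executor_class, latest_heartbeat "
                "FROM job "
                "WHERE state = 'running' AND latest_heartbeat < :cutoff"
            ),
            {"cutoff": datetime.utcnow() - STALE_AFTER},
        )
        for row in rows:
            print(row)

Rows that match for more than a few minutes are the "running but silent" jobs Max describes; in this thread they correspond to the LocalTaskJob entries Marc saw going ~22 minutes without a heartbeat.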