From: Dan Stoner
Date: Thu, 14 Feb 2019 14:52:41 -0500
Subject: Re: 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
To: dev@airflow.apache.org

More info!

It appears that the Celery executor will silently fail if the credentials to a postgres result_backend are not valid.

For example, we see:

[2019-02-13 20:45:21,132] {{models.py:1353}} INFO - Dependencies not met for , dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-02-13 20:45:09.088978+00:00.
[2019-02-13 20:45:21,132] {{models.py:1353}} INFO - Dependencies not met for , dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2019-02-13 20:45:21,135] {{logging_mixin.py:95}} INFO - [2019-02-13 20:45:21,134] {{jobs.py:2514}} INFO - Task is not able to be run

but no database connection failure anywhere in the logs.

After fixing our connection string (via AIRFLOW__CELERY__RESULT_BACKEND or result_backend in airflow.cfg), these issues went away.

Sorry I cannot produce a more solid bug report, but hopefully this is a breadcrumb for someone.

Dan Stoner

On Wed, Feb 13, 2019 at 10:16 PM Dan Stoner wrote:
>
> We saw this but the task instance state was generally "SUCCESS".
>
> In our case, we thought it was due to Redis being used as the results
> store. There is a WARNING against this right in the operational logs.
> Google Cloud Composer is surprisingly set up in this fashion.
>
> We went back to running our own infrastructure and using postgres as
> the results store, and those issues have not occurred since.
>
> The real downside we saw to this error was that our workers were
> highly underutilized, we were getting terrible overall data
> throughput, and the workers kept trying to run these tasks they
> couldn't actually run.
>
> - Dan Stoner
>
>
> On Wed, Feb 13, 2019 at 4:16 PM Kevin Lam wrote:
> >
> > Friendly ping on the above! Has anyone encountered this by chance?
> >
> > We're still seeing it occasionally on longer running tasks.
> >
> > On Tue, Nov 20, 2018 at 10:31 AM Kevin Lam wrote:
> >
> > > Hi,
> > >
> > > We run Apache Airflow in Kubernetes in a manner very similar to what is
> > > outlined in puckel/docker-airflow [1] (Celery Executor, Redis for
> > > messaging, Postgres).
> > >
> > > Lately, we've encountered some of our Tasks getting stuck in a running
> > > state, and printing out the errors:
> > >
> > > [2018-11-20 05:31:23,009] {models.py:1329} INFO - Dependencies not met for , dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2018-11-19 23:29:11.974497+00:00.
> > > [2018-11-20 05:31:23,016] {models.py:1329} INFO - Dependencies not met for , dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
> > >
> > > Is there any way to avoid this? Does anyone know what causes this issue?
> > >
> > > This is quite problematic. The task is stuck in running state without
> > > making any progress when the above error occurs, and so turning on
> > > retries doesn't help with getting our DAGs to reliably run to completion.
> > >
> > > Thanks!
> > >
> > > [1] https://github.com/puckel/docker-airflow
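
P.S. Since the thread above reports no connection error in the logs when the result backend credentials are wrong, here is a minimal sketch of how one might check the connection string by hand before starting workers. It is not part of Airflow or Celery. The function names are made up for illustration, and `connect` is a hypothetical caller-supplied callable (for example a thin wrapper around your database driver) that raises on failure. The only outside fact it relies on is that Celery's database result backend URLs carry a "db+" prefix in front of the SQLAlchemy-style URL.

```python
from urllib.parse import urlsplit


def to_sqlalchemy_url(result_backend):
    """Strip the "db+" prefix Celery uses for database result backends."""
    prefix = "db+"
    if result_backend.startswith(prefix):
        return result_backend[len(prefix):]
    return result_backend


def describe_backend(result_backend):
    """Return the URL parts worth eyeballing, without echoing the password."""
    parts = urlsplit(to_sqlalchemy_url(result_backend))
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "has_password": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


def check_backend(result_backend, connect):
    """Attempt a connection and report the outcome instead of hiding it.

    `connect` is a hypothetical callable supplied by the caller: it takes
    the SQLAlchemy-style URL and raises an exception if it cannot connect.
    """
    try:
        connect(to_sqlalchemy_url(result_backend))
    except Exception as exc:  # surface any driver error to the operator
        return False, "{}: {}".format(type(exc).__name__, exc)
    return True, "connected"
```

To use it, pass the value of AIRFLOW__CELERY__RESULT_BACKEND (or result_backend from airflow.cfg) to describe_backend to confirm the host, user, and database are what you expect, then call check_backend with a connect function for your driver.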