From: Tao Feng
Date: Wed, 31 Jul 2019 10:25:11 -0700
Subject: Re: Removal of "run_duration" and its impact on orphaned tasks
To: dev@airflow.apache.org

Late to the game, as I wasn't aware of the `run_duration` option being
removed. But I just want to point out that Lyft has a setup very similar to
James': we run the scheduler for a fixed interval instead of in an infinite
loop, and let runit/supervisor restart the scheduler process. This is to
solve:

1. Orphaned tasks not getting cleaned up successfully when the scheduler
   runs in an infinite loop;
2. Making sure stale / deleted DAGs get cleaned up properly
   (https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438 ?).

I think if we go with removing this option and letting the scheduler run in
an infinite loop, we need to change the scheduler loop to handle the cleanup
process if it hasn't been done.

On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor wrote:

> Thanks for testing this out James, shame to discover we still have
> problems in that area. Do you have an idea of how many tasks per day we are
> talking about here?
>
> Your cluster schedules quite a large number of tasks over the day (in the
> 1k-10k range?) right?
>
> I'd say whatever causes a task to become orphaned _while_ the scheduler is
> still running is the actual bug, and running the orphan detection more
> often may just be replacing one patch (the run duration) with another one
> (running the orphan detection more than at start up).
>
> -ash
>
> > On 31 Jul 2019, at 16:43, James Meickle wrote:
> >
> > In my testing of 1.10.4rc3, I discovered that we were getting hit by a
> > process leak bug (which Ash has since fixed in 1.10.4rc4). This process
> > leak was minimal impact for most users, but was exacerbated in our case by
> > using "run_duration" to restart the scheduler every 10 minutes.
> >
> > To mitigate that issue while remaining on the RC, we removed the use of
> > "run_duration", since it is deprecated as of master anyways:
> > https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
> >
> > Unfortunately, testing on our cluster (1.10.4rc3 plus no "run_duration")
> > has revealed that while the process leak issue was mitigated, that we're
> > now facing issues with orphaned tasks. These tasks are marked as
> > "scheduled" by the scheduler, but _not_ successfully queued in Celery even
> > after multiple scheduler loops. Around ~24h after last restart, we start
> > having enough stuck tasks that the system starts paging and requires a
> > manual restart.
> >
> > Rather than generic "scheduler instability", this specific issue was one of
> > the reasons why we'd originally added the scheduler restart. But it appears
> > that on master, the orphaned task detection code still only runs on
> > scheduler start despite removing "run_duration":
> > https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
> >
> > Rather than immediately filing an issue I wanted to inquire a bit more
> > about why this orphan detection code is only run on scheduler start,
> > whether it would be safe to send in a PR to run it more often (maybe a
> > tunable parameter?), and if there's some other configuration issue with
> > Celery (in our case, backed by AWS Elasticache) that would cause us to see
> > orphaned tasks frequently.
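
For illustration only, here is a minimal sketch of the idea raised in this
thread: run the orphan cleanup once at scheduler start and then again at a
tunable interval inside the long-running loop. This is not Airflow's
scheduler code. `PeriodicCleanup`, `simulate`, and `orphan_check_interval`
are hypothetical names, and the cleanup callback is a stand-in for whatever
actually resets orphaned tasks.

```python
import time
from typing import Callable, List, Optional


class PeriodicCleanup:
    """Hypothetical helper: runs a callback on the first call and then
    at most once per `interval_seconds` on later calls."""

    def __init__(
        self,
        cleanup: Callable[[], None],
        interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cleanup = cleanup
        self._interval = interval_seconds
        self._clock = clock
        self._last_run: Optional[float] = None  # None means "never run yet"

    def maybe_run(self) -> bool:
        """Run the cleanup if it is due; return True if it ran."""
        now = self._clock()
        due = self._last_run is None or now - self._last_run >= self._interval
        if due:
            self._cleanup()
            self._last_run = now
        return due


def simulate(
    num_loops: int, loop_seconds: float, orphan_check_interval: float
) -> List[float]:
    """Simulate a scheduler loop with a fake clock (no real sleeping).

    Returns the simulated times at which the cleanup callback ran.
    """
    fake_now = [0.0]
    cleanup_times: List[float] = []
    guard = PeriodicCleanup(
        cleanup=lambda: cleanup_times.append(fake_now[0]),  # stand-in cleanup
        interval_seconds=orphan_check_interval,
        clock=lambda: fake_now[0],
    )
    for _ in range(num_loops):
        guard.maybe_run()            # cheap check on every loop iteration
        fake_now[0] += loop_seconds  # pretend one scheduling pass took this long
    return cleanup_times


if __name__ == "__main__":
    # Ten one-second loops with a three-second cleanup interval.
    print(simulate(10, 1.0, 3.0))  # prints [0.0, 3.0, 6.0, 9.0]
```

As Ash notes above, a periodic check like this would only work around
whatever causes tasks to become orphaned while the scheduler is running. It
does not address that underlying cause.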