From: Nicholas Hodgkinson
Date: Mon, 27 Mar 2017 12:26:49 -0700
Subject: Re: Scheduler silently dies
To: Airflow Dev List

from airflow.cfg:

[core]
...
executor = LocalExecutor
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = True
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 16
...

Pretty much the defaults; I've never tweaked these values.

-N
nik.hodgkinson@collectivehealth.com

On Mon, Mar 27, 2017 at 12:12 PM, Gerard Toonstra wrote:

> So it looks like the LocalWorkers are dying. Airflow does not recover
> from that.
>
> In SchedulerJob (jobs.py), you can see the "_execute_helper" function.
> This calls "executor.start()", which is implemented in local_executor.py
> in your case.
>
> The LocalExecutor is thus an object owned by the SchedulerJob. This
> executor creates x (parallelism) LocalWorkers, which derive from a
> multiprocessing.Process class. So the processes you see "extra" on the
> scheduler are those LocalWorkers as child processes. The LocalWorkers
> create additional processes through a shell ("subprocess.check_call"
> with shell=True), which are the things doing the actual work.
>
> Before that, on my 'master' here, the LocalWorker issues a
> self.logger.info("{} running {}" ...), which you can find in the general
> output of the scheduler log file. When starting the scheduler with
> "airflow scheduler", it's what gets printed on the console and starts
> with "Starting the scheduler". That is the file you want to investigate.
>
> If anything bad happens with general processing, then it prints a:
>
>     self.logger.error("failed to execute task {}:".format(str(e)))
>
> in the exception handler. I'd grep for that "failed to execute task" in
> the scheduler log file I mentioned.
>
> I'm not sure where stdout/stderr go for these workers. If the call
> basically succeeded, but there were issues with the queue handling, then
> I'd expect this to go to stderr instead. I'm not 100% sure if that gets
> sent to the same scheduler log file or whether that goes nowhere because
> of it being a child process (they're probably inherited?).
>
> One further question: what's your parallelism set to? I see 22 zombies
> left behind. Is that your setting?
>
> Let us know!
>
> Rgds,
>
> Gerard
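(A rough, illustrative sketch of the worker pattern Gerard describes above;
this is not Airflow's actual local_executor.py, and the class name, queue
wiring, and command strings below are assumptions for illustration only.
Each worker is a multiprocessing.Process that pulls shell commands off a
queue and runs them with subprocess.check_call(shell=True); a worker that
dies and is never reaped by the scheduler process is what shows up as a
defunct python child.)

# Illustrative sketch only -- not Airflow's real local_executor.py.  It
# mimics the structure described above: worker processes pull shell commands
# off a queue and run them via subprocess.check_call(..., shell=True).
import logging
import multiprocessing
import subprocess

logging.basicConfig(level=logging.INFO)


class LocalWorkerSketch(multiprocessing.Process):
    """Hypothetical stand-in for Airflow's LocalWorker."""

    def __init__(self, task_queue, result_queue):
        super(LocalWorkerSketch, self).__init__()
        self.task_queue = task_queue      # commands to run, e.g. an "airflow run ..." string
        self.result_queue = result_queue  # (command, state) reported back to the executor
        self.logger = logging.getLogger(__name__)

    def run(self):
        while True:
            command = self.task_queue.get()
            if command is None:           # poison pill: executor is shutting down
                break
            self.logger.info("%s running %s", self.name, command)
            try:
                subprocess.check_call(command, shell=True)
                state = "success"
            except subprocess.CalledProcessError as e:
                self.logger.error("failed to execute task %s:", str(e))
                state = "failed"
            self.result_queue.put((command, state))


if __name__ == "__main__":
    tasks, results = multiprocessing.Queue(), multiprocessing.Queue()
    workers = [LocalWorkerSketch(tasks, results) for _ in range(4)]  # cf. parallelism
    for w in workers:
        w.start()
    tasks.put("echo 'pretend this is an airflow run command'")
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()  # children that are never joined/reaped linger as zombies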
> On Mon, Mar 27, 2017 at 8:13 PM, harish singh wrote:
>
> > 1.8: increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue
> > (although I'm not sure why task progress has become slow. But that's not
> > the issue we are discussing here, so I am ignoring it.)
> >
> > 1.7: our prod is running 1.7 and we haven't seen the "defunct process"
> > issue for more than a week now. But we saw something very close to what
> > Nicholas provided (LocalExecutor; we do not use --num-runs).
> > Not sure if a cpu/memory limit may lead to this issue. Often when we hit
> > this issue (which stalled the pipeline), we either increased the memory
> > and/or moved airflow to a bulkier (cpu) instance.
> >
> > Sorry for the late reply. Was out of town over the weekend.
> >
> > On Mon, Mar 27, 2017 at 10:47 AM, Nicholas Hodgkinson <
> > nik.hodgkinson@collectivehealth.com> wrote:
> >
> > > 1.7.1.3, however it seems this is still an issue in 1.8 according to
> > > other posters. I'll upgrade today.
> > > Yes, LocalExecutor.
> > > Will remove -n 10
> > >
> > > -N
> > > nik.hodgkinson@collectivehealth.com
> > >
> > > On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin wrote:
> > >
> > > > Is this:
> > > >
> > > > 1. On 1.8.0? 1.7.1 is not supported anymore.
> > > > 2. LocalExecutor?
> > > >
> > > > You are running with N=10, can you try running without it?
> > > >
> > > > B.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson wrote:
> > > > >
> > > > > Ok, I'm not sure how helpful this is and I'm working on getting
> > > > > some more information, but here's some preliminary data:
> > > > >
> > > > > Process tree (`ps axjf`):
> > > > > 1 2391 2391 2391 ? -1 Ssl 999 0:13 /usr/bin/python
> > > > > usr/local/bin/airflow scheduler -n 10
> > > > > 2391 2435 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2436 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2437 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2438 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2439 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2440 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2441 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2442 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2443 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2444 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2454 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2456 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2457 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2458 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2459 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2460 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2461 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2462 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2463 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2464 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2465 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2466 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > >
> > > > > # gdb python 2391
> > > > > Reading symbols from python...Reading symbols from
> > > > > /usr/lib/debug//usr/bin/python2.7...done.
> > > > > done.
> > > > > Attaching to program: /usr/bin/python, process 2391
> > > > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
> > > > > from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > > > done.
> > > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > > 0x00007f0c1bbb9670 in ?? ()
> > > > > (gdb) bt
> > > > > #0  0x00007f0c1bbb9670 in ?? ()
> > > > > #1  0x00007f0c1bf1a000 in ?? ()
> > > > > #2  0x00007f0c12099b45 in ?? ()
> > > > > #3  0x00000000032dbe00 in ?? ()
> > > > > #4  0x0000000000000000 in ?? ()
> > > > > (gdb) py-bt
> > > > > (gdb) py-list
> > > > > Unable to locate python frame
> > > > >
> > > > > I know that's not super helpful, but it's information; I've also
> > > > > tried pyrasite, but got nothing from it of any use. This problem
> > > > > occurs for me very often and I'm happy to provide a modified
> > > > > environment in which to capture info if anyone has a suggestion.
> > > > > For now I need to restart my process and get my jobs running again.
> > > > >
> > > > > -N
> > > > > nik.hodgkinson@collectivehealth.com
> > > > >
> > > > > On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra
> > > > > <gtoonstra@gmail.com> wrote:
> > > > >
> > > > >>> By the way, I remember that the scheduler would only spawn one
> > > > >>> or three processes, but I may be wrong.
> > > > >>> Right now when I start, it spawns 7 separate processes for the
> > > > >>> scheduler (8 total) with some additional ones spawned when the
> > > > >>> dag file processor starts.
> > > > >>
> > > > >> These other processes were executor processes. Hopefully with the
> > > > >> tips below someone who's getting this error regularly can attach
> > > > >> and dump the thread stack and we see what's going on.
> > > > >>
> > > > >> Rgds,
> > > > >>
> > > > >> Gerard
> > > > >>
> > > > >>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin
> > > > >>> <bdbruin@gmail.com> wrote:
> > > > >>>
> > > > >>>> In case you *think* you have encountered a scheduler *hang*,
> > > > >>>> please provide a strace on the parent process, provide process
> > > > >>>> list output that shows defunct scheduler processes, and provide
> > > > >>>> *all* logging (main logs, scheduler processing log, task logs),
> > > > >>>> preferably in debug mode (settings.py). Also show memory limits,
> > > > >>>> cpu count and airflow.cfg.
> > > > >>>>
> > > > >>>> Thanks
> > > > >>>> Bolke
> > > > >>>>
> > > > >>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin wrote:
> > > > >>>>>
> > > > >>>>> Please specify what "stop doing its job" means. It doesn't log
> > > > >>>>> anything anymore? If it does, the scheduler hasn't died and
> > > > >>>>> hasn't stopped.
> > > > >>>>>
> > > > >>>>> B.
> > > > >>>>>
> > > > >>>>>> On 24 Mar 2017, at 18:20, Gael Magnan wrote:
> > > > >>>>>>
> > > > >>>>>> We encountered the same kind of problem with the scheduler
> > > > >>>>>> that stopped doing its job even after rebooting. I thought
> > > > >>>>>> changing the start date or the state of a task instance might
> > > > >>>>>> be to blame, but I've never been able to pinpoint the problem
> > > > >>>>>> either.
> > > > >>>>>>
> > > > >>>>>> We are using celery and docker, if it helps.
> > > > >>>>>>
> > > > >>>>>> On Sat, Mar 25, 2017 at 01:53, Bolke de Bruin
> > > > >>>>>> <bdbruin@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>>> We are running *without* num runs for over a year (and never
> > > > >>>>>>> have). It is a very elusive issue which has not been
> > > > >>>>>>> reproducible.
> > > > >>>>>>>
> > > > >>>>>>> I like more info on this but it needs to be very elaborate,
> > > > >>>>>>> even to the point of access to the system exposing the
> > > > >>>>>>> behavior.
> > > > >>>>>>>
> > > > >>>>>>> Bolke
> > > > >>>>>>>
> > > > >>>>>>> Sent from my iPhone
> > > > >>>>>>>
> > > > >>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> We literally have a cron job that restarts the scheduler
> > > > >>>>>>>> every 30 min. Num runs didn't work consistently in rc4;
> > > > >>>>>>>> sometimes it would restart itself and sometimes we'd end up
> > > > >>>>>>>> with a few zombie scheduler processes and things would get
> > > > >>>>>>>> stuck. Also running locally, without celery.
> > > > >>>>>>>>
> > > > >>>>>>>>> On Mar 24, 2017 16:02, wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> We have max runs set and still hit this. Our solution is
> > > > >>>>>>>>> dumber: monitoring log output, and killing the scheduler
> > > > >>>>>>>>> if it stops emitting. Works like a charm.
> > > > >>>>>>>>>
> > > > >>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu
> > > > >>>>>>>>>> <fhakan.koklu@gmail.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Some solutions to this problem are restarting the
> > > > >>>>>>>>>> scheduler frequently or some sort of monitoring on the
> > > > >>>>>>>>>> scheduler. We have set up a dag that pings cronitor (a
> > > > >>>>>>>>>> dead man's snitch type of service) every 10 minutes and
> > > > >>>>>>>>>> the snitch pages you when the scheduler dies and does not
> > > > >>>>>>>>>> send a ping to it.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips
> > > > >>>>>>>>>> <aphillips@qrmedia.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>>> We use celery and run into it from time to time.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same
> > > > >>>>>>>>>>> underlying cause...
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Regards
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> ap
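(A minimal sketch of the heartbeat-DAG approach F. Hakan Koklu describes
above: a small DAG on a short schedule whose only task pings an external
dead man's snitch, so the service pages you when pings stop arriving. The
monitor URL, dag_id, and schedule below are placeholders, not a prescribed
setup.)

# Hedged sketch of a scheduler "dead man's snitch" heartbeat DAG.
# The cronitor-style URL below is a placeholder -- substitute your own
# monitor's ping URL.  If the scheduler stops scheduling, pings stop
# arriving and the monitoring service alerts you.
from datetime import datetime, timedelta

import requests

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def ping_snitch():
    # Any HTTP request the monitoring service counts as a heartbeat will do.
    requests.get("https://cronitor.link/YOUR-MONITOR-CODE/complete", timeout=10)


dag = DAG(
    dag_id="scheduler_heartbeat",
    start_date=datetime(2017, 3, 1),
    schedule_interval=timedelta(minutes=10),  # cf. "every 10 minutes" above
)

heartbeat = PythonOperator(
    task_id="ping_snitch",
    python_callable=ping_snitch,
    dag=dag,
)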