From: Nicholas Hodgkinson
Date: Mon, 27 Mar 2017 12:26:49 -0700
Subject: Re: Scheduler silently dies
To: Airflow Dev List

from airflow.cfg:

[core]
...
executor = LocalExecutor
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = True
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 16
...

Pretty much the defaults; I've never tweaked these values.

-N
nik.hodgkinson@collectivehealth.com

On Mon, Mar 27, 2017 at 12:12 PM, Gerard Toonstra wrote:

> So it looks like the LocalWorkers are dying. Airflow does not recover
> from that.
>
> In SchedulerJob (jobs.py), you can see the "_execute_helper" function.
> This calls "executor.start()", which is implemented in local_executor.py
> in your case.
>
> The LocalExecutor is thus an object owned by the SchedulerJob. This
> executor creates x (parallelism) LocalWorkers, which derive from a
> multiprocessing.Process class. So the processes you see "extra" on the
> scheduler are those LocalWorkers as child processes. The LocalWorkers
> create additional processes through a shell ("subprocess.check_call"
> with shell=True), which are the things doing the actual work.
>
> Before that, on my 'master' here, the LocalWorker issues a
> self.logger.info("{} running {}" ...), which you can find in the general
> output of the scheduler log file. When starting the scheduler with
> "airflow scheduler", it's what gets printed on the console and starts
> with "Starting the scheduler". That is the file you want to investigate.
>
> If anything bad happens with general processing, then it prints a:
>
>     self.logger.error("failed to execute task {}:".format(str(e)))
>
> in the exception handler. I'd grep for that "failed to execute task" in
> the scheduler log file I mentioned.
>
> I'm not sure where stdout/stderr go for these workers. If the call
> basically succeeded, but there were issues with the queue handling, then
> I'd expect this to go to stderr instead. I'm not 100% sure if that gets
> sent to the same scheduler log file or whether that goes nowhere because
> of it being a child process (they're probably inherited?).
>
> One further question: what's your parallelism set to? I see 22 zombies
> left behind. Is that your setting?
>
> Let us know!
>
> Rgds,
>
> Gerard
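(A rough, illustrative sketch of the worker pattern Gerard describes above;
this is not Airflow's actual local_executor.py, and the class name, queue
wiring, and command strings below are assumptions for illustration only.
Each worker is a multiprocessing.Process that pulls shell commands off a
queue and runs them with subprocess.check_call(shell=True); a worker that
dies and is never reaped by the scheduler process is what shows up as a
defunct python child.)

# Illustrative sketch only -- not Airflow's real local_executor.py.  It
# mimics the structure described above: worker processes pull shell commands
# off a queue and run them via subprocess.check_call(..., shell=True).
import logging
import multiprocessing
import subprocess

logging.basicConfig(level=logging.INFO)


class LocalWorkerSketch(multiprocessing.Process):
    """Hypothetical stand-in for Airflow's LocalWorker."""

    def __init__(self, task_queue, result_queue):
        super(LocalWorkerSketch, self).__init__()
        self.task_queue = task_queue      # commands to run, e.g. an "airflow run ..." string
        self.result_queue = result_queue  # (command, state) reported back to the executor
        self.logger = logging.getLogger(__name__)

    def run(self):
        while True:
            command = self.task_queue.get()
            if command is None:           # poison pill: executor is shutting down
                break
            self.logger.info("%s running %s", self.name, command)
            try:
                subprocess.check_call(command, shell=True)
                state = "success"
            except subprocess.CalledProcessError as e:
                self.logger.error("failed to execute task %s:", str(e))
                state = "failed"
            self.result_queue.put((command, state))


if __name__ == "__main__":
    tasks, results = multiprocessing.Queue(), multiprocessing.Queue()
    workers = [LocalWorkerSketch(tasks, results) for _ in range(4)]  # cf. parallelism
    for w in workers:
        w.start()
    tasks.put("echo 'pretend this is an airflow run command'")
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()  # children that are never joined/reaped linger as zombies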
> On Mon, Mar 27, 2017 at 8:13 PM, harish singh wrote:
>
> > 1.8: increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue
> > (although I'm not sure why task progress has become slow. But that's not
> > the issue we are discussing here, so I am ignoring it.)
> >
> > 1.7: our prod is running 1.7 and we haven't seen the "defunct process"
> > issue for more than a week now. But we saw something very close to what
> > Nicholas provided (LocalExecutor; we do not use --num-runs).
> > Not sure if a cpu/memory limit may lead to this issue. Often when we hit
> > this issue (which stalled the pipeline), we either increased the memory
> > and/or moved airflow to a bulkier (cpu) instance.
> >
> > Sorry for the late reply. Was out of town over the weekend.
> >
> > On Mon, Mar 27, 2017 at 10:47 AM, Nicholas Hodgkinson <
> > nik.hodgkinson@collectivehealth.com> wrote:
> >
> > > 1.7.1.3, however it seems this is still an issue in 1.8 according to
> > > other posters. I'll upgrade today.
> > > Yes, LocalExecutor.
> > > Will remove -n 10
> > >
> > > -N
> > > nik.hodgkinson@collectivehealth.com
> > >
> > > On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin wrote:
> > >
> > > > Is this:
> > > >
> > > > 1. On 1.8.0? 1.7.1 is not supported anymore.
> > > > 2. LocalExecutor?
> > > >
> > > > You are running with N=10, can you try running without it?
> > > >
> > > > B.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson wrote:
> > > > >
> > > > > Ok, I'm not sure how helpful this is and I'm working on getting
> > > > > some more information, but here's some preliminary data:
> > > > >
> > > > > Process tree (`ps axjf`):
> > > > > 1 2391 2391 2391 ? -1 Ssl 999 0:13 /usr/bin/python
> > > > > usr/local/bin/airflow scheduler -n 10
> > > > > 2391 2435 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2436 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2437 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2438 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2439 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2440 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2441 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2442 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2443 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2444 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2454 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2456 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2457 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2458 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2459 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2460 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2461 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2462 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2463 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2464 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2465 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > > 2391 2466 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python]
> > > > >
> > > > > # gdb python 2391
> > > > > Reading symbols from python...Reading symbols from
> > > > > /usr/lib/debug//usr/bin/python2.7...done.
> > > > > done.
> > > > > Attaching to program: /usr/bin/python, process 2391
> > > > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols
> > > > > from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > > > done.
> > > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > > 0x00007f0c1bbb9670 in ?? ()
> > > > > (gdb) bt
> > > > > #0  0x00007f0c1bbb9670 in ?? ()
> > > > > #1  0x00007f0c1bf1a000 in ?? ()
> > > > > #2  0x00007f0c12099b45 in ?? ()
> > > > > #3  0x00000000032dbe00 in ?? ()
> > > > > #4  0x0000000000000000 in ?? ()
> > > > > (gdb) py-bt
> > > > > (gdb) py-list
> > > > > Unable to locate python frame
> > > > >
> > > > > I know that's not super helpful, but it's information; I've also
> > > > > tried pyrasite, but got nothing from it of any use. This problem
> > > > > occurs for me very often and I'm happy to provide a modified
> > > > > environment in which to capture info if anyone has a suggestion.
> > > > > For now I need to restart my process and get my jobs running again.
> > > > >
> > > > > -N
> > > > > nik.hodgkinson@collectivehealth.com
> > > > >
> > > > > On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra
> > > > > <gtoonstra@gmail.com> wrote:
> > > > >
> > > > >>> By the way, I remember that the scheduler would only spawn one
> > > > >>> or three processes, but I may be wrong.
> > > > >>> Right now when I start, it spawns 7 separate processes for the
> > > > >>> scheduler (8 total) with some additional ones spawned when the
> > > > >>> dag file processor starts.
> > > > >>
> > > > >> These other processes were executor processes. Hopefully with the
> > > > >> tips below someone who's getting this error regularly can attach
> > > > >> and dump the thread stack and we see what's going on.
> > > > >>
> > > > >> Rgds,
> > > > >>
> > > > >> Gerard
> > > > >>
> > > > >>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin
> > > > >>> <bdbruin@gmail.com> wrote:
> > > > >>>
> > > > >>>> In case you *think* you have encountered a scheduler *hang*,
> > > > >>>> please provide a strace on the parent process, provide process
> > > > >>>> list output that shows defunct scheduler processes, and provide
> > > > >>>> *all* logging (main logs, scheduler processing log, task logs),
> > > > >>>> preferably in debug mode (settings.py). Also show memory limits,
> > > > >>>> cpu count and airflow.cfg.
> > > > >>>>
> > > > >>>> Thanks
> > > > >>>> Bolke
> > > > >>>>
> > > > >>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin wrote:
> > > > >>>>>
> > > > >>>>> Please specify what "stop doing its job" means. It doesn't log
> > > > >>>>> anything anymore? If it does, the scheduler hasn't died and
> > > > >>>>> hasn't stopped.
> > > > >>>>>
> > > > >>>>> B.
> > > > >>>>>
> > > > >>>>>> On 24 Mar 2017, at 18:20, Gael Magnan wrote:
> > > > >>>>>>
> > > > >>>>>> We encountered the same kind of problem with the scheduler
> > > > >>>>>> that stopped doing its job even after rebooting. I thought
> > > > >>>>>> changing the start date or the state of a task instance might
> > > > >>>>>> be to blame, but I've never been able to pinpoint the problem
> > > > >>>>>> either.
> > > > >>>>>>
> > > > >>>>>> We are using celery and docker, if it helps.
> > > > >>>>>>
> > > > >>>>>> On Sat, Mar 25, 2017 at 01:53, Bolke de Bruin
> > > > >>>>>> <bdbruin@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>>> We are running *without* num runs for over a year (and never
> > > > >>>>>>> have). It is a very elusive issue which has not been
> > > > >>>>>>> reproducible.
> > > > >>>>>>>
> > > > >>>>>>> I like more info on this but it needs to be very elaborate,
> > > > >>>>>>> even to the point of access to the system exposing the
> > > > >>>>>>> behavior.
> > > > >>>>>>>
> > > > >>>>>>> Bolke
> > > > >>>>>>>
> > > > >>>>>>> Sent from my iPhone
> > > > >>>>>>>
> > > > >>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> We literally have a cron job that restarts the scheduler
> > > > >>>>>>>> every 30 min. Num runs didn't work consistently in rc4;
> > > > >>>>>>>> sometimes it would restart itself and sometimes we'd end up
> > > > >>>>>>>> with a few zombie scheduler processes and things would get
> > > > >>>>>>>> stuck. Also running locally, without celery.
> > > > >>>>>>>>
> > > > >>>>>>>>> On Mar 24, 2017 16:02, wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> We have max runs set and still hit this. Our solution is
> > > > >>>>>>>>> dumber: monitoring log output, and killing the scheduler
> > > > >>>>>>>>> if it stops emitting. Works like a charm.
> > > > >>>>>>>>>
> > > > >>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu
> > > > >>>>>>>>>> <fhakan.koklu@gmail.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Some solutions to this problem are restarting the
> > > > >>>>>>>>>> scheduler frequently or some sort of monitoring on the
> > > > >>>>>>>>>> scheduler. We have set up a dag that pings cronitor (a
> > > > >>>>>>>>>> dead man's snitch type of service) every 10 minutes and
> > > > >>>>>>>>>> the snitch pages you when the scheduler dies and does not
> > > > >>>>>>>>>> send a ping to it.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips
> > > > >>>>>>>>>> <aphillips@qrmedia.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>>> We use celery and run into it from time to time.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same
> > > > >>>>>>>>>>> underlying cause...
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Regards
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> ap
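(A minimal sketch of the heartbeat-DAG approach F. Hakan Koklu describes
above: a small DAG on a short schedule whose only task pings an external
dead man's snitch, so the service pages you when pings stop arriving. The
monitor URL, dag_id, and schedule below are placeholders, not a prescribed
setup.)

# Hedged sketch of a scheduler "dead man's snitch" heartbeat DAG.
# The cronitor-style URL below is a placeholder -- substitute your own
# monitor's ping URL.  If the scheduler stops scheduling, pings stop
# arriving and the monitoring service alerts you.
from datetime import datetime, timedelta

import requests

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def ping_snitch():
    # Any HTTP request the monitoring service counts as a heartbeat will do.
    requests.get("https://cronitor.link/YOUR-MONITOR-CODE/complete", timeout=10)


dag = DAG(
    dag_id="scheduler_heartbeat",
    start_date=datetime(2017, 3, 1),
    schedule_interval=timedelta(minutes=10),  # cf. "every 10 minutes" above
)

heartbeat = PythonOperator(
    task_id="ping_snitch",
    python_callable=ping_snitch,
    dag=dag,
)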