airflow-dev mailing list archives

From Nicholas Hodgkinson <nik.hodgkin...@collectivehealth.com>
Subject Re: Scheduler silently dies
Date Mon, 27 Mar 2017 19:26:49 GMT
from airflow.cfg:

[core]
...
executor = LocalExecutor
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = True
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 16
...

Pretty much the defaults; I've never tweaked these values.



-N
nik.hodgkinson@collectivehealth.com

On Mon, Mar 27, 2017 at 12:12 PM, Gerard Toonstra <gtoonstra@gmail.com>
wrote:

> So it looks like the LocalWorkers are dying, and Airflow does not recover
> from that.
>
>
> In SchedulerJob (jobs.py), you can see the "_execute_helper" function.
> This calls "executor.start()", which in your case is implemented
> in local_executor.py.
>
> The LocalExecutor is thus an object owned by the SchedulerJob. This
> executor creates x (parallelism) LocalWorkers, which derive from
> multiprocessing.Process. So the "extra" processes you see on the
> scheduler are those LocalWorkers running as child processes. The
> LocalWorkers create additional processes through a shell
> ("subprocess.check_call" with shell=True), which are the things doing
> the actual work.
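>
> A simplified sketch of that structure (from memory, not a verbatim copy
> of local_executor.py; names and details are approximate):
>
>     import logging
>     import multiprocessing
>     import subprocess
>
>     class LocalWorker(multiprocessing.Process):
>         """One of the child processes you see under the scheduler."""
>
>         def __init__(self, task_queue):
>             super(LocalWorker, self).__init__()
>             self.task_queue = task_queue
>             self.logger = logging.getLogger(__name__)
>
>         def run(self):
>             while True:
>                 command = self.task_queue.get()
>                 if command is None:
>                     break  # poison pill: the executor is shutting down
>                 try:
>                     # the shell child process that does the actual work
>                     subprocess.check_call(command, shell=True)
>                 except subprocess.CalledProcessError as e:
>                     self.logger.error("failed to execute task {}:".format(str(e)))
>
>     class LocalExecutor(object):
>         """Owned by the SchedulerJob; started from _execute_helper()."""
>
>         def __init__(self, parallelism):
>             self.queue = multiprocessing.JoinableQueue()
>             self.workers = [LocalWorker(self.queue) for _ in range(parallelism)]
>
>         def start(self):
>             for worker in self.workers:
>                 worker.start()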
>
>
> Before that, on my 'master' here, the LocalWorker issues a
> self.logger.info("{} running {}"), which you can find in the general
> output of the scheduler log file. When starting the scheduler with
> "airflow scheduler", that is what gets printed on the console and starts
> with "Starting the scheduler". That is the file you want to investigate.
>
> If anything bad happens with general processing, then it prints a:
>
>                 self.logger.error("failed to execute task {}:".format(str(e)))
>
> in the exception handler. I'd grep for that "failed to execute task" in the
> scheduler log file I mentioned.
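>
> For example, something along these lines (substitute the actual path of
> your own scheduler log file):
>
>     grep -n "failed to execute task" <scheduler log file>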
>
>
> I'm not sure where stdout/stderr go for these workers. If the call
> basically succeeded, but there were issues with the queue handling,
> then I'd expect this to go to stderr instead. I'm not 100% sure if that
> gets sent to the same scheduler log file or whether that goes nowhere
> because of it being a child process (they're probably inherited?).
>
>
> One further question: what's your parallelism set to?  I see 22 zombies
> left behind. Is that your setting?
>
> Let us know!
>
> Rgds,
>
> Gerard
>
>
>
> On Mon, Mar 27, 2017 at 8:13 PM, harish singh <harish.singh22@gmail.com>
> wrote:
>
> > 1.8:  increasing DAGBAG_IMPORT_TIMEOUT helps; I don't see the issue
> > anymore (although I'm not sure why task progress has become slow, but
> > that's not the issue we are discussing here, so I am ignoring it).
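> >
> > (For reference, the knob I mean is the dagbag_import_timeout setting
> > under [core] in airflow.cfg; the value below is just an example, tune it
> > for your own environment.)
> >
> >     [core]
> >     dagbag_import_timeout = 120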
> >
> > 1.7:  our prod is running 1.7 and we haven't seen the "defunct process"
> > issue for more than a week now. But we saw something very close to what
> > Nicholas provided (LocalExecutor; we do not use --num-runs).
> > Not sure if a cpu/memory limit may lead to this issue. Often when we hit
> > this issue (which stalled the pipeline), we either increased the memory
> > and/or moved airflow to a bulkier (cpu) instance.
> >
> > Sorry for a late reply. Was out of town over the weekend.
> >
> >
> >
> > On Mon, Mar 27, 2017 at 10:47 AM, Nicholas Hodgkinson <
> > nik.hodgkinson@collectivehealth.com> wrote:
> >
> > > 1.7.1.3; however, it seems this is still an issue in 1.8 according to
> > > other posters. I'll upgrade today.
> > > Yes, LocalExecutor.
> > > Will remove -n 10.
> > >
> > > -N
> > > nik.hodgkinson@collectivehealth.com
> > >
> > >
> > > On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin <bdbruin@gmail.com>
> > > wrote:
> > >
> > > > Is this:
> > > >
> > > > 1. On 1.8.0? 1.7.1 is not supported anymore.
> > > > 2. LocalExecutor?
> > > >
> > > > You are running with -n 10; can you try running without it?
> > > >
> > > > B.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson <nik.hodgkinson@
> > > > collectivehealth.com> wrote:
> > > > >
> > > > > Ok, I'm not sure how helpful this is and I'm working on getting some
> > > > > more information, but here's some preliminary data:
> > > > >
> > > > > Process tree (`ps axjf`):
> > > > >    1  2391  2391  2391 ?           -1 Ssl    999   0:13 /usr/bin/python usr/local/bin/airflow scheduler -n 10
> > > > > 2391  2435  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2436  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2437  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2438  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2439  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2440  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2441  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2442  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2443  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2444  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2454  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2456  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2457  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2458  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2459  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2460  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2461  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2462  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2463  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2464  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2465  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > > 2391  2466  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> > > > >
> > > > > # gdb python 2391
> > > > > Reading symbols from python...Reading symbols from
> > > > > /usr/lib/debug//usr/bin/python2.7...done.
> > > > > done.
> > > > > Attaching to program: /usr/bin/python, process 2391
> > > > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from
> > > > > /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > > > done.
> > > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > > 0x00007f0c1bbb9670 in ?? ()
> > > > > (gdb) bt
> > > > > #0  0x00007f0c1bbb9670 in ?? ()
> > > > > #1  0x00007f0c1bf1a000 in ?? ()
> > > > > #2  0x00007f0c12099b45 in ?? ()
> > > > > #3  0x00000000032dbe00 in ?? ()
> > > > > #4  0x0000000000000000 in ?? ()
> > > > > (gdb) py-bt
> > > > > (gdb) py-list
> > > > > Unable to locate python frame
> > > > >
> > > > > I know that's not super helpful, but it's information; I've also tried
> > > > > pyrasite, but got nothing from it of any use. This problem occurs for
> > > > > me very often and I'm happy to provide a modified environment in which
> > > > > to capture info if anyone has a suggestion. For now I need to restart
> > > > > my process and get my jobs running again.
> > > > >
> > > > > -N
> > > > > nik.hodgkinson@collectivehealth.com
> > > > >
> > > > >
> > > > > On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra <gtoonstra@gmail.com>
> > > > > wrote:
> > > > >
> > > > >>>
> > > > >>>
> > > > >>> By the way, I remember that the scheduler would only spawn one or
> > > > >>> three processes, but I may be wrong.
> > > > >>> Right now when I start, it spawns 7 separate processes for the
> > > > >>> scheduler (8 total) with some additional ones spawned when the dag
> > > > >>> file processor starts.
> > > > >>>
> > > > >>>
> > > > >> These other processes were executor processes. Hopefully with the
> > > > >> tips below, someone who's getting this error regularly can attach
> > > > >> and dump the thread stack, and we can see what's going on.
> > > > >>
> > > > >> Rgds,
> > > > >>
> > > > >> Gerard
> > > > >>
> > > > >>
> > > > >>>
> > > > >>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbruin@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> In case you *think* you have encountered a scheduler *hang*, please
> > > > >>>> provide a strace on the parent process, provide process list output
> > > > >>>> that shows defunct scheduler processes, and provide *all* logging
> > > > >>>> (main logs, scheduler processing log, task logs), preferably in
> > > > >>>> debug mode (settings.py). Also show memory limits, cpu count and
> > > > >>>> airflow.cfg.
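> > > > >>>>
> > > > >>>> For example, something along these lines against the scheduler's
> > > > >>>> parent pid (the pid and output path are just placeholders):
> > > > >>>>
> > > > >>>>     strace -f -tt -p <scheduler pid> -o /tmp/scheduler.strace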
> > > > >>>>
> > > > >>>> Thanks
> > > > >>>> Bolke
> > > > >>>>
> > > > >>>>
> > > > >>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbruin@gmail.com>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>> Please specify what “stop doing its job” means. It doesn’t log
> > > > >>>>> anything anymore? If it does, the scheduler hasn’t died and hasn’t
> > > > >>>>> stopped.
> > > > >>>>>
> > > > >>>>> B.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmagnan@gmail.com>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>> We encountered the same kind of problem with the scheduler that
> > > > >>>>>> stopped doing its job even after rebooting. I thought changing
> > > > >>>>>> the start date or the state of a task instance might be to blame
> > > > >>>>>> but I've never been able to pinpoint the problem either.
> > > > >>>>>>
> > > > >>>>>> We are using celery and docker if it helps.
> > > > >>>>>>
> > > > >>>>>> On Sat, Mar 25, 2017 at 01:53, Bolke de Bruin <bdbruin@gmail.com>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> We are running *without* num runs for over a year (and never
> > > > >>>>>>> have). It is a very elusive issue which has not been
> > > > >>>>>>> reproducible.
> > > > >>>>>>>
> > > > >>>>>>> I'd like more info on this, but it needs to be very elaborate,
> > > > >>>>>>> even to the point of access to the system exposing the behavior.
> > > > >>>>>>>
> > > > >>>>>>> Bolke
> > > > >>>>>>>
> > > > >>>>>>> Sent from my iPhone
> > > > >>>>>>>
> > > > >>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vijay@change.org>
> > > > >>>>>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> We literally have a cron job that restarts the scheduler every
> > > > >>>>>>>> 30 min. Num runs didn't work consistently in rc4; sometimes it
> > > > >>>>>>>> would restart itself and sometimes we'd end up with a few
> > > > >>>>>>>> zombie scheduler processes and things would get stuck. Also
> > > > >>>>>>>> running locally, without celery.
> > > > >>>>>>>>
> > > > >>>>>>>>> On Mar 24, 2017 16:02, <lrohde@quartethealth.com> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> We have max runs set and still hit this. Our solution is
> > > > >>>>>>>>> dumber: monitoring the log output, and killing the scheduler
> > > > >>>>>>>>> if it stops emitting. Works like a charm.
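> > > > >>>>>>>>>
> > > > >>>>>>>>> A rough sketch of that kind of watchdog (not our actual
> > > > >>>>>>>>> script; the log path and threshold are placeholders):
> > > > >>>>>>>>>
> > > > >>>>>>>>>     import os
> > > > >>>>>>>>>     import subprocess
> > > > >>>>>>>>>     import time
> > > > >>>>>>>>>
> > > > >>>>>>>>>     LOG_FILE = "<scheduler log file>"
> > > > >>>>>>>>>     STALE_AFTER = 300  # seconds without new log output
> > > > >>>>>>>>>
> > > > >>>>>>>>>     # run from cron; if the scheduler has stopped emitting,
> > > > >>>>>>>>>     # kill it so the supervisor (or cron) can restart it
> > > > >>>>>>>>>     if time.time() - os.path.getmtime(LOG_FILE) > STALE_AFTER:
> > > > >>>>>>>>>         subprocess.call(["pkill", "-f", "airflow scheduler"])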
> > > > >>>>>>>>>
> > > > >>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu
> > > > >>>>>>>>>> <fhakan.koklu@gmail.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Some solutions to this problem are restarting the scheduler
> > > > >>>>>>>>>> frequently or setting up some sort of monitoring on the
> > > > >>>>>>>>>> scheduler. We have set up a dag that pings cronitor
> > > > >>>>>>>>>> <https://cronitor.io/> (a dead man's snitch type of service)
> > > > >>>>>>>>>> every 10 minutes, and the snitch pages you when the scheduler
> > > > >>>>>>>>>> dies and does not send a ping to it.
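> > > > >>>>>>>>>>
> > > > >>>>>>>>>> A minimal sketch of that kind of heartbeat dag (the dag id,
> > > > >>>>>>>>>> schedule, and ping URL are placeholders for your own monitor):
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>     from datetime import datetime
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>     from airflow import DAG
> > > > >>>>>>>>>>     from airflow.operators.bash_operator import BashOperator
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>     dag = DAG(
> > > > >>>>>>>>>>         dag_id="scheduler_heartbeat",
> > > > >>>>>>>>>>         start_date=datetime(2017, 1, 1),
> > > > >>>>>>>>>>         schedule_interval="*/10 * * * *",  # every 10 minutes
> > > > >>>>>>>>>>     )
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>     # if the scheduler stops scheduling, the pings stop and
> > > > >>>>>>>>>>     # the dead man's snitch alerts you
> > > > >>>>>>>>>>     ping = BashOperator(
> > > > >>>>>>>>>>         task_id="ping_cronitor",
> > > > >>>>>>>>>>         bash_command="curl -fsS <your cronitor ping url>",
> > > > >>>>>>>>>>         dag=dag,
> > > > >>>>>>>>>>     )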
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips
> > > > >>>>>>>>>> <aphillips@qrmedia.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>>> We use celery and run into it from time to time.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same
> > > > >>>>>>>>>>> underlying cause...
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Regards
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> ap
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > > >
> > > >
> > >
> > >
> >
>

