airflow-dev mailing list archives

From Bolke de Bruin <bdbr...@gmail.com>
Subject Re: Scheduler silently dies
Date Mon, 27 Mar 2017 17:40:39 GMT
Is this:

1. On 1.8.0? 1.7.1 is not supported anymore. 
2. Using the LocalExecutor?

You are running with -n 10; can you try running without it?

B. 

Sent from my iPhone

> On 27 Mar 2017, at 10:28, Nicholas Hodgkinson <nik.hodgkinson@collectivehealth.com> wrote:
> 
> Ok, I'm not sure how helpful this is and I'm working on getting some more
> information, but here's some preliminary data:
> 
> Process tree (`ps axjf`):
>    1  2391  2391  2391 ?           -1 Ssl    999   0:13 /usr/bin/python usr/local/bin/airflow scheduler -n 10
> 2391  2435  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2436  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2437  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2438  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2439  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2440  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2441  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2442  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2443  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2444  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2454  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2456  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2457  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2458  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2459  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2460  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2461  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2462  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2463  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2464  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2465  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
> 2391  2466  2391  2391 ?           -1 Z      999   0:00  \_ [/usr/bin/python] <defunct>
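
A process listing like the one above can be checked for defunct children automatically. The following is only a minimal sketch: it assumes `ps axjf`-style rows whose first seven columns are PPID, PID, PGID, SID, TTY, TPGID and STAT, and the function name `defunct_children` and the shortened sample text are hypothetical.

```python
def defunct_children(ps_output, parent_pid):
    """Return PIDs of zombie (STAT starting with 'Z') children of parent_pid.

    Assumes `ps axjf`-style rows: PPID PID PGID SID TTY TPGID STAT ...
    """
    zombies = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything too short to parse.
        if len(fields) < 7 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        ppid, pid, stat = int(fields[0]), int(fields[1]), fields[6]
        if ppid == parent_pid and stat.startswith("Z"):
            zombies.append(pid)
    return zombies


# Shortened, illustrative sample in the same column layout as the listing.
SAMPLE = r"""
    1  2391  2391  2391 ?  -1 Ssl  999  0:13 /usr/bin/python airflow scheduler -n 10
 2391  2435  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
 2391  2436  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
"""

print(defunct_children(SAMPLE, 2391))
```

On a live host the text could come from `subprocess.check_output(["ps", "axjf"])`, decoded to a string before being passed in.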
> 
> # gdb python 2391
> Reading symbols from python...Reading symbols from
> /usr/lib/debug//usr/bin/python2.7...done.
> done.
> Attaching to program: /usr/bin/python, process 2391
> Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from
> /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x00007f0c1bbb9670 in ?? ()
> (gdb) bt
> #0  0x00007f0c1bbb9670 in ?? ()
> #1  0x00007f0c1bf1a000 in ?? ()
> #2  0x00007f0c12099b45 in ?? ()
> #3  0x00000000032dbe00 in ?? ()
> #4  0x0000000000000000 in ?? ()
> (gdb) py-bt
> (gdb) py-list
> Unable to locate python frame
> 
> I know that's not super helpful, but it's information; I've also tried
> pyrasite, but got nothing from it of any use. This problem occurs for me
> very often and I'm happy to provide a modified environment in which to
> capture info if anyone has a suggestion. For now I need to restart my
> process and get my jobs running again.
> 
> -N
> nik.hodgkinson@collectivehealth.com
> 
> 
> On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra <gtoonstra@gmail.com> wrote:
> 
>>> 
>>> 
>>> By the way, I remember that the scheduler would only spawn one or three
>>> processes, but I may be wrong.
>>> Right now when I start, it spawns 7 separate processes for the scheduler
>>> (8 total) with some additional
>>> ones spawned when the dag file processor starts.
>>> 
>>> 
>> These other processes were executor processes. Hopefully with the tips
>> below someone who's getting this error
>> regularly can attach and dump the thread stack and we see what's going on.
>> 
>> Rgds,
>> 
>> Gerard
>> 
>> 
>>> 
>>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>> 
>>>> In case you *think* you have encountered a scheduler *hang*, please provide
>>>> an strace of the parent process, process list output that shows
>>>> defunct scheduler processes, and *all* logging (main logs,
>>>> scheduler processing log, task logs), preferably in debug mode
>>>> (settings.py). Also show memory limits, CPU count and airflow.cfg.
>>>> 
>>>> Thanks
>>>> Bolke
>>>> 
>>>> 
>>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>> 
>>>>> Please specify what “stop doing its job” means. It doesn’t log anything
>>>>> anymore? If it does, the scheduler hasn’t died and hasn’t stopped.
>>>>> 
>>>>> B.
>>>>> 
>>>>> 
>>>>>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmagnan@gmail.com> wrote:
>>>>>> 
>>>>>> We encountered the same kind of problem with the scheduler that stopped
>>>>>> doing its job even after rebooting. I thought changing the start date or
>>>>>> the state of a task instance might be to blame but I've never been able to
>>>>>> pinpoint the problem either.
>>>>>> 
>>>>>> We are using celery and docker if it helps.
>>>>>> 
>>>>>> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin <bdbruin@gmail.com> wrote:
>>>>>> 
>>>>>>> We have been running *without* num runs for over a year (and never have). It is
>>>>>>> a very elusive issue which has not been reproducible.
>>>>>>>
>>>>>>> I would like more info on this, but it needs to be very elaborate, even to the
>>>>>>> point of access to the system exposing the behavior.
>>>>>>> 
>>>>>>> Bolke
>>>>>>> 
>>>>>>> Sent from my iPhone
>>>>>>> 
>>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vijay@change.org> wrote:
>>>>>>>> 
>>>>>>>> We literally have a cron job that restarts the scheduler every 30 min. Num
>>>>>>>> runs didn't work consistently in rc4, sometimes it would restart itself and
>>>>>>>> sometimes we'd end up with a few zombie scheduler processes and things
>>>>>>>> would get stuck. Also running locally, without celery.
>>>>>>>> 
>>>>>>>>> On Mar 24, 2017 16:02, <lrohde@quartethealth.com> wrote:
>>>>>>>>> 
>>>>>>>>> We have max runs set and still hit this. Our solution is dumber:
>>>>>>>>> monitor log output, and kill the scheduler if it stops emitting. Works
>>>>>>>>> like a charm.
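
The log-watching approach described above might look roughly like the following. This is only a sketch, not the poster's actual tooling: it treats "stopped emitting" as "the log file's modification time is older than a threshold", and the function names and the threshold parameter are hypothetical.

```python
import os
import signal
import time


def seconds_since_modified(path, now=None):
    """Seconds elapsed since `path` was last written to."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stalled(log_path, max_silence_seconds, now=None):
    """True if the log has been silent for longer than the threshold."""
    return seconds_since_modified(log_path, now) > max_silence_seconds


def kill_if_stalled(pid, log_path, max_silence_seconds):
    """Send SIGTERM to `pid` if its log has gone quiet; return True if signalled."""
    if is_stalled(log_path, max_silence_seconds):
        os.kill(pid, signal.SIGTERM)
        return True
    return False
```

Something like `kill_if_stalled(scheduler_pid, scheduler_log, 300)` could be run from cron or a supervisor loop, with the supervisor responsible for starting the scheduler again. The threshold needs to be longer than the scheduler's normal quiet periods.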
>>>>>>>>> 
>>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.koklu@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Some solutions to this problem are restarting the scheduler frequently or
>>>>>>>>>> some sort of monitoring on the scheduler. We have set up a dag that pings
>>>>>>>>>> cronitor <https://cronitor.io/> (a dead man's snitch type of service) every
>>>>>>>>>> 10 minutes and the snitch pages you when the scheduler dies and does not
>>>>>>>>>> send a ping to it.
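
The task inside such a heartbeat DAG could be as small as the sketch below. It is an assumption-laden illustration: the thread does not give the monitor's ping URL, so any URL passed in is a placeholder, `ping_heartbeat` is a hypothetical name, and it uses Python 3's `urllib.request`. Wiring it into a DAG on a 10-minute schedule is left out.

```python
import urllib.request


def ping_heartbeat(url, timeout=10, opener=urllib.request.urlopen):
    """GET the monitor URL; return True on an HTTP 2xx, False on any failure.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

The alerting relies on absence: if the scheduler stops, the task never runs, no ping arrives, and the external service raises the alarm.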
>>>>>>>>>> 
>>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphillips@qrmedia.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> We use celery and run into it from time to time.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>>>>>>>>> cause...
>>>>>>>>>>> 
>>>>>>>>>>> Regards
>>>>>>>>>>> 
>>>>>>>>>>> ap
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
