airflow-dev mailing list archives

From harish singh <harish.sing...@gmail.com>
Subject Re: Scheduler silently dies
Date Sat, 25 Mar 2017 02:07:53 GMT
We had been using Airflow 1.7 for over a year and never faced this issue.
The moment we switched to 1.8, I think we hit it. The reason I say "I
think" is that I am not sure it is the same issue, but whenever I restart
the scheduler, my pipeline proceeds.



Airflow 1.7: Having said that, in 1.7 I did face a similar issue (fewer
than five times over the year): I saw a lot of processes marked "<defunct>"
with the parent process being the scheduler.

Somebody mentioned it in this JIRA:
https://issues.apache.org/jira/browse/AIRFLOW-401

Workaround: restart the scheduler.
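
For anyone hitting this, a quick zombie check before restarting can confirm
the symptom. A minimal sketch using only the Python standard library (the
reliance on `ps` output format is the only assumption):

    import subprocess

    def count_defunct():
        # `ps` prints one "STAT ARGS" line per process; zombies have STAT 'Z...'
        out = subprocess.check_output(['ps', '-eo', 'stat,args'])
        return sum(1 for line in out.splitlines()[1:]
                   if line.split(None, 1)[0].startswith('Z'))

    if count_defunct() > 0:
        print('found <defunct> processes; time to restart the scheduler')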




Airflow 1.8: The issue in 1.8 may be different from the issue in 1.7, but
again the issue gets resolved and the pipeline progresses on a SCHEDULER
RESTART. In case it helps, this is the trace in 1.8:
[2017-03-22 19:35:16,332] {models.py:167} INFO - Filling up the DagBag from /usr/local/airflow/pipeline/pipeline.py
[2017-03-22 19:35:22,451] {airflow_configuration.py:40} INFO - loading setup.cfg file
[2017-03-22 19:35:51,041] {timeout.py:37} ERROR - Process timed out
[2017-03-22 19:35:51,041] {models.py:266} ERROR - Failed to import: /usr/local/airflow/pipeline/pipeline.py
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 263, in process_file
    m = imp.load_source(mod_name, filepath)
  File "/usr/local/airflow/pipeline/pipeline.py", line 167, in <module>
    create_tasks(dbguid, version, dag, override_start_date)
  File "/usr/local/airflow/pipeline/pipeline.py", line 104, in create_tasks
    t = create_task(dbguid, dag, taskInfo, version, override_date)
  File "/usr/local/airflow/pipeline/pipeline.py", line 85, in create_task
    retries, 1, depends_on_past, version, override_dag_date)
  File "/usr/local/airflow/pipeline/dags/base_pipeline.py", line 90, in create_python_operator
    depends_on_past=depends_on_past)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 86, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/operators/python_operator.py", line 65, in __init__
    super(PythonOperator, self).__init__(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/decorators.py", line 70, in wrapper
    sig = signature(func)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 105, in signature
    return Signature.from_function(obj)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 594, in from_function
    __validate_parameters__=False)
  File "/usr/local/lib/python2.7/dist-packages/funcsigs/__init__.py", line 518, in __init__
    for param in parameters))
  File "/usr/lib/python2.7/collections.py", line 52, in __init__
    self.__update(*args, **kwds)
  File "/usr/lib/python2.7/_abcoll.py", line 548, in update
    self[key] = value
  File "/usr/lib/python2.7/collections.py", line 61, in __setitem__
    last[1] = root[0] = self.__map[key] = [last, root, key]
  File "/usr/local/lib/python2.7/dist-packages/airflow/utils/timeout.py", line 38, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
AirflowTaskTimeout: Timeout
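
The "Process timed out" line comes from the DagBag import timeout: the
scheduler gives each DAG file a fixed number of seconds to parse before
raising AirflowTaskTimeout. If parsing is genuinely slow, raising the limit
in airflow.cfg is a workaround (the value below is an assumption), though it
would not explain the scheduler dying:

    [core]
    # seconds allowed for importing one DAG file (default is 30)
    dagbag_import_timeout = 120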




On Fri, Mar 24, 2017 at 5:45 PM, Bolke de Bruin <bdbruin@gmail.com> wrote:

> We have been running *without* num runs for over a year (and have never
> used it). It is a very elusive issue that has not been reproducible.
>
> I would like more info on this, but it needs to be very elaborate, even to
> the point of access to the system exposing the behavior.
>
> Bolke
>
> Sent from my iPhone
>
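
For reference, "num runs" above is the scheduler flag that makes it exit
after a fixed number of scheduling loops so that a supervisor can start it
fresh; a sketch, with an arbitrary count:

    # exit after 10 scheduler loops; a supervisor or cron job starts it again
    airflow scheduler --num_runs 10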
> > On 24 Mar 2017, at 16:04, Vijay Ramesh <vijay@change.org> wrote:
> >
> > We literally have a cron job that restarts the scheduler every 30
> > minutes. Num runs didn't work consistently in rc4: sometimes the
> > scheduler would restart itself, and sometimes we'd end up with a few
> > zombie scheduler processes and things would get stuck. Also running
> > locally, without celery.
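
A sketch of such a crontab entry; the kill-and-restart commands, log path,
and environment handling are assumptions, not necessarily Vijay's setup:

    # restart the scheduler every 30 minutes
    */30 * * * * pkill -f "airflow scheduler"; sleep 5; nohup airflow scheduler >> /usr/local/airflow/logs/scheduler.log 2>&1 &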
> >
> >> On Mar 24, 2017 16:02, <lrohde@quartethealth.com> wrote:
> >>
> >> We have max runs set and still hit this. Our solution is dumber: we
> >> monitor the log output and kill the scheduler if it stops emitting.
> >> Works like a charm.
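
A minimal sketch of that watchdog idea, checking the log file's modification
time; the log path, the five-minute threshold, and relying on something else
to restart the scheduler are all assumptions:

    import os
    import subprocess
    import time

    LOG = '/usr/local/airflow/logs/scheduler.log'  # assumed location
    SILENCE = 300                                  # seconds of quiet before we act

    if time.time() - os.path.getmtime(LOG) > SILENCE:
        # the scheduler has gone quiet; kill it and let cron or a
        # supervisor bring it back up
        subprocess.call(['pkill', '-f', 'airflow scheduler'])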
> >>
> >> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.koklu@gmail.com>
> >> wrote:
> >>>
> >>> Some solutions to this problem are restarting the scheduler
> >>> frequently or some sort of monitoring on the scheduler. We have set
> >>> up a DAG that pings cronitor <https://cronitor.io/> (a dead man's
> >>> snitch type of service) every 10 minutes, and the snitch pages you
> >>> when the scheduler dies and stops sending pings to it.
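
A sketch of such a canary DAG against the Airflow 1.x API; the monitor URL
is a placeholder:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id='scheduler_canary',
        start_date=datetime(2017, 1, 1),
        schedule_interval='*/10 * * * *',  # one ping every 10 minutes
    )

    # if the scheduler dies, the pings stop and the snitch alerts
    ping = BashOperator(
        task_id='ping_cronitor',
        bash_command='curl -fsS https://cronitor.link/YOUR-MONITOR-CODE/complete',
        dag=dag,
    )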
> >>>
> >>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphillips@qrmedia.com>
> >>> wrote:
> >>>
> >>>>> We use celery and run into it from time to time.
> >>>>
> >>>> Bang goes my theory ;-) At least, assuming it's the same underlying
> >>>> cause...
> >>>>
> >>>> Regards
> >>>>
> >>>> ap
> >>>>
> >>
>
