airflow-commits mailing list archives

From "Bolke de Bruin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-203) Scheduler fails to reliably schedule tasks when many dag runs are triggered
Date Wed, 01 Jun 2016 15:04:59 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310444#comment-15310444 ]

Bolke de Bruin commented on AIRFLOW-203:
----------------------------------------

To start with, an airflow-scheduler.log covering a reasonable time frame. If possible, some
logs from the workers as well. A sanitized config would also be great.
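
One possible way to cut a scheduler log down to a reasonable time frame is sketched below. It is only an illustration: the line layout is assumed from the single "Adding to queue" message quoted in the report, and the names LINE_RE, lines_in_window and count_queue_attempts are hypothetical helpers, not part of Airflow.

```python
import re
from datetime import datetime

# Assumed line shape, inferred from the one message quoted in the report:
#   [2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: ...
# The leading "[" is optional so that truncated copies still parse.
LINE_RE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]\s"
    r"\{(?P<src>[^}]+)\}\s(?P<level>[A-Z]+)\s-\s(?P<msg>.*)$"
)


def lines_in_window(lines, start, end):
    """Yield (timestamp, source, level, message) for lines with start <= timestamp <= end."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip tracebacks and anything else that does not fit the assumed shape
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        if start <= ts <= end:
            yield ts, match.group("src"), match.group("level"), match.group("msg")


def count_queue_attempts(lines, start, end):
    """Count 'Adding to queue' messages inside the window (hypothetical helper)."""
    return sum(
        1
        for _, _, _, msg in lines_in_window(lines, start, end)
        if msg.startswith("Adding to queue")
    )
```

Passing an open log file as `lines` together with two `datetime` bounds gives the entries for that window. Comparing the count of queue attempts against what actually reaches the broker or the task_instance table might help narrow down where tasks go missing.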

> Scheduler fails to reliably schedule tasks when many dag runs are triggered
> ---------------------------------------------------------------------------
>
>                 Key: AIRFLOW-203
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-203
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: Airflow 1.7.1.2
>            Reporter: Sergei Iakhnin
>
> Using Airflow with Celery, RabbitMQ, and a Postgres backend. Running 1 master node and
> 115 worker nodes, each with 8 cores. The workflow consists of a series of 27 tasks, some of
> which are nearly instantaneous and some of which take hours to complete. Dag runs are manually
> triggered, about 3000 at a time, resulting in roughly 75 000 tasks.
> My observations are that the scheduling behaviour is extremely inconsistent, i.e. about
> 1000 tasks get scheduled and executed and then no new tasks get scheduled after that. Sometimes
> it is enough to restart the scheduler for new tasks to get scheduled; sometimes the scheduler
> and worker services need to be restarted multiple times to make any progress. When I look at
> the scheduler output it seems to be chugging away at trying to schedule tasks, with messages
> like:
> "[2016-06-01 11:28:25,908] {base_executor.py:34} INFO - Adding to queue: airflow run ..."
> However, these tasks do not show up in queued status in the UI and don't actually get
> scheduled out to the workers (nor do they make it into the RabbitMQ queue or the task_instance
> table).
> It is unclear what may be causing this behaviour, as no errors are produced anywhere.
> The impact is especially high where short-running tasks are concerned, because the cluster
> should be able to blow through them within a couple of minutes, but instead it takes hours of
> manual restarts to get through them.
> I'm happy to share logs or any other useful debug output as desired.
> Thanks in advance.
> Sergei.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
