airflow-commits mailing list archives

From "Nidhi (Jira)" <>
Subject [jira] [Commented] (AIRFLOW-203) Scheduler fails to reliably schedule tasks when many dag runs are triggered
Date Wed, 11 Dec 2019 21:27:00 GMT


Nidhi commented on AIRFLOW-203:

I am facing the same issue: I have around 60,000 tasks inside one DAG. When I trigger the
DAG, its tasks are not scheduled and the DAG stays in the Running state. Please let me know
if you know how to solve this. I am using the Celery Executor and have tried changing "dagbag_import_timeout"
and "max_threads", but neither has helped in my case.
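
For reference, a minimal sketch of where the two settings named above would sit in airflow.cfg, and how to read them back to confirm what is configured. The section placement ([core] for dagbag_import_timeout, [scheduler] for max_threads) is an assumption based on Airflow 1.10-style configuration, and the values and the helper name read_scheduler_settings are hypothetical placeholders, not recommendations. Note that dagbag_import_timeout only limits how long importing a DAG file may take, so changing it would not be expected to affect how many task instances get queued.

```python
import configparser

# Hypothetical airflow.cfg fragment. Section names are assumed from
# Airflow 1.10-style configs; the values are placeholders only.
SAMPLE_CFG = """
[core]
executor = CeleryExecutor
dagbag_import_timeout = 120

[scheduler]
max_threads = 4
"""


def read_scheduler_settings(cfg_text):
    """Return the two settings mentioned in the comment as integers."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "dagbag_import_timeout": parser.getint("core", "dagbag_import_timeout"),
        "max_threads": parser.getint("scheduler", "max_threads"),
    }


settings = read_scheduler_settings(SAMPLE_CFG)
print(settings)
```

Reading the file back this way only confirms which values are set; it does not show whether the running scheduler has picked them up.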

Any help in solving this issue would be appreciated.


> Scheduler fails to reliably schedule tasks when many dag runs are triggered
> ---------------------------------------------------------------------------
>                 Key: AIRFLOW-203
>                 URL:
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions:
>            Reporter: Sergei Iakhnin
>            Priority: Major
>         Attachments: airflow.cfg, airflow_scheduler_non_working.log, airflow_scheduler_working.log
> Using Airflow with Celery, Rabbitmq, and Postgres backend. Running 1 master node and
> 115 worker nodes, each with 8 cores. The workflow consists of series of 27 tasks, some of
> which are nearly instantaneous and some take hours to complete. Dag runs are manually triggered,
> about 3000 at a time, resulting in roughly 75 000 tasks.
> My observations are that the scheduling behaviour is extremely inconsistent, i.e. about
> 1000 tasks get scheduled and executed and then no new tasks get scheduled after that. Sometimes
> it is enough to restart the scheduler for new tasks to get scheduled, sometimes the scheduler
> and worker services need to be restarted multiple times to get any progress. When I look at
> the scheduler output it seems to be chugging away at trying to schedule tasks with messages
> "2016-06-01 11:28:25,908] {} INFO - Adding to queue: airflow run ..."
> However, these tasks do not show up in queued status on the UI and don't actually get
> scheduled out to the workers (nor make it into the rabbitmq queue, or the task_instance table).
> It is unclear what may be causing this behaviour as no errors are produced anywhere.
> The impact is especially high when short-running tasks are concerned because the cluster should
> be able to blow through them within a couple of minutes, but instead it takes hours of manual
> restarts to get through them.
> I'm happy to share logs or any other useful debug output as desired.
> Thanks in advance.
> Sergei.

