airflow-commits mailing list archives

From "Joseph Harris (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRFLOW-1941) Scheduler / executor loses tasks on restart when enforcing parallelism limit
Date Tue, 19 Dec 2017 11:31:01 GMT
Joseph Harris created AIRFLOW-1941:
--------------------------------------

             Summary: Scheduler / executor loses tasks on restart when enforcing parallelism limit
                 Key: AIRFLOW-1941
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1941
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.8.1, 1.9.0
         Environment: Linux
            Reporter: Joseph Harris


When running the scheduler with a limited number of cycles, e.g.
{{airflow scheduler -n 30}}
and with {{PARALLELISM=32}} set in airflow.cfg:

The Executor checks that {{len(self.running) < PARALLELISM}} before calling {{execute_async()}}
https://github.com/apache/incubator-airflow/blob/master/airflow/executors/base_executor.py#L98
When {{self.running}} is full for an extended period of time, the scheduler can exit without
having handed the remaining tasks in {{self.queued_tasks}} to {{execute_async()}}. When it restarts,
the tasks that were left in {{self.queued_tasks}} don't get scheduled again, and stay stuck in the
queued state until manually kicked.
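
The failure mode can be reproduced with a toy model. This is only a sketch: {{ToyExecutor}} and {{run_scheduler}} are hypothetical stand-ins written for illustration, not Airflow's classes, and the limit is reduced to 2 to keep the example small. It assumes, as described above, that queued tasks live only in the executor's memory and are handed over only while {{len(self.running)}} is below the limit.

```python
from collections import OrderedDict

PARALLELISM = 2  # stands in for the airflow.cfg setting; kept small for illustration


class ToyExecutor:
    """Hypothetical stand-in for an executor; not Airflow's actual class."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.queued_tasks = OrderedDict()  # key -> command, held in memory only
        self.running = {}                  # key -> command

    def queue_command(self, key, command):
        self.queued_tasks[key] = command

    def heartbeat(self):
        # Hand over only as many tasks as there are free slots.
        open_slots = max(self.parallelism - len(self.running), 0)
        for _ in range(min(open_slots, len(self.queued_tasks))):
            key, command = self.queued_tasks.popitem(last=False)
            self.running[key] = command


def run_scheduler(executor, num_runs):
    """Run a fixed number of cycles, then exit; return keys still queued."""
    for _ in range(num_runs):
        executor.heartbeat()
    return list(executor.queued_tasks)


executor = ToyExecutor(PARALLELISM)
# Zombie jobs that never complete keep every slot occupied.
executor.running = {"zombie_1": "cmd", "zombie_2": "cmd"}
executor.queue_command("task_a", "cmd")
executor.queue_command("task_b", "cmd")

stranded = run_scheduler(executor, num_runs=30)
print(stranded)  # ['task_a', 'task_b']
```

In this model the stranded keys exist only in the exiting process, so nothing is left to tell a restarted scheduler that they still need to be sent to the executor.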

We experienced issues with this when exiting tasks with clashing PIDs caused the CeleryExecutor's
{{self.running}} to become full of zombie jobs that could not complete.


Possible improvements:
* The Executor should not hold 'queued' tasks for an extended period of time, as it may exit
for any reason. The parallelism constraint should be checked alongside other dependencies.
* When shutting down 'gracefully', the scheduler should at least log a warning if there are
any tasks left in {{self.queued_tasks}}
* Parallelism should be set to infinity if a queue-based/distributed executor is being used
(more risky)
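
As a rough illustration of the second suggestion, a shutdown hook could look something like the sketch below. The function name {{warn_on_stranded_tasks}} and the logger name are assumptions made for this example and do not exist in Airflow; the only thing taken from the description above is that the executor keeps pending work in a dict-like {{queued_tasks}}.

```python
import logging

log = logging.getLogger("scheduler_sketch")  # hypothetical logger name


def warn_on_stranded_tasks(queued_tasks, logger=log):
    """Log a warning if tasks are still held in memory; return their keys.

    `queued_tasks` is assumed to be a dict-like mapping of task key -> command.
    """
    stranded = list(queued_tasks)
    if stranded:
        logger.warning(
            "Exiting with %d task(s) still queued in the executor: %s",
            len(stranded),
            ", ".join(str(key) for key in stranded),
        )
    return stranded


# Example: call this just before the scheduler process exits.
logging.basicConfig(level=logging.WARNING)
leftover = warn_on_stranded_tasks({"task_a": "cmd", "task_b": "cmd"})
```

A warning like this would not stop tasks from getting stuck, but it would make the condition visible in the scheduler log instead of silent.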

This may be a common cause of tasks getting stuck in the 'queued' state when running Celery.

Although AIRFLOW-900 is resolved in 1.9.0, this issue is still present, and the scheduler
is still at risk of exiting without having scheduled tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
