airflow-commits mailing list archives

From "Joseph Harris (JIRA)" <>
Subject [jira] [Created] (AIRFLOW-1941) Scheduler / executor loses tasks on restart when enforcing parallelism limit
Date Tue, 19 Dec 2017 11:31:01 GMT
Joseph Harris created AIRFLOW-1941:

             Summary: Scheduler / executor loses tasks on restart when enforcing parallelism
                 Key: AIRFLOW-1941
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.8.1, 1.9.0
         Environment: Linux
            Reporter: Joseph Harris

When running the scheduler with a limited number of cycles, e.g.:
{{airflow scheduler -n 30}}
and with {{parallelism = 32}} set in airflow.cfg:

The Executor checks that {{len(self.running) < PARALLELISM}} before calling {{execute_async()}}.
When {{self.running}} stays full for an extended period, the scheduler can exit without having
scheduled the remaining tasks in {{self.queued_tasks}}. When it restarts, those lost tasks
are never scheduled again and remain stuck in the queued state until manually kicked.
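The failure mode can be sketched with a toy executor (illustrative names only, not actual Airflow code): tasks are queued in-memory, the heartbeat only submits while under the parallelism cap, and anything still in the queue when the process exits is simply gone.

```python
# Minimal sketch of how a parallelism-capped executor heartbeat can strand
# queued tasks. ToyExecutor is hypothetical; it only mirrors the shape of
# the BaseExecutor bookkeeping described above.
PARALLELISM = 2

class ToyExecutor:
    def __init__(self):
        self.queued_tasks = {}   # task_id -> command, held only in memory
        self.running = set()     # task_ids already handed to execute_async()

    def queue_command(self, task_id, command):
        self.queued_tasks[task_id] = command

    def heartbeat(self):
        # Only submit while under the parallelism cap.
        while self.queued_tasks and len(self.running) < PARALLELISM:
            task_id, _command = self.queued_tasks.popitem()
            self.running.add(task_id)   # execute_async() would happen here

executor = ToyExecutor()
for i in range(5):
    executor.queue_command(f"task_{i}", "run")
executor.heartbeat()

# self.running is now full. If the scheduler exits here (e.g. because its
# `-n` cycle limit was reached), the three tasks left in queued_tasks are
# lost on restart: they sit in the 'queued' state with no one to submit them.
print(len(executor.running), len(executor.queued_tasks))  # 2 3
```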

We ran into this when exiting tasks with clashing PIDs caused the CeleryExecutor's
{{self.running}} to fill up with zombie jobs that could never complete.

* The Executor should not hold 'queued' tasks for an extended period of time, since it may
exit at any moment for any reason. The parallelism constraint should be checked alongside
the other task dependencies.
* When shutting down 'gracefully', the scheduler should at least log a warning if any tasks
remain in {{self.queued_tasks}}.
* Parallelism should be set to infinity if a queue-based/distributed executor is being used
(riskier).
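The second suggestion is cheap to implement. A minimal sketch, assuming an executor exposing the {{self.queued_tasks}} dict described above (the function name and shutdown hook are hypothetical, not existing Airflow API):

```python
import logging

log = logging.getLogger(__name__)

def warn_stranded_tasks(queued_tasks):
    """Called from a hypothetical scheduler shutdown hook: surface any
    tasks the executor accepted but never submitted, and return their ids
    so callers can act on them (e.g. requeue on next start)."""
    if not queued_tasks:
        return []
    stranded = sorted(queued_tasks)
    log.warning(
        "Executor shutting down with %d task(s) still queued: %s",
        len(stranded), stranded,
    )
    return stranded

# Example: two tasks were queued but never made it past the parallelism cap.
stranded = warn_stranded_tasks({"dag.task_a": "run", "dag.task_b": "run"})
```

Even just the warning would make the stuck-in-'queued' symptom diagnosable from the scheduler logs instead of requiring a manual database inspection.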

This may be a common cause of tasks getting stuck in the 'queued' state when running Celery.

Although AIRFLOW-900 is resolved in 1.9.0, this issue is still present: the scheduler can
still exit without having scheduled all queued tasks.
