airflow-commits mailing list archives

From "Zhen Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-1329) Problematic DAG cause worker queue saturated
Date Tue, 20 Jun 2017 17:42:00 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhen Zhang updated AIRFLOW-1329:
--------------------------------
    Description: 
We are seeing this weird issue in our production Airflow cluster:

# A user has a problematic import statement in a DAG definition.
# For some still unknown reason, our scheduler and workers have different PYTHONPATH settings, such that the scheduler is able to parse the DAG successfully but the workers fail on the import (a standalone import check like the sketch after this list reproduces the worker-side failure).
# What we observed is that, on the worker side, all the tasks in the problematic DAG stay in the "queued" state, while on the scheduler side, the scheduler keeps requeueing hundreds of thousands of duplicated task instances. As a result, it quickly saturates the worker queue and blocks normal tasks from running.
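
The import failure is easy to reproduce outside of Airflow. Below is a minimal check script that could be run on both the scheduler host and a worker host to compare import behaviour under each machine's PYTHONPATH; the DAG folder path is only an example and should be taken from dags_folder in airflow.cfg:

{code:python}
# check_dag_imports.py -- run on the scheduler host and on a worker host,
# then diff the output to spot PYTHONPATH-dependent import failures.
import importlib.util
import os
import sys

DAG_FOLDER = "/usr/local/airflow/dags"  # illustrative path; use your dags_folder

failures = 0
for root, _dirs, files in os.walk(DAG_FOLDER):
    for name in files:
        if not name.endswith(".py"):
            continue
        path = os.path.join(root, name)
        spec = importlib.util.spec_from_file_location(name[:-3], path)
        module = importlib.util.module_from_spec(spec)
        try:
            spec.loader.exec_module(module)
        except Exception as exc:  # this is the error the workers hit
            failures += 1
            print("FAILED to import %s: %r" % (path, exc))

print("%d DAG file(s) failed to import" % failures)
sys.exit(1 if failures else 0)
{code}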

I think a better way to handle this would be either to mark the user's task as failed, or to have the scheduler rate-limit requeued tasks, so that the cluster is left unaffected by user errors like this.
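
To illustrate the rate-limit option, the kind of guard I have in mind looks roughly like the sketch below. This is not existing Airflow scheduler code; MAX_REQUEUE_ATTEMPTS, should_requeue and the surrounding loop are made-up names for the example:

{code:python}
# Hypothetical sketch of a per-task requeue cap -- not existing Airflow code.
# The idea: stop re-sending a task instance to the worker queue after N
# attempts and mark it failed instead, so one broken DAG cannot saturate
# the queue for everyone else.
from collections import defaultdict

MAX_REQUEUE_ATTEMPTS = 10          # illustrative threshold
requeue_counts = defaultdict(int)  # keyed by (dag_id, task_id, execution_date)

def should_requeue(ti_key):
    """Return True while the task is under the cap, False once it should fail."""
    requeue_counts[ti_key] += 1
    return requeue_counts[ti_key] <= MAX_REQUEUE_ATTEMPTS

# Rough usage inside the scheduling loop (pseudocode):
# for ti_key in tasks_queued_but_never_started():
#     if should_requeue(ti_key):
#         send_to_worker_queue(ti_key)
#     else:
#         mark_failed(ti_key)  # surfaces the import error to the user instead
{code}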
 

  was:
We are seeing this weird issue in our production Airflow cluster:

# A user has a problematic import statement in a DAG definition.
# For some still unknown reason, our scheduler and workers have different PYTHONPATH settings, such that the scheduler is able to parse the DAG successfully but the workers fail on the import.
# What we observed is that, on the worker side, all the tasks in the problematic DAG stay in the "queued" state, while on the scheduler side, the scheduler keeps requeueing hundreds of thousands of tasks. As a result, it quickly saturates the worker queue and blocks normal tasks from running.


I think a better way to handle this would be either to mark the user's task as failed, or to have the scheduler rate-limit requeued tasks, so that the cluster is left unaffected by user errors like this.
 


> Problematic DAG cause worker queue saturated
> --------------------------------------------
>
>                 Key: AIRFLOW-1329
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1329
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: Zhen Zhang
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
