airflow-commits mailing list archives

From "Stanislav Pak (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-1463) Clear state of pending task when it fails due to DAG import error
Date Wed, 26 Jul 2017 01:07:01 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislav Pak updated AIRFLOW-1463:
-----------------------------------
    Description: 
Our pipeline-related code is deployed almost simultaneously on all Airflow boxes: the
scheduler+webserver box and the worker boxes. A common Python package is deployed on those boxes
on every other code push (3-5 deployments per hour). Due to installation specifics, a DAG that
imports a module from that package might fail to import. If the DAG import fails when a worker
runs a task, the task is still removed from the queue but its state is not changed, so the task
stays in the QUEUED state forever.

Besides the case described, this can also happen because of DAG update lag in the scheduler.
A task can be scheduled with the old DAG, and a worker can then run the task with a new DAG
that fails to import.

There may be other scenarios in which this happens.

Proposal:
Catch errors when importing the DAG on task run, and clear the task instance state if the
import fails. This should fix transient issues of this kind.


  was:
Our pipeline-related code is deployed almost simultaneously on all Airflow boxes: the
scheduler+webserver box and the worker boxes. A common Python package is deployed on those boxes
on every other code push (3-5 deployments per hour). Due to installation specifics, a DAG that
imports a module from that package might fail to import. If the DAG import fails when a worker
runs a task, the task is still removed from the queue but its state is not changed, so the task
stays in the PENDING state forever.

Besides the case described, this can also happen because of DAG update lag in the scheduler.
A task can be scheduled with the old DAG, and a worker can then run the task with a new DAG
that fails to import.

There may be other scenarios in which this happens.

Proposal:
Catch errors when importing the DAG on task run, and clear the task instance state if the
import fails. This should fix transient issues of this kind.



> Clear state of pending task when it fails due to DAG import error
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-1463
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1463
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: cli
>         Environment: Ubuntu 14.04
> Airflow 1.8.0
> SQS backed task queue, AWS RDS backed meta storage
> The DAG folder is synced by a script on code push: an archive is downloaded from S3, unpacked,
and moved, then an install script is run. The airflow executable is replaced with a symlink
pointing to the latest version of the code; no Airflow processes are restarted.
>            Reporter: Stanislav Pak
>            Assignee: Stanislav Pak
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Our pipeline-related code is deployed almost simultaneously on all Airflow boxes: the
scheduler+webserver box and the worker boxes. A common Python package is deployed on those boxes
on every other code push (3-5 deployments per hour). Due to installation specifics, a DAG that
imports a module from that package might fail to import. If the DAG import fails when a worker
runs a task, the task is still removed from the queue but its state is not changed, so the task
stays in the QUEUED state forever.
> Besides the case described, this can also happen because of DAG update lag in the scheduler.
A task can be scheduled with the old DAG, and a worker can then run the task with a new DAG
that fails to import.
> There may be other scenarios in which this happens.
> Proposal:
> Catch errors when importing the DAG on task run, and clear the task instance state if the
import fails. This should fix transient issues of this kind.
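
As a rough illustration of the proposal, the sketch below shows the general shape of "catch the import error and clear the state". It is not Airflow code: TaskInstance, run_task, the dag_module field, and the state constants are hypothetical stand-ins, and a plain module import via importlib stands in for DAG loading. Clearing is modeled as setting the state to None, on the assumption that a task with no state is eligible to be scheduled again.

```python
import importlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical state names used only in this sketch.
QUEUED = "queued"
RUNNING = "running"
SUCCESS = "success"


@dataclass
class TaskInstance:
    """Hypothetical stand-in for a task instance record in the metadata store."""
    dag_module: str                 # dotted path of the module that defines the DAG
    task_id: str
    state: Optional[str] = QUEUED   # the task has already been taken off the queue


def run_task(ti: TaskInstance) -> bool:
    """Run a task on a worker; clear its state if the DAG module cannot be imported."""
    try:
        # Stand-in for loading the DAG definition on the worker.
        importlib.import_module(ti.dag_module)
    except Exception:
        # The import failed (for example, a package was mid-deployment).
        # Clear the state instead of leaving the task stuck in QUEUED.
        ti.state = None
        return False

    ti.state = RUNNING
    # ... the task's own logic would run here ...
    ti.state = SUCCESS
    return True
```

With this shape, a transient import failure leaves the task instance with no state rather than a stale QUEUED one, so a later scheduling pass can pick it up once the deployment has settled.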



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
