airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vijay Krishna Ramesh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-931) LocalExecutor fails to run queued task with race condition
Date Thu, 02 Mar 2017 00:05:45 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891343#comment-15891343
] 

Vijay Krishna Ramesh commented on AIRFLOW-931:
----------------------------------------------

[~bolke] I added a bunch of manual logging to verify this. What seems to happen (in my simple
gist example) is DAG starts with 4 tasks. The multiple scheduler processes kick them off (the
jobs.py:1095) because at that time they are all runnable (since none are yet running). That
models.py:1291 does another check to make sure it is actually runnable (2 of them are, but
then due to the dag concurrency of 2, the third and fourth one aren't). The 2 tasks stay marked
as queued but never actually get picked up by the scheduler again (unless you restart it).
Is there some other state that that models.py:1291 check could move the tasks too if that
last minute check found they aren't actually runnable? (I found by making them State.NONE
it worked, but that seems hackish as it keeps bouncing back and forth between QUEUED and NONE
until it can actually run)

> LocalExecutor fails to run queued task with race condition
> ----------------------------------------------------------
>
>                 Key: AIRFLOW-931
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-931
>             Project: Apache Airflow
>          Issue Type: Bug
>    Affects Versions: Airflow 1.8, 1.8.0rc4
>            Reporter: Vijay Krishna Ramesh
>
> https://gist.github.com/vijaykramesh/707262c83429ab2a3f5ee701879813e3 provides a small
example that consistently hits this problem with LocalExecutor.
> Basically when the dag run kicks off (with concurrency > 1) and a LocalExecutor with
parallelism > 2 the scheduler marks more than concurrency tasks as queued (https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L1095)
> There is a second check before actually running the task (https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1291)
that leaves the task in the QUEUED state but then the scheduler never picks it back up.  This
causes the DAG to get stuck (as the queued tasks never run) until the scheduler is restarted
(at which point the enqueued tasks are considered orphaned, the status is set to NONE, and
then they are picked up by the scheduler again and run.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message