airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "EKC (Erik Cederstrand)" <>
Subject Re: Failing jobs not properly terminated
Date Tue, 28 Feb 2017 15:57:59 GMT
Here's a follow-up with more information.

I think my suspicion is supported by the Airflow database. It has the task marked as failed
in the task_instance table, and the PID matches the dead job:

airflow=> select job_id, task_id, dag_id, state, pid from task_instance where job_id=127022;
 job_id | task_id | dag_id | state  |  pid
 127022 | my_task | my_dag | failed | 146117
(1 row)

But in the jobs table, the job is apparently still running and being heartbeated by something:

airflow=> select id, state, start_date, latest_heartbeat from job where id=127022;
   id   |  state  |         start_date         | latest_heartbeat
 127022 | running | 2017-02-22 14:11:21.447172 | 2017-02-28 16:33:42.254843
(1 row)

But the job logfile clearly stated that it was "taking the pill". Since the job is never marked
as failed, the query in DagBag.kill_zombies() (
does not find the job and it is never killed<>.

Kind regards,

From: EKC (Erik Cederstrand)
Sent: Tuesday, February 28, 2017 4:10:04 PM
Subject: Failing jobs not properly terminated

Hi folks,

We're experiencing an issue where a failing DAG is not properly terminated. The log file contains
a Python stack trace of the failure and then ends with:

    {} INFO - Running: ['bash', '-c', 'airflow run my_dag my_task 2017-02-22T00:00:00
--job_id 127022 --raw -sd DAGS_FOLDER/']


    {} WARNING - State of this instance has been externally set to failed. Taking
the poison pill. So long.

But the Celery process of the worker is never terminated:

    $ ps aux | grep 127022
    airflow  [...] python3 airflow run my_dag my_task 2017-02-22T00:00:00 --job_id 127022
--raw -sd DAGS_FOLDER/

Somehow, the scheduler does not see the job as failed, so the Celery queue quickly fills up
with dead jobs and nothing else gets worked on.

I had a look at, and I have a suspicion that the self.terminating flag is never persisted
to the database. See
Is this correct? Everything else I can see in that file does a session.merge() to persist

Kind regards,

Erik Cederstrand

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message