airflow-dev mailing list archives

From "EKC (Erik Cederstrand)" <...@novozymes.com>
Subject Re: Failing jobs not properly terminated
Date Tue, 28 Feb 2017 15:57:59 GMT
Here's a follow-up with more information.

My suspicion seems to be supported by the Airflow database. The task_instance table has the task marked as failed,
and the PID matches the dead job:

airflow=> select job_id, task_id, dag_id, state, pid from task_instance where job_id=127022;
 job_id | task_id | dag_id | state  |  pid
--------+---------+--------+--------+--------
 127022 | my_task | my_dag | failed | 146117
(1 row)

But in the job table, the job is apparently still running and being heartbeated by something:

airflow=> select id, state, start_date, latest_heartbeat from job where id=127022;
   id   |  state  |         start_date         | latest_heartbeat
--------+---------+----------------------------+---------------------------
 127022 | running | 2017-02-22 14:11:21.447172 | 2017-02-28 16:33:42.254843
(1 row)


But the job logfile clearly stated that it was taking the poison pill. Since the job is never marked
as failed, the query in DagBag.kill_zombies() (https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L330)
does not find the job, and it is never killed.
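
To spell out my reading of that check: it only catches task instances whose job row has stopped running or
stopped heartbeating, and this job row does neither. The sketch below is a paraphrase of the condition as I
understand it, not the literal query from models.py, and the threshold name is made up:

    # Paraphrase of the zombie condition in DagBag.kill_zombies(), as I read it.
    # Not the literal query; `zombie_threshold` is a made-up name for the cutoff.
    from datetime import datetime, timedelta

    from sqlalchemy import or_

    from airflow import settings
    from airflow.jobs import BaseJob
    from airflow.models import TaskInstance
    from airflow.utils.state import State

    session = settings.Session()
    zombie_threshold = timedelta(minutes=5)  # assumption; configurable in reality
    limit_dttm = datetime.now() - zombie_threshold

    candidates = (
        session.query(TaskInstance)
        .join(BaseJob, TaskInstance.job_id == BaseJob.id)
        .filter(or_(BaseJob.state != State.RUNNING,
                    BaseJob.latest_heartbeat < limit_dttm))
        .all()
    )
    # Job 127022 never shows up here: its job row is still 'running' and its
    # latest_heartbeat keeps advancing, even though the task_instance row is
    # failed and the raw task process is dead.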

Kind regards,
Erik


________________________________
From: EKC (Erik Cederstrand)
Sent: Tuesday, February 28, 2017 4:10:04 PM
To: dev@airflow.incubator.apache.org
Subject: Failing jobs not properly terminated


Hi folks,


We're experiencing an issue where a failing DAG is not properly terminated. The log file contains
a Python stack trace of the failure and then ends with:


    {base_task_runner.py:112} INFO - Running: ['bash', '-c', 'airflow run my_dag my_task 2017-02-22T00:00:00 --job_id 127022 --raw -sd DAGS_FOLDER/my_dag.py']

          [...]

    {jobs.py:2127} WARNING - State of this instance has been externally set to failed. Taking the poison pill. So long.


But the Celery process of the worker is never terminated:


    $ ps aux | grep 127022
    airflow  [...] python3 airflow run my_dag my_task 2017-02-22T00:00:00 --job_id 127022 --raw -sd DAGS_FOLDER/my_dag.py


Somehow, the scheduler does not see the job as failed, so the Celery queue quickly fills up
with dead jobs and nothing else gets worked on.


I had a look at jobs.py, and I suspect that the self.terminating flag is never persisted to the
database (see https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2122).
Is this correct? Everything else I can see in that file does a session.merge() to persist
values.
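
To make the suggestion concrete, this is roughly the kind of change I have in mind, following the
session.merge() pattern from the rest of the file. It is a sketch only; I am not even sure terminating is a
mapped column, which is part of my question, and marking the job row as failed is my assumption rather than
something the current code does:

    # Sketch of the persistence step I suspect is missing after the poison pill.
    from airflow import settings
    from airflow.utils.state import State

    def persist_poison_pill(job):
        session = settings.Session()
        job.terminating = True      # currently only set in memory, as far as I can tell
        job.state = State.FAILED    # assumption: the job row should leave 'running'
        session.merge(job)          # same pattern the rest of jobs.py uses
        session.commit()
        session.close()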


Kind regards,

Erik Cederstrand

