airflow-commits mailing list archives

From "Shreyas Joshi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRFLOW-1555) Backfill job gets killed 1 hour after starting
Date Thu, 31 Aug 2017 21:36:00 GMT
Shreyas Joshi created AIRFLOW-1555:
--------------------------------------

             Summary: Backfill job gets killed 1 hour after starting
                 Key: AIRFLOW-1555
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1555
             Project: Apache Airflow
          Issue Type: Bug
          Components: backfill
    Affects Versions: 1.8.1
         Environment: Airflow 1.8.1
Celery 3.1.23 with one coordinator, redis and 3 workers
Python 3.5.2
Debian GNU/Linux 8.9 (jessie)
snakebite uninstalled because it does not work with Python 3.5.2
MySQL 5.6
            Reporter: Shreyas Joshi


*What happens?*

After running for about an hour, tasks in the backfill die. The task logs show:


{code}
...
[2017-08-31 06:48:06,425] {jobs.py:2172} WARNING - Recorded pid 5451 is not a descendant of the current pid 21571
[2017-08-31 06:48:11,884] {jobs.py:2179} WARNING - State of this instance has been externally set to failed. Taking the poison pill. So long.
[2017-08-31 06:48:11,892] {helpers.py:220} WARNING - Terminating descendant processes of [<REDACTED>] PID: 5451
[2017-08-31 06:48:11,892] {helpers.py:224} WARNING - Terminating descendant process [<REDACTED>] PID: 5459
[2017-08-31 06:48:11,896] {helpers.py:231} WARNING - Waiting up to 5s for processes to exit...
...
{code}
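Those warnings describe the worker-side "poison pill": on each heartbeat, the local task job checks that the pid recorded in the metadata DB still belongs to its own process tree, re-reads the task instance state, and terminates its descendants if something external has flipped the state to failed. A minimal sketch of that check, with hypothetical names (`heartbeat_callback`, `is_descendant`, and the duck-typed `task_instance` are illustrations, not the 1.8.1 source):

{code}
import logging
import os

import psutil  # used here for process-tree inspection; an assumption, not necessarily what 1.8.1 uses

log = logging.getLogger(__name__)


def is_descendant(pid, ancestor_pid):
    """True if `pid` sits below `ancestor_pid` in the process tree."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return False
    return any(parent.pid == ancestor_pid for parent in proc.parents())


def heartbeat_callback(task_instance, recorded_pid):
    # jobs.py:2172 -- the pid stored in the DB must belong to this worker's
    # process tree; a mismatch means another process now owns the task.
    if recorded_pid is not None and not is_descendant(recorded_pid, os.getpid()):
        log.warning("Recorded pid %s is not a descendant of the current pid %s",
                    recorded_pid, os.getpid())

    # jobs.py:2179 -- if the state was set to "failed" externally, terminate
    # this job's descendant processes (the "poison pill" in the log).
    task_instance.refresh_from_db()
    if task_instance.state == "failed":
        log.warning("State of this instance has been externally set to failed. "
                    "Taking the poison pill. So long.")
        for child in psutil.Process(os.getpid()).children(recursive=True):
            child.terminate()  # helpers.py then waits up to 5s for exit
{code}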

The backfill logs show:

{code}
...
[2017-08-31 11:23:44,025] {jobs.py:1729} ERROR - Executor reports task instance <TaskInstance: dag_name.task_name 2017-08-30 02:00:00 [running]> finished (failed) although the task says its running. Was the task killed externally?
[2017-08-31 11:23:44,025] {models.py:1427} ERROR - Executor reports task instance <TaskInstance: analytics_events.page_views 2017-08-30 02:00:00 [running]> finished (failed) although the task says its running. Was the task killed externally?
...
{code}
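This error is the backfill side of the same disagreement: the backfill loop drains the executor's event buffer, and when Celery reports a terminal state for a task instance whose DB row still says "running", it logs this message and fails the instance. A rough sketch of that reconciliation, under the assumption that the names and structure below are illustrative only:

{code}
import logging

log = logging.getLogger(__name__)


def reconcile_executor_state(executor, running_task_instances, session):
    # Drain executor events each loop iteration and compare against the DB.
    for key, executor_state in executor.get_event_buffer().items():
        ti = running_task_instances.get(key)
        if ti is None:
            continue
        ti.refresh_from_db(session=session)
        # The worker's poison pill (previous log excerpt) yields exactly this
        # mismatch: the executor says "failed", the DB row still says "running".
        if executor_state == "failed" and ti.state == "running":
            log.error("Executor reports task instance %s finished (%s) although "
                      "the task says its running. Was the task killed externally?",
                      ti, executor_state)
            ti.state = "failed"
            session.merge(ti)
    session.commit()
{code}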

The Celery UI shows the following exception, but the task status shows "success":

{code}
Traceback (most recent call last):
  File "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py",
line 56, in execute_command
    subprocess.check_call(command, shell=True)
  File "/usr/share/pyenv/versions/3.5.2/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'airflow run dag_name task_name 2017-08-30T02:00:00
--pickle 14 --local' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py", line
240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/data/airflow-sources/.venv/lib/python3.5/site-packages/celery/app/trace.py", line
438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/data/airflow-sources/.venv/lib/python3.5/site-packages/airflow/executors/celery_executor.py",
line 59, in execute_command
    raise AirflowException('Celery command failed')
airflow.exceptions.AirflowException: Celery command failed
{code}
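The traceback also shows how the worker wrapper behaves: `execute_command` shells out to `airflow run ... --local` with `subprocess.check_call`, and any non-zero exit is converted into an `AirflowException`. Reconstructed from the frames above (only the two statements the traceback exposes, padded with the obvious imports):

{code}
import subprocess

from airflow.exceptions import AirflowException


def execute_command(command):
    # celery_executor.py line 56: run `airflow run <dag> <task> <date>
    # --pickle <id> --local` in a shell.
    try:
        subprocess.check_call(command, shell=True)
    except subprocess.CalledProcessError:
        # celery_executor.py line 59: surface the failure to Celery.
        raise AirflowException('Celery command failed')
{code}

So any failure of the underlying `airflow run` process, including the poison pill above, surfaces in Celery as this one generic exception.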

The tasks have timeouts explicitly set to 6 hours and an SLA set to 5 hours. In the course of debugging this I also set dagrun_timeout to 6 hours; it did not make a difference.
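For reference, the timeout configuration described above looks roughly like this (a minimal sketch; the dag/task ids are taken from the log excerpt, while the bash command, schedule, and start date are placeholders):

{code}
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='analytics_events',
    start_date=datetime(2017, 8, 1),       # placeholder start date
    schedule_interval='@daily',            # placeholder schedule
    dagrun_timeout=timedelta(hours=6),     # set while debugging; no effect
)

page_views = BashOperator(
    task_id='page_views',
    bash_command='...',                    # actual command elided
    execution_timeout=timedelta(hours=6),  # the explicit 6-hour task timeout
    sla=timedelta(hours=5),                # the 5-hour SLA
    dag=dag,
)
{code}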

Here is a thread on [Stack Overflow|https://stackoverflow.com/questions/44274381/airflow-long-running-task-in-subdag-marked-as-failed-after-an-hour] that talks about a very similar issue.
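For what it is worth, a usual suspect for a clean one-hour kill with a redis broker is Celery's visibility_timeout, which defaults to 3600 seconds and causes a still-running task to be redelivered to another worker, which then changes its state externally. If that mechanism applies here, raising the timeout in the Celery 3.1 config would look like this (an untested assumption, not a confirmed fix):

{code}
# celeryconfig.py -- hypothetical workaround: raise the redis
# visibility_timeout above the longest expected task runtime so the broker
# does not redeliver a still-running task after the default 3600 seconds.
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 7 * 3600}  # seconds
{code}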

These tasks run fine on our older Airflow 1.7. This is currently blocking our upgrade.



