airflow-commits mailing list archives

From "Jeff Liu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRFLOW-955) job failed to execute tasks
Date Wed, 08 Mar 2017 03:20:37 GMT
Jeff Liu created AIRFLOW-955:
--------------------------------

             Summary: job failed to execute tasks
                 Key: AIRFLOW-955
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-955
             Project: Apache Airflow
          Issue Type: Bug
            Reporter: Jeff Liu


I have recently run into a very strange issue: the job consistently fails to run tasks,
or hangs in a running state without executing any tasks.

My dag structure has four levels: the main DAG -> a subdag (level 1) -> a subdag (level 2)
-> 5 tasks.
The log shows that the level-2 subdag sometimes fails to execute tasks, but when I look
at the logs for those tasks, I don't see any record that they were ever executed (for the
matching ds date).
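For reference, the four-level structure described above corresponds to a DAG file shaped roughly like this (a minimal sketch, assuming 2017-era Airflow 1.x imports; all dag_ids and task_ids here are invented placeholders, not the actual production code):

```python
# Hypothetical sketch of the four-level nesting described above
# (main DAG -> level-1 subdag -> level-2 subdag -> leaf tasks).
# All dag_ids and task_ids are placeholders, not the real DAG file.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

default_args = {'owner': 'airflow', 'start_date': datetime(2017, 3, 1)}

main_dag = DAG('main_dag', default_args=default_args,
               schedule_interval='@daily')

# Level-1 subdag: its dag_id must be '<parent_dag_id>.<task_id>'.
level1 = DAG('main_dag.level1', default_args=default_args)
SubDagOperator(task_id='level1', subdag=level1, dag=main_dag)

# Level-2 subdag nested inside level 1, holding the five leaf tasks.
level2 = DAG('main_dag.level1.level2', default_args=default_args)
SubDagOperator(task_id='level2', subdag=level2, dag=level1)
for task_id in ['task_%d' % i for i in range(1, 6)]:
    DummyOperator(task_id=task_id, dag=level2)
```

Note how the `parent.child` dag_id convention produces the very long compound dag_ids visible in the logs below: each nesting level prepends the parent's full dag_id.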

Here are some logs from a level-2 subdag example:
{noformat}
--------------------------------------------------------------------------------
Attempt 1 out of 2
--------------------------------------------------------------------------------

[2017-03-07 06:45:55,230] {models.py:1041} INFO - Executing <Task(SubDagOperator): aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units>
on 2017-03-01 16:00:00
[2017-03-07 06:45:55,462] {base_executor.py:34} INFO - Adding to queue: airflow run aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_pr
od_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units mysql_to_gcs 2017-03-01T16:00:00
--local -s 2017-03-01T16:00:00 -sd DAGS_FOLDER/aws_prod_s
tore_hybris_to_bq.py
[2017-03-07 06:46:00,017] {celery_executor.py:62} INFO - [celery] queuing (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_stor
e_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'mysql_to_gcs', datetime.datetime(2017,
3, 1, 16, 0)) through celery, queue=default
[2017-03-07 06:46:00,169] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:05,094] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:10,015] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:15,023] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:20,020] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:25,031] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:30,032] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:35,034] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:40,015] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:45,022] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:50,023] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:46:55,018] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:47:00,027] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:47:05,033] {jobs.py:813} INFO - [backfill progress] waiting: 6 | succeeded:
0 | kicked_off: 1 | failed: 0 | wont_run: 0
[2017-03-07 06:47:10,043] {base_executor.py:34} INFO - Adding to queue: airflow run aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_pr
od_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units gcs_to_bq 2017-03-01T16:00:00
--local -s 2017-03-01T16:00:00 -sd DAGS_FOLDER/aws_prod_stor
e_hybris_to_bq.py
[2017-03-07 06:47:15,023] {celery_executor.py:62} INFO - [celery] queuing (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_stor
e_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'gcs_to_bq', datetime.datetime(2017,
3, 1, 16, 0)) through celery, queue=default
[2017-03-07 06:47:15,035] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:20,073] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:25,025] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:30,029] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:35,020] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:40,025] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:45,019] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:50,025] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:47:55,020] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:48:00,018] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:48:05,045] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:48:10,032] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:48:15,031] {jobs.py:813} INFO - [backfill progress] waiting: 5 | succeeded:
1 | kicked_off: 2 | failed: 0 | wont_run: 0
[2017-03-07 06:48:15,078] {base_executor.py:34} INFO - Adding to queue: airflow run aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_pr
od_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units update_bq_s_view 2017-03-01T16:00:00
--local -s 2017-03-01T16:00:00 -sd DAGS_FOLDER/aws_pr
od_store_hybris_to_bq.py
[2017-03-07 06:48:15,114] {base_executor.py:34} INFO - Adding to queue: airflow run aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_pr
od_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units set_bq_table_expiration
2017-03-01T16:00:00 --local -s 2017-03-01T16:00:00 -sd DAGS_FOLDER
/aws_prod_store_hybris_to_bq.py
[2017-03-07 06:48:15,146] {base_executor.py:34} INFO - Adding to queue: airflow run aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_pr
od_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units update_bq_view 2017-03-01T16:00:00
--local -s 2017-03-01T16:00:00 -sd DAGS_FOLDER/aws_prod
_store_hybris_to_bq.py
[2017-03-07 06:48:15,194] {base_executor.py:34} INFO - Adding to queue: airflow run aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_pr
od_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units remove_tmp_files 2017-03-01T16:00:00
--local -s 2017-03-01T16:00:00 -sd DAGS_FOLDER/aws_pr
od_store_hybris_to_bq.py
[2017-03-07 06:48:20,028] {celery_executor.py:62} INFO - [celery] queuing (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_stor
e_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'set_bq_table_expiration',
datetime.datetime(2017, 3, 1, 16, 0)) through celery, queue=default
[2017-03-07 06:48:20,035] {celery_executor.py:62} INFO - [celery] queuing (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_stor
e_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_view', datetime.datetime(2017,
3, 1, 16, 0)) through celery, queue=default
[2017-03-07 06:48:20,037] {celery_executor.py:62} INFO - [celery] queuing (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_stor
e_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'remove_tmp_files', datetime.datetime(2017,
3, 1, 16, 0)) through celery, queue=default
[2017-03-07 06:48:20,039] {celery_executor.py:62} INFO - [celery] queuing (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_stor
e_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_s_view', datetime.datetime(2017,
3, 1, 16, 0)) through celery, queue=default
[2017-03-07 06:48:20,050] {jobs.py:813} INFO - [backfill progress] waiting: 4 | succeeded:
2 | kicked_off: 6 | failed: 0 | wont_run: 0
[2017-03-07 06:48:25,242] {jobs.py:813} INFO - [backfill progress] waiting: 4 | succeeded:
2 | kicked_off: 6 | failed: 0 | wont_run: 0
[2017-03-07 06:48:30,050] {jobs.py:813} INFO - [backfill progress] waiting: 4 | succeeded:
2 | kicked_off: 6 | failed: 0 | wont_run: 0
[2017-03-07 06:48:35,042] {jobs.py:813} INFO - [backfill progress] waiting: 4 | succeeded:
2 | kicked_off: 6 | failed: 0 | wont_run: 0
[2017-03-07 06:48:40,058] {jobs.py:813} INFO - [backfill progress] waiting: 4 | succeeded:
2 | kicked_off: 6 | failed: 0 | wont_run: 0
[2017-03-07 06:48:45,047] {jobs.py:813} INFO - [backfill progress] waiting: 4 | succeeded:
2 | kicked_off: 6 | failed: 0 | wont_run: 0
[2017-03-07 06:48:50,070] {jobs.py:813} INFO - [backfill progress] waiting: 4 | succeeded:
2 | kicked_off: 6 | failed: 0 | wont_run: 0
[2017-03-07 06:48:55,055] {jobs.py:776} ERROR - Task instance (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mys
ql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_view', datetime.datetime(2017,
3, 1, 16, 0)) failed
[2017-03-07 06:48:55,055] {jobs.py:813} INFO - [backfill progress] waiting: 3 | succeeded:
2 | kicked_off: 6 | failed: 1 | wont_run: 0
[2017-03-07 06:49:00,060] {jobs.py:776} ERROR - Task instance (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mys
ql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'remove_tmp_files', datetime.datetime(2017,
3, 1, 16, 0)) failed
[2017-03-07 06:49:00,060] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:05,039] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:10,038] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:15,059] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:20,044] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:25,044] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:30,033] {jobs.py:799} ERROR - The airflow run command failed at reporting
an error. This should not occur in normal circumstances. Task stat
e is 'None',reported state is 'success'. TI is <TaskInstance: aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_
to_bq_dag.mysql_proxy_prod_store_hybris_units.update_bq_s_view 2017-03-01 16:00:00 [None]>
[2017-03-07 06:49:30,034] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:30,033] {jobs.py:799} ERROR - The airflow run command failed at reporting
an error. This should not occur in normal circumstances. Task stat
e is 'None',reported state is 'success'. TI is <TaskInstance: aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_
to_bq_dag.mysql_proxy_prod_store_hybris_units.update_bq_s_view 2017-03-01 16:00:00 [None]>
[2017-03-07 06:49:30,034] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:30,034] {jobs.py:813} INFO - [backfill progress] waiting: 2 | succeeded:
2 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:30,087] {base_executor.py:34} INFO - Adding to queue: airflow run aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_pr
od_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units update_bq_s_view 2017-03-01T16:00:00
--local -s 2017-03-01T16:00:00 -sd DAGS_FOLDER/aws_pr
od_store_hybris_to_bq.py
[2017-03-07 06:49:35,042] {celery_executor.py:62} INFO - [celery] queuing (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_stor
e_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_s_view', datetime.datetime(2017,
3, 1, 16, 0)) through celery, queue=default
[2017-03-07 06:49:35,049] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded:
3 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:40,098] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded:
3 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:45,030] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded:
3 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:50,031] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded:
3 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:49:55,017] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded:
3 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:50:00,033] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded:
3 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:50:05,027] {jobs.py:813} INFO - [backfill progress] waiting: 1 | succeeded:
3 | kicked_off: 6 | failed: 2 | wont_run: 0
[2017-03-07 06:50:10,034] {jobs.py:776} ERROR - Task instance (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mys
ql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_s_view', datetime.datetime(2017,
3, 1, 16, 0)) failed
[2017-03-07 06:50:10,034] {jobs.py:813} INFO - [backfill progress] waiting: 0 | succeeded:
3 | kicked_off: 6 | failed: 3 | wont_run: 0
[2017-03-07 06:50:10,035] {models.py:1108} ERROR - ------------------------------------------
Some tasks instances failed, here's the list:
[(u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units',
'update_bq
_view', datetime.datetime(2017, 3, 1, 16, 0)), (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.my
sql_proxy_prod_store_hybris_units', 'remove_tmp_files', datetime.datetime(2017, 3, 1, 16,
0)), (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store
_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_s_view',
datetime.datetime(2017, 3, 1, 16, 0))]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1067, in run
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python2.7/dist-packages/airflow/operators/subdag_operator.py", line
45, in execute
    executor=self.executor)
  File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 2603, in run
    job.run()
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 156, in run
    self._execute()
  File "/usr/local/lib/python2.7/dist-packages/airflow/jobs.py", line 822, in _execute
    raise AirflowException(msg)
AirflowException: ------------------------------------------
Some tasks instances failed, here's the list:
[(u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units',
'update_bq
_view', datetime.datetime(2017, 3, 1, 16, 0)), (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.my
sql_proxy_prod_store_hybris_units', 'remove_tmp_files', datetime.datetime(2017, 3, 1, 16,
0)), (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store
_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_s_view',
datetime.datetime(2017, 3, 1, 16, 0))]
[2017-03-07 06:50:10,063] {models.py:1144} ERROR - ------------------------------------------
Some tasks instances failed, here's the list:
[(u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units',
'update_bq
_view', datetime.datetime(2017, 3, 1, 16, 0)), (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.my
sql_proxy_prod_store_hybris_units', 'remove_tmp_files', datetime.datetime(2017, 3, 1, 16,
0)), (u'aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store
_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units', 'update_bq_s_view',
datetime.datetime(2017, 3, 1, 16, 0))]
{noformat}

One error that stands out is:
{noformat}
ERROR - The airflow run command failed at reporting an error. This should not occur in normal circumstances. Task state is 'None', reported state is 'success'. TI is <TaskInstance: aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units.update_bq_s_view 2017-03-01 16:00:00 [None]>
{noformat}
but it doesn't say what could have gone wrong.
In this example, the job
aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris.aws_prod_store_hybris_mysql_to_bq_dag.mysql_proxy_prod_store_hybris_units
failed to run the "remove_tmp_files" task, but when I open Browse -> Task Instances in the Airflow UI
and search for that dag's task instances, I don't find "remove_tmp_files" for that date at all.
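The two observations seem to fit together: the "reported state is 'success'" error is raised when the state recorded for the task instance in the metadata database disagrees with the state the `airflow run` subprocess reported back, and here the database state is still None, i.e. no record of the task ever running. A minimal stand-alone illustration of that kind of consistency check (not the actual jobs.py code; the function name is invented):

```python
# Illustrative re-creation of the kind of state-consistency check that
# logs "Task state is 'None', reported state is 'success'".
# This is NOT the actual Airflow jobs.py code; the name is invented.

def check_reported_state(db_state, reported_state):
    """Return a warning string when the state recorded in the metadata
    DB and the state reported by the `airflow run` subprocess disagree;
    return None when they match."""
    if db_state == reported_state:
        return None
    return ("Task state is %r, reported state is %r"
            % (db_state, reported_state))

# The situation from the log: the subprocess reported 'success', but the
# task instance row in the metadata DB has no state at all (None),
# i.e. the worker never actually recorded running the task.
print(check_reported_state(None, 'success'))
```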

Screenshot attached.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
