airflow-commits mailing list archives

From "Robert Kozikowski (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-515) Airflow system is stuck without signs of life
Date Mon, 19 Sep 2016 19:48:21 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Kozikowski updated AIRFLOW-515:
--------------------------------------
    Description: 
h1. Symptoms
Triggering a dag on the web node via trigger_dag does not start it. I triggered the same dag a moment before and it worked, and I did not apply any changes in the meantime.

Very likely a deadlock on the scheduler node locked up the whole system.

Triggering the dag (airflow trigger_dag on the web node) only adds a row to the Postgres dag_run table. The run does not appear in Flower or the main web GUI. The system was stuck for 50 minutes without any signs of life, with no logs coming from the scheduler, worker, or webserver.
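
One way to confirm this state directly in the metadata database is a query along the lines of the sketch below. This is illustrative only: the dag_run/task_instance table and column names are assumed from the Airflow 1.7.x schema, and the connection string is a placeholder.
{noformat}
# Hypothetical diagnostic (not part of Airflow): list dag_run rows that are
# 'running' but have no task instances at all, i.e. runs the scheduler never
# picked up. Assumes the Airflow 1.7.x Postgres schema and psycopg2.
import psycopg2

DSN = "postgresql://airflow:airflow@postgres:5432/airflow"  # placeholder

def orphaned_dag_runs(dsn=DSN):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT dr.dag_id, dr.execution_date
                  FROM dag_run dr
             LEFT JOIN task_instance ti
                    ON ti.dag_id = dr.dag_id
                   AND ti.execution_date = dr.execution_date
                 WHERE dr.state = 'running'
              GROUP BY dr.dag_id, dr.execution_date
                HAVING COUNT(ti.task_id) = 0
            """)
            return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for dag_id, execution_date in orphaned_dag_runs():
        print(dag_id, execution_date)
{noformat}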

In the end, the system got unblocked after force-restarting the scheduler and worker. Is there any way I should health-check my scheduler to avoid a similar situation in production?
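
One possible approach, sketched below, would be to treat a stale SchedulerJob heartbeat in the metadata database as unhealthy and run the check as a Kubernetes exec liveness probe on the scheduler pod. This is not a facility Airflow ships; the jobs table columns (job_type, state, latest_heartbeat), the threshold, and the connection string are assumptions based on the 1.7.x metadata schema and would need checking against the actual deployment (including whether heartbeats are stored in UTC).
{noformat}
# Hypothetical scheduler health check: exit non-zero when the newest
# SchedulerJob heartbeat in the metadata DB is older than a threshold,
# so Kubernetes can restart the scheduler container automatically.
# Assumes the Airflow 1.7.x 'jobs' table (job_type, state, latest_heartbeat).
import sys
from datetime import datetime, timedelta

import psycopg2

DSN = "postgresql://airflow:airflow@postgres:5432/airflow"  # placeholder
MAX_HEARTBEAT_AGE = timedelta(minutes=5)

def scheduler_is_healthy(dsn=DSN):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT MAX(latest_heartbeat) FROM jobs "
                "WHERE job_type = 'SchedulerJob' AND state = 'running'"
            )
            (latest,) = cur.fetchone()
    finally:
        conn.close()
    # Adjust utcnow() vs. now() to however heartbeats are stored.
    return latest is not None and datetime.utcnow() - latest < MAX_HEARTBEAT_AGE

if __name__ == "__main__":
    sys.exit(0 if scheduler_is_healthy() else 1)
{noformat}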

h1. Recent logs
Logs were collected at 19:10, by which time the system had been stuck for almost an hour.
h2. scheduler
{noformat}
/usr/lib/python3.5/site-packages/sqlalchemy/sql/default_comparator.py:153: SAWarning: The IN-predicate on "task_instance.execution_date" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate.  Consider alternative strategies for improved performance.
  'strategies for improved performance.' % expr)
[2016-09-19 18:15:41,171] {jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
[2016-09-19 18:15:41,172] {jobs.py:744} INFO - Loop took: 0.37502 seconds
[2016-09-19 18:15:41,189] {models.py:305} INFO - Finding 'running' jobs without a recent heartbeat
[2016-09-19 18:15:41,189] {models.py:311} INFO - Failing jobs without heartbeat after 2016-09-19 18:13:26.189579
[2016-09-19 18:15:41,212] {celery_executor.py:64} INFO - [celery] queuing ('repos_to_process_dag', 'update_repos_to_process', datetime.datetime(2016, 9, 19, 18, 15, 35, 200082)) through celery, queue=default
[2016-09-19 18:15:45,817] {jobs.py:574} INFO - Prioritizing 0 queued jobs
[2016-09-19 18:15:45,831] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:15:45,881] {jobs.py:726} INFO - Starting 2 scheduler jobs
{noformat}
h2. web
{noformat}
[2016-09-19 18:10:12,213] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,597] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,663] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,790] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:13,210] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:10:13,742] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:10:13,815] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:10:13,912] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
{noformat}
h2. worker
{noformat}
Logging into: /usr/local/airflow/logs/repos_to_process_dag/update_repos_to_process/2016-09-19T18:15:35.200082
{noformat}


  was:
h1. Symptoms
Triggering a dag on the web node via trigger_dag does not start it. Triggering the same dag worked a moment before and I did not apply any config changes in the meantime, so it is likely not an issue with the dag setup. Very likely a deadlock on the scheduler node locked up the whole system.

Triggering the dag (airflow trigger_dag on the web node) only adds a row to the Postgres dag_run table. The run does not appear in Flower or the main web GUI. The system was stuck for 50 minutes without any signs of life, with no logs coming from the scheduler, worker, or webserver.

In the end, the system got unblocked after force-restarting the scheduler and worker. Is there any way I should health-check my scheduler to avoid a similar situation in production?

h1. Recent logs
Logs were collected at 19:10, by which time the system had been stuck for almost an hour.
h2. scheduler
{noformat}
/usr/lib/python3.5/site-packages/sqlalchemy/sql/default_comparator.py:153: SAWarning: The IN-predicate on "task_instance.execution_date" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate.  Consider alternative strategies for improved performance.
  'strategies for improved performance.' % expr)
[2016-09-19 18:15:41,171] {jobs.py:741} INFO - Done queuing tasks, calling the executor's heartbeat
[2016-09-19 18:15:41,172] {jobs.py:744} INFO - Loop took: 0.37502 seconds
[2016-09-19 18:15:41,189] {models.py:305} INFO - Finding 'running' jobs without a recent heartbeat
[2016-09-19 18:15:41,189] {models.py:311} INFO - Failing jobs without heartbeat after 2016-09-19 18:13:26.189579
[2016-09-19 18:15:41,212] {celery_executor.py:64} INFO - [celery] queuing ('repos_to_process_dag', 'update_repos_to_process', datetime.datetime(2016, 9, 19, 18, 15, 35, 200082)) through celery, queue=default
[2016-09-19 18:15:45,817] {jobs.py:574} INFO - Prioritizing 0 queued jobs
[2016-09-19 18:15:45,831] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:15:45,881] {jobs.py:726} INFO - Starting 2 scheduler jobs
{noformat}
h2. web
{noformat}
[2016-09-19 18:10:12,213] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,597] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,663] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:12,790] {__init__.py:36} INFO - Using executor CeleryExecutor
[2016-09-19 18:10:13,210] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:10:13,742] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:10:13,815] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2016-09-19 18:10:13,912] {models.py:154} INFO - Filling up the DagBag from /usr/local/airflow/dags
{noformat}
h2. worker
{noformat}
Logging into: /usr/local/airflow/logs/repos_to_process_dag/update_repos_to_process/2016-09-19T18:15:35.200082
{noformat}



> Airflow system is stuck without signs of life
> ---------------------------------------------
>
>                 Key: AIRFLOW-515
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-515
>             Project: Apache Airflow
>          Issue Type: Bug
>         Environment: airflow==1.7.1.3 Dockerfile: https://gist.github.com/kozikow/9290c44f03fba8d0f1299f355ef96b51
> kubernetes config (running within minikube): https://gist.github.com/kozikow/49c812f11d1e4fd43e465e8090619274
>            Reporter: Robert Kozikowski
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
