airflow-commits mailing list archives

From "Brent Driskill (Jira)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-6134) Scheduler hanging every 45 minutes, workers hanging on every job
Date Fri, 06 Dec 2019 16:45:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989963#comment-16989963 ]

Brent Driskill commented on AIRFLOW-6134:
-----------------------------------------

Just as an update: we are investigating this as an issue with Amazon Aurora Serverless. We copied the data to a new DB and the problem stopped. There may still be an Airflow issue here, since connections were hanging indefinitely instead of timing out, but at least we have a workaround for the moment.
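In case it helps anyone else hitting this: while we dig into Aurora, a possible stopgap is to put an explicit libpq connect_timeout on the metadata DB URL, so a stalled connection attempt errors out quickly instead of blocking the scheduler forever. A rough sketch; the host, user, and database below are placeholders, not our real settings:

```python
# Sketch: append libpq's connect_timeout to the SQLAlchemy URL that goes in
# airflow.cfg under [core] sql_alchemy_conn. With this set, psycopg2 gives up
# on a stalled TCP connect after 10 seconds instead of hanging indefinitely.
# The host, user, and database here are placeholders.
base_url = "postgresql+psycopg2://airflow:PASSWORD@db-host:5432/airflow"
url_with_timeout = base_url + "?connect_timeout=10"
print(url_with_timeout)
```

The same parameter can also be passed programmatically via SQLAlchemy's create_engine(..., connect_args={"connect_timeout": 10}). Note this only bounds the initial connect, not queries on an already-established connection.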

> Scheduler hanging every 45 minutes, workers hanging on every job
> ----------------------------------------------------------------
>
>                 Key: AIRFLOW-6134
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6134
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler, worker
>    Affects Versions: 1.10.6
>            Reporter: Brent Driskill
>            Assignee: Ash Berlin-Taylor
>            Priority: Major
>
> We have been running Airflow successfully for the past few months. However, starting on the morning of 11/27, the scheduler hung for an unknown reason. After restarting it, it continued to hang every 30-45 minutes. We have temporarily implemented a health check that restarts it at this interval, but the scheduler remains unreliable.
> The last logs during the hang are the following (this is just logged over and over; I assume this is from another thread):
> {code:java}
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1180} DEBUG - 0/2 DAG parsing processes running
> 17:56:44[2019-11-30 17:56:44,048] {dag_processing.py:1183} DEBUG - 0 file paths queued for processing
> 17:56:44[2019-11-30 17:56:44,049] {dag_processing.py:1246} DEBUG - Queuing the following files for processing:
> {code}
> The last logs before that loop were the following:
> {code:java}
> 17:56:38[2019-11-30 17:56:38,450] {settings.py:277} DEBUG - Disposing DB connection pool (PID 2232)
> 17:56:39[2019-11-30 17:56:39,036] {scheduler_job.py:267} DEBUG - Waiting for <Process(DagFileProcessor493-Process, stopped)>
> 17:56:39[2019-11-30 17:56:39,036] {dag_processing.py:1162} DEBUG - Processor for <omitted> finished
> {code}
> Taking a py-spy dump of the running process, I see it is hung at the following place:
> {code:java}
> Thread 4566 (idle): "MainThread"    
> connect (psycopg2/__init__.py:130)    
> connect (sqlalchemy/engine/default.py:482)    
> connect (sqlalchemy/engine/strategies.py:114)    
> __connect (sqlalchemy/pool/base.py:639)    
> __init__ (sqlalchemy/pool/base.py:437)    
> _create_connection (sqlalchemy/pool/base.py:308)    
> _do_get (sqlalchemy/pool/impl.py:136)    
> checkout (sqlalchemy/pool/base.py:492)    
> _checkout (sqlalchemy/pool/base.py:760)    
> connect (sqlalchemy/pool/base.py:363)    
> _wrap_pool_connect (sqlalchemy/engine/base.py:2276)    
> _contextual_connect (sqlalchemy/engine/base.py:2242)    
> _optional_conn_ctx_manager (sqlalchemy/engine/base.py:2040)    
> __enter__ (contextlib.py:112)    
> _run_visitor (sqlalchemy/engine/base.py:2048)    
> create_all (sqlalchemy/sql/schema.py:4316)    
> prepare_models (celery/backends/database/session.py:54)    
> session_factory (celery/backends/database/session.py:59)    
> ResultSession (celery/backends/database/__init__.py:99)    
> _get_task_meta_for (celery/backends/database/__init__.py:122)    
> _inner (celery/backends/database/__init__.py:53)    
> get_task_meta (celery/backends/base.py:386)    
> _get_task_meta (celery/result.py:412)    
> state (celery/result.py:473)    
> fetch_celery_task_state (airflow/executors/celery_executor.py:106)    
> mapstar (multiprocessing/pool.py:44)    
> worker (multiprocessing/pool.py:121)    
> run (multiprocessing/process.py:99)    
> _bootstrap (multiprocessing/process.py:297)    
> _launch (multiprocessing/popen_fork.py:74)    
> __init__ (multiprocessing/popen_fork.py:20)    
> _Popen (multiprocessing/context.py:277)    
> start (multiprocessing/process.py:112)    
> _repopulate_pool (multiprocessing/pool.py:241)    
> __init__ (multiprocessing/pool.py:176)    
> Pool (multiprocessing/context.py:119)    
> sync (airflow/executors/celery_executor.py:245)    
> heartbeat (airflow/executors/base_executor.py:136)    
> _execute_helper (airflow/jobs/scheduler_job.py:1445)    
> _execute (airflow/jobs/scheduler_job.py:1356)    
> run (airflow/jobs/base_job.py:222)    
> scheduler (airflow/bin/cli.py:1042)    
> wrapper (airflow/utils/cli.py:74)
> <module> (airflow:37)
> {code}
> We are using Postgres as our results_backend, with the CeleryExecutor.
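For anyone wanting to capture the same kind of trace: the stack above came from a py-spy dump of the scheduler process. py-spy attaches from outside the target, so it works even when the process is completely stuck. A sketch of the invocation; 4566 is the scheduler PID from the trace above, substitute your own:

```shell
# Dump the current Python stack of a running (possibly hung) process.
# Install with: pip install py-spy
# sudo may be needed to attach to a process owned by another user.
sudo py-spy dump --pid 4566
```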



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
