airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sergei Iakhnin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-191) Database connection leak on Postgresql backend
Date Tue, 31 May 2016 18:52:12 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308343#comment-15308343
] 

Sergei Iakhnin commented on AIRFLOW-191:
----------------------------------------

You are right about celery. 

I think the root cause is that db connections are not always appropriately disposed of. I.e.
I've made a patch where I close the session and dispose of the SQLAlchemy engine every time,
and this brings the issue under control somewhat, although this can't be the long term solution.

It seems too expensive that job heartbeats should consume as many db connections as there
are cores in the cluster. Maybe they could all send their heartbeats to a service that runs
on the manager node, and the service can manage a modestly sized connection pool?



> Database connection leak on Postgresql backend
> ----------------------------------------------
>
>                 Key: AIRFLOW-191
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-191
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor
>    Affects Versions: Airflow 1.7.1
>            Reporter: Sergei Iakhnin
>
> I raised this issue on github several months ago and there was even a PR but it never
maid it into mainline. Basically, workers tend to hang onto DB connections in Postgres for
recording heartbeat.
> I'm running a cluster with 115 workers, each with 8 slots. My Postgres DB is configured
to allow 1000 simultaneous connections. I should effectively be able to run 920 tasks at the
same time, but am actually limited to only about 450-480 because of idle transactions from
workers hanging on to DB connections.
> If I run the following query
> select count(*),state, client_hostname from pg_stat_activity group by state, client_hostname
> These are the results:
> count state client_hostname
> 1	active	(null)
> 1	idle	localhost
> 451	idle in transaction	(null)
> 446	idle	(null)
> 1	active	localhost
> The idle connections are all trying to run COMMIT
> The "idle in transaction" connections are all trying to run 
> SELECT job.id AS job_id, job.dag_id AS job_dag_id, job.state AS job_state, job.job_type
AS job_job_type, job.start_date AS job_start_date, job.end_date AS job_end_date, job.latest_heartbeat
AS job_latest_heartbeat, job.executor_class AS job_executor_class, job.hostname AS job_hostname,
job.unixname AS job_unixname 
> FROM job 
> WHERE job.id = 213823 
>  LIMIT 1
> with differing job.ids of course.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message