airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ash Berlin-Taylor <...@apache.org>
Subject Re: Removal of "run_duration" and its impact on orphaned tasks
Date Wed, 31 Jul 2019 17:10:22 GMT
Thanks for testing this out James, shame to discover we still have problems in that area. Do
you have an idea of how many tasks per day we are talking about here?

Your cluster schedules quite a large number of tasks over the day (in the 1k-10k range?) right?

I'd say whatever causes a task to become orphaned _while_ the scheduler is still running is
the actual bug, and running the orphan detection more often may just be replacing one patch
(the run duration) with another one (running the orphan detection more than at start up).

-ash

> On 31 Jul 2019, at 16:43, James Meickle <jmeickle@quantopian.com.INVALID> wrote:
> 
> In my testing of 1.10.4rc3, I discovered that we were getting hit by a
> process leak bug (which Ash has since fixed in 1.10.4rc4). This process
> leak was minimal impact for most users, but was exacerbated in our case by
> using "run_duration" to restart the scheduler every 10 minutes.
> 
> To mitigate that issue while remaining on the RC, we removed the use of
> "run_duration", since it is deprecated as of master anyways:
> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
> 
> Unfortunately, testing on our cluster (1.10.4rc3 plus no "run_duration")
> has revealed that while the process leak issue was mitigated, that we're
> now facing issues with orphaned tasks. These tasks are marked as
> "scheduled" by the scheduler, but _not_ successfully queued in Celery even
> after multiple scheduler loops. Around ~24h after last restart, we start
> having enough stuck tasks that the system starts paging and requires a
> manual restart.
> 
> Rather than generic "scheduler instability", this specific issue was one of
> the reasons why we'd originally added the scheduler restart. But it appears
> that on master, the orphaned task detection code still only runs on
> scheduler start despite removing "run_duration":
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
> 
> Rather than immediately filing an issue I wanted to inquire a bit more
> about why this orphan detection code is only run on scheduler start,
> whether it would be safe to send in a PR to run it more often (maybe a
> tunable parameter?), and if there's some other configuration issue with
> Celery (in our case, backed by AWS Elasticache) that would cause us to see
> orphaned tasks frequently.


Mime
View raw message