airflow-dev mailing list archives

From Bolke de Bruin <bdbr...@gmail.com>
Subject Re: how to have good DAG+Kubernetes behavior on airflow crash/recovery?
Date Sun, 17 Dec 2017 19:47:11 GMT
Quite important to know is that Airflow’s executors do not keep state after a restart.
This particularly affects distributed executors (Celery, Dask), as the workers are independent
from the scheduler. Thus, at restart we reset all tasks in the queued state that the executor
does not know about, which at the moment means all of them. Due to the distributed nature
of the executors, tasks can still be running. Normally a running task will detect (on the heartbeat
interval) that its state was changed externally and will terminate itself.
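
To make that concrete, here is a minimal sketch of what that heartbeat check amounts to.
The helpers (check_current_state, terminate_task) are hypothetical stand-ins for reading
the task's state from the metadata DB and killing the local process; this is not Airflow's
actual implementation.

    import time

    HEARTBEAT_INTERVAL = 5  # seconds; illustrative value only

    def run_with_heartbeat(task_process, check_current_state, terminate_task):
        # task_process:        handle to the local process doing the work
        # check_current_state: reads this task's state from the metadata DB
        # terminate_task:      kills the local process
        while task_process.is_alive():
            time.sleep(HEARTBEAT_INTERVAL)
            if check_current_state() != "running":
                # State was changed externally (e.g. reset to "queued" after a
                # scheduler restart), so stop instead of running on as an orphan.
                terminate_task(task_process)
                return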

I did some work a few months ago to make the executor keep state across restarts, but never
got around to finishing it.

So at the moment, to prevent requeuing, you need to keep the Airflow scheduler from going down
(as much as possible).
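
On the operator side, one mitigation (a rough sketch only, not something that ships with
Airflow; the kubectl calls and job naming are my assumptions) is to make the launch
idempotent, i.e. reuse an existing Kubernetes Job for the same task run instead of
creating a second one:

    import re
    import subprocess

    def launch_job_idempotently(dag_id, task_id, execution_date, job_spec_path):
        # Deterministic job name for this task run; real code would need stricter
        # sanitisation to satisfy Kubernetes naming rules, and the job spec at
        # job_spec_path is assumed to use the same name.
        job_name = re.sub(r"[^a-z0-9-]", "-",
                          "{}-{}-{}".format(dag_id, task_id, execution_date).lower())
        # If a Job with this name already exists (e.g. left over from before a
        # crash), reuse it instead of starting a duplicate.
        existing = subprocess.run(
            ["kubectl", "get", "job", job_name, "--ignore-not-found", "-o", "name"],
            capture_output=True, text=True, check=True,
        )
        if existing.stdout.strip():
            return job_name
        subprocess.run(["kubectl", "create", "-f", job_spec_path], check=True)
        return job_name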

Bolke.

P.S. I am assuming that you are talking about your scheduler going down, not the workers.

> On 17 Dec 2017, at 20:07, Christopher Bockman <chris@fathomhealth.co> wrote:
> 
> Upon further internal discussion, we might be seeing the task cloning
> because the postgres DB is getting into a corrupted state... but it's unclear.
> If the consensus is that we *shouldn't* be seeing this behavior, even as-is,
> we'll push more on that angle.
> 
> On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman <chris@fathomhealth.co> wrote:
> 
>> Hi all,
>> 
>> We run DAGs, and sometimes Airflow crashes (for whatever reason--maybe
>> something as simple as the underlying infrastructure going down).
>> 
>> Currently, we run everything on Kubernetes (including Airflow), so Airflow
>> pod crashes will generally be detected, and the pods will then restart.
>> 
>> However, if we have, e.g., a DAG that is running task X when it crashes,
>> when Airflow comes back up, it apparently sees task X didn't complete, so
>> it restarts the task (which, in this case, means it spins up an entirely
>> new instance/pod).  Thus, both runs "X_1" and "X_2" are fired off
>> simultaneously.
>> 
>> Is there any (out of the box) way to better connect up state between tasks
>> and Airflow to prevent this?
>> 
>> (For additional context, we currently execute Kubernetes jobs via a custom
>> operator that basically layers on top of BashOperator...perhaps the new
>> Kubernetes operator will help address this?)
>> 
>> Thank you in advance for any thoughts,
>> 
>> Chris
>> 
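
Regarding the new Kubernetes operator mentioned in the question above: a minimal usage
sketch of KubernetesPodOperator (import path as in the 1.x contrib package; the image,
namespace, and IDs are placeholder values). It runs the task as its own pod, though whether
it avoids the duplicate runs described above still depends on the scheduler/metadata-DB
state question discussed in this thread.

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    dag = DAG("k8s_example", start_date=datetime(2017, 12, 1), schedule_interval=None)

    run_x = KubernetesPodOperator(
        task_id="run_task_x",
        name="run-task-x",
        namespace="default",
        image="python:3.6",
        cmds=["python", "-c"],
        arguments=["print('task X')"],
        get_logs=True,
        dag=dag,
    )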

