airflow-dev mailing list archives

From Gerard Toonstra <gtoons...@gmail.com>
Subject Re: Stuck Tasks that don't report status
Date Mon, 07 Aug 2017 20:30:20 GMT
Hi David,

Once a task is put on the MQ, it is out of the scheduler's control.
The scheduler sets the state of that task instance to "queued".

What happens next (sketched in Python below):

1. A worker picks up the task and tries to run it.
2. Before executing, it runs a couple of final checks against the DB to
   verify the task instance should still run (another worker could have
   processed it already, started processing it, etc.).
3. The worker sets the state of the TI to "running".
4. The worker does the work as described in the operator.
5. The worker then updates the database with "failed" or "success".
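
Roughly, in Python (every name below, such as run_task_instance and
session.get_state, is an illustrative stand-in, not Airflow's real API;
the real code is referenced further down):

def run_task_instance(ti, session):
    # Step 2: final checks against the DB right before executing.
    # Another worker may already have started or finished this TI.
    if session.get_state(ti.key) != 'queued':
        return  # someone else picked it up; drop the message

    # Step 3: claim the task by marking it "running" in the DB.
    session.set_state(ti.key, 'running')

    # Step 4: do the work described by the operator.
    try:
        ti.operator.execute(context={})
        final_state = 'success'
    except Exception:
        final_state = 'failed'

    # Step 5: record the outcome. If the container is killed before this
    # write happens, the DB keeps saying "running" until a timeout kicks in.
    session.set_state(ti.key, final_state)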

If you kill the Docker container doing the execution before it has
updated the state to success or failed, you end up in a situation where
a timeout must occur before Airflow can tell whether the task failed or
not. This is because the worker still claims to be processing the
message, but that worker/task got killed.

It is actually the task instance itself that updates the database, so if
you leave that container running, it may well finish and update the DB.


The task results are also communicated back to the executors and there's a
check to see if the results agree.

You can find this code in models.py / TaskInstance / run(), and in
whichever executor you are using under airflow/executors.
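
As a simplified sketch of that agreement check (this mirrors the idea,
not the exact implementation; report_state is a placeholder for feeding
results back to the scheduler):

def report_state(key, state):
    # Placeholder: in Airflow this feeds an event buffer that the
    # scheduler compares against the state recorded in the DB.
    print("task %s: worker reported %s" % (key, state))

def sync_executor_state(running_tasks):
    """running_tasks: dict mapping a TI key to a celery AsyncResult."""
    for key, async_result in list(running_tasks.items()):
        if async_result.state == 'SUCCESS':
            report_state(key, 'success')
            del running_tasks[key]
        elif async_result.state == 'FAILURE':
            report_state(key, 'failed')
            del running_tasks[key]
        # If the worker container was killed mid-task, the Celery result
        # never reaches SUCCESS or FAILURE, so nothing is reported and the
        # TI stays "running" in the DB: the stuck situation in this thread.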


The reason this happens, I think, is that Docker doesn't really care
what's running at the moment; it assumes "services", where interruptions
are acceptable because services are retried all the time anyway. In an
environment like Airflow, there's a persistent backend database that
doesn't automatically retry, because everything is driven through the
scheduler, which only sees a "RUNNING" record in the database.

How to deal with this depends on your situation. If you run only
short-running tasks (up to 5 minutes), you could drain the task queue by
stopping the scheduler first. That way no new messages are sent to the
queue, so after about 10 minutes no tasks should be running on any
worker.
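
If it helps, here is a rough sketch of how you could verify the drain is
complete before replacing the worker hosts, by polling the metadata DB
directly (assumes Postgres and the standard task_instance table):

import time
import psycopg2

def wait_for_drain(dsn, poll_seconds=30):
    # Poll the Airflow metadata DB until no task instances are queued
    # or running; only then replace the worker hosts.
    conn = psycopg2.connect(dsn)
    try:
        while True:
            with conn.cursor() as cur:
                cur.execute("SELECT count(*) FROM task_instance "
                            "WHERE state IN ('queued', 'running')")
                (remaining,) = cur.fetchone()
            if remaining == 0:
                print("Queue drained; safe to replace workers.")
                return
            print("%d task instance(s) still queued/running..." % remaining)
            time.sleep(poll_seconds)
    finally:
        conn.close()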

Another way is to update the database in between, but I'd personally
avoid that as much as you can.
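
For completeness, the kind of manual update meant here would look like
the following (last resort only; resetting the state to NULL lets the
scheduler pick the task up again):

import psycopg2

def reset_stuck_tasks(dsn, dag_id, execution_date):
    # Last-resort intervention: clear task instances stuck in "running"
    # whose worker is known to be gone, so the scheduler can retry them.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("UPDATE task_instance SET state = NULL "
                        "WHERE dag_id = %s AND execution_date = %s "
                        "AND state = 'running'",
                        (dag_id, execution_date))
            print("Reset %d task instance(s)." % cur.rowcount)
        conn.commit()
    finally:
        conn.close()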


Not sure if anyone wants to chime in here on how best to deal with this
in Docker?

Rgds,

Gerard


On Mon, Aug 7, 2017 at 8:21 PM, David Klosowski <davidk@thinknear.com>
wrote:

> Hi Airflow Dev List:
>
> Has anyone had cases where tasks get "stuck"?  What I mean by "stuck" is
> that tasks show as running in the Airflow UI but never actually run
> (and dependent tasks will eventually time out).
>
> This only happens during our deployments, when we replace all the hosts in
> our stack (3 workers and 1 host with the scheduler + webserver + flower)
> with a dockerized deployment.  We've been deploying to the worker hosts
> after the scheduler + webserver + flower host.
>
> It also doesn't occur all the time, which is a bit frustrating to try to
> debug.
>
> We have the following settings:
>
> > celery_result_backend = Postgres
> > sql_alchemy_conn = Postgres
> > broker_url = Redis
> > executor = CeleryExecutor
>
> Any thoughts from anyone regarding known issues or observed problems?  I
> haven't seen a JIRA issue on this after looking through the Airflow JIRA.
>
> Thanks.
>
> Regards,
> David
>
