airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxime Beauchemin <maximebeauche...@gmail.com>
Subject Re: Completed tasks not being marked as completed
Date Fri, 28 Jul 2017 22:15:51 GMT
Are you sure there hasn't been a retry at that point? [One of] the expected
behavior is the one I described, where if a task finished without reporting
it's success [or failure], it will stay marked as RUNNING, but will fail to
emit a heartbeat (which is a timestamp updated in the task_instance table
as last_heartbeat or something).  The scheduler monitors for RUNNING tasks
without heartbeat and eventually will handle the failure (send emails, call
on_failure_callback, ...).

Looking for heartbeat in the DB might give you some clues as to what is
going on.

Also there have been versions where we'd occasionally see double
triggering, and double firing, which can be confusing. Then you can have
different processes reporting their status and debugging those issues can
be problematic. I think there's good prevention against that now, using
database transactions as the task instance sets itself as RUNNING. I'm not
sure if 1.8.0 is 100% clean from that regard.

Max

On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <mweil@newrelic.com> wrote:

> It happens mostly when the scheduler is catching up. More specifically,
> when I load a brand new DAG with a start date in the past. Usually I have
> it set to run 5 DAG runs at the same time, and up to 16 tasks at the same
> time.
>
> What I've also noticed is that the tasks will sit completed in reality but
> uncompleted in the Airflow DB for many hours, but if I just leave them all
> sitting there over night they all tend to be marked complete the next
> morning. Perhaps this points to some sort of Celery timeout or connection
> retry interval?
> ᐧ
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>
> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > By the time "INFO - Task exited with return code 0" gets logged, the task
> > should have been marked as successful by the subprocess. I have no
> specific
> > intuition as to what the issue may be.
> >
> > I'm guessing at that point the job stops emitting heartbeat and
> eventually
> > the scheduler will handle it as a failure?
> >
> > How often does that happen?
> >
> > Max
> >
> > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mweil@newrelic.com> wrote:
> >
> > > From what I can tell, it only affects CeleryExecutor. I've never seen
> > this
> > > behavior with LocalExecutor before.
> > >
> > > Max, do you know anything about this type of failure mode?
> > > ᐧ
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> Engagement
> > |
> > > New Relic
> > >
> > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <thejonas@gmail.com>
> > > wrote:
> > >
> > > > We have the exact same problem. In our case, it's a bash operator
> > > starting
> > > > a docker container. The container and process it ran exit, but the
> > > 'docker
> > > > run' command is still showing up in the process table, waiting for an
> > > > event.
> > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > >
> > > > _jonas
> > > >
> > > >
> > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mweil@newrelic.com>
> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Has anyone seen the behavior when using CeleryExecutor where
> workers
> > > will
> > > > > finish their tasks ("INFO - Task exited with return code 0" shows
> in
> > > the
> > > > > logs) but are never marked as complete in the airflow DB or UI?
> > > > Effectively
> > > > > this causes tasks to hang even though they are complete, and the
> DAG
> > > will
> > > > > not continue.
> > > > >
> > > > > This is happening on 1.8.0. Anyone else seen this or perhaps have
a
> > > > > workaround?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > Engagement
> > > > |
> > > > > New Relic
> > > > > ᐧ
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message