airflow-dev mailing list archives

From George Leslie-Waksman <geo...@cloverhealth.com.INVALID>
Subject Re: Completed tasks not being marked as completed
Date Fri, 04 Aug 2017 17:24:41 GMT
Pretty sure (not 100%) what is happening is:

   1. Scheduler bugs result in task getting scheduled twice
   2. Worker 1 grabs task
   3. Worker 2 grabs task
   4. Worker 1 starts task
   5. Worker 2 starts task
   6. Worker 2 sees that Worker 1 has started and plans to abort
   7. Worker 1 finishes and marks task as done
   8. Worker 2 finishes aborting and marks task as not done
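
To make the race concrete, here is a toy Python sketch (plain threads, not
Airflow code) of the unguarded check-then-set behind steps 2 and 3 above:
both workers read the task's state before either of them writes it, so both
believe they have claimed it.

    import threading
    import time

    state = {"task_a": "queued"}   # stand-in for a task_instance row
    claims = []                    # which workers think they own the task

    def worker(name):
        if state["task_a"] != "running":   # check ...
            time.sleep(0.1)                # ... the other worker checks in this window
            state["task_a"] = "running"    # ... then both set
            claims.append(name)

    t1 = threading.Thread(target=worker, args=("worker-1",))
    t2 = threading.Thread(target=worker, args=("worker-2",))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(claims)   # typically ['worker-1', 'worker-2']: a double claim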


On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <mweil@newrelic.com> wrote:

> Hey Max,
>
> Thanks for the suggestions. I believe it was a retry (I'm using remote
> logging so I can only check after the task completes), but the UI never
> reported it as such. The latest_heartbeat column is in the jobs table, and
> interestingly I do see some running jobs that haven't heartbeated for ~22
> minutes. They are LocalTaskJob instances with CeleryExecutor properly
> listed as the executor class. I can't really correlate these to a specific
> task instance, however, as there doesn't appear to be any key written to
> the jobs table (the dag_id column is all null, and there's no task_id
> column or anything).
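
(A minimal sketch of that check, assuming a SQLAlchemy engine pointed at the
Airflow metadata database and the 1.8-era schema where the table is named job
and carries state, job_type, and latest_heartbeat columns; the connection
string and cutoff below are placeholders.)

    from datetime import datetime, timedelta

    from sqlalchemy import create_engine, text

    # Placeholder connection string; point this at your Airflow metadata DB.
    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

    # "Running" jobs whose heartbeat is older than 15 minutes.
    cutoff = datetime.utcnow() - timedelta(minutes=15)

    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT id, job_type, state, latest_heartbeat "
                 "FROM job "
                 "WHERE state = 'running' AND latest_heartbeat < :cutoff"),
            {"cutoff": cutoff},
        )
        for row in rows:
            print(row)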
>
> Any ideas on what could be making these tasks stop heartbeating regularly?
> That could explain why eventually (after an overnight period of time)
> everything is marked as finished in the Airflow UI: eventually these tasks
> do heartbeat again, but quite long after they are finished running.
>
> Thanks again!
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>
> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > Are you sure there hasn't been a retry at that point? [One of] the
> > expected behaviors is the one I described, where if a task finishes
> > without reporting its success [or failure], it will stay marked as
> > RUNNING, but will fail to emit a heartbeat (which is a timestamp updated
> > in the task_instance table as last_heartbeat or something). The
> > scheduler monitors for RUNNING tasks without a heartbeat and eventually
> > will handle the failure (send emails, call on_failure_callback, ...).
> >
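
(As a sketch of the failure path described above, assuming 1.8-era import
paths; the DAG id, command, e-mail address, and callback are made up.
on_failure_callback and email_on_failure are standard task arguments.)

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    def notify_failure(context):
        # Hypothetical handler, invoked when the task instance is handled as
        # failed (per the above, that includes the no-heartbeat case).
        ti = context["task_instance"]
        print("Task failed: %s.%s" % (ti.dag_id, ti.task_id))

    dag = DAG("heartbeat_demo", start_date=datetime(2017, 7, 1),
              schedule_interval=None)

    do_work = BashOperator(
        task_id="do_work",
        bash_command="sleep 10",
        email_on_failure=True,
        email=["alerts@example.com"],
        on_failure_callback=notify_failure,
        dag=dag,
    )
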
> > Looking for heartbeat in the DB might give you some clues as to what is
> > going on.
> >
> > Also, there have been versions where we'd occasionally see double
> > triggering and double firing, which can be confusing. Then you can have
> > different processes reporting their status, and debugging those issues
> > can be problematic. I think there's good prevention against that now,
> > using database transactions as the task instance sets itself as RUNNING.
> > I'm not sure 1.8.0 is 100% clean in that regard.
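
(A rough sketch of that kind of transactional guard, not the actual Airflow
implementation: lock the task_instance row, re-check its state inside the
transaction, and only then flip it to RUNNING so a second worker backs off.
Assumes SQLAlchemy, a metadata database that supports SELECT ... FOR UPDATE,
and a placeholder connection string.)

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    from airflow.models import TaskInstance

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
    Session = sessionmaker(bind=engine)

    def try_claim(dag_id, task_id, execution_date):
        """Return True only if this worker wins the race for the task instance."""
        session = Session()
        try:
            ti = (
                session.query(TaskInstance)
                .filter_by(dag_id=dag_id, task_id=task_id,
                           execution_date=execution_date)
                .with_for_update()     # row lock: a second worker blocks here
                .one()
            )
            if ti.state == "running":  # somebody else already claimed it
                return False
            ti.state = "running"
            session.commit()           # committing releases the lock
            return True
        finally:
            session.close()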
> >
> > Max
> >
> > On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <mweil@newrelic.com> wrote:
> >
> > > It happens mostly when the scheduler is catching up. More
> > > specifically, when I load a brand new DAG with a start date in the
> > > past. Usually I have it set to run 5 DAG runs at the same time, and up
> > > to 16 tasks at the same time.
> > >
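
(For reference, the caps described above would presumably come from DAG-level
settings along these lines; a sketch using the standard max_active_runs and
concurrency arguments, with a made-up DAG id and task.)

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # Backfill-heavy DAG: the start date is in the past, so the scheduler
    # has a lot of catching up to do when the DAG is first loaded.
    dag = DAG(
        "catchup_example",              # hypothetical DAG id
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",
        max_active_runs=5,              # "5 DAG runs at the same time"
        concurrency=16,                 # "up to 16 tasks at the same time"
    )

    step = DummyOperator(task_id="step", dag=dag)
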
> > > What I've also noticed is that the tasks will sit completed in
> > > reality but uncompleted in the Airflow DB for many hours, but if I
> > > just leave them all sitting there overnight they all tend to be marked
> > > complete the next morning. Perhaps this points to some sort of Celery
> > > timeout or connection retry interval?
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > Engagement | New Relic
> > >
> > > On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
> > > maximebeauchemin@gmail.com> wrote:
> > >
> > > > By the time "INFO - Task exited with return code 0" gets logged,
> > > > the task should have been marked as successful by the subprocess. I
> > > > have no specific intuition as to what the issue may be.
> > > >
> > > > I'm guessing at that point the job stops emitting a heartbeat and
> > > > eventually the scheduler will handle it as a failure?
> > > >
> > > > How often does that happen?
> > > >
> > > > Max
> > > >
> > > > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mweil@newrelic.com>
> > > > wrote:
> > > >
> > > > > From what I can tell, it only affects CeleryExecutor. I've never
> > > > > seen this behavior with LocalExecutor before.
> > > > >
> > > > > Max, do you know anything about this type of failure mode?
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > > > Engagement | New Relic
> > > > >
> > > > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <thejonas@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > We have the exact same problem. In our case, it's a bash operator
> > > > > > starting a docker container. The container and the process it ran
> > > > > > have exited, but the 'docker run' command is still showing up in
> > > > > > the process table, waiting for an event.
> > > > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > > > >
> > > > > > _jonas
> > > > > >
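
(The setup described above presumably looks roughly like this sketch; the
image name and command are made up. The reported symptom is that the
'docker run' client process lingers after the container's main process
exits, so the task never reports completion.)

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("docker_via_bash", start_date=datetime(2017, 7, 1),
              schedule_interval="@daily")

    # BashOperator shells out to docker; if the 'docker run' process never
    # returns, the task never reports success even though the work is done.
    run_job = BashOperator(
        task_id="run_job",
        bash_command="docker run --rm example/job:latest python /app/job.py",
        dag=dag,
    )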
> > > > > >
> > > > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mweil@newrelic.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Has anyone seen the behavior when using CeleryExecutor where
> > > > > > > workers will finish their tasks ("INFO - Task exited with
> > > > > > > return code 0" shows in the logs) but the tasks are never
> > > > > > > marked as complete in the airflow DB or UI? Effectively this
> > > > > > > causes tasks to hang even though they are complete, and the
> > > > > > > DAG will not continue.
> > > > > > >
> > > > > > > This is happening on 1.8.0. Anyone else seen this or perhaps
> > > > > > > have a workaround?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > --
> > > > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > > > > > Engagement | New Relic
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
