From: George Leslie-Waksman
Date: Fri, 04 Aug 2017 17:24:41 +0000
Subject: Re: Completed tasks not being marked as completed
To: dev@airflow.incubator.apache.org

Pretty sure (not 100%) what is happening is:

1. Scheduler bugs result in task getting scheduled twice
2. Worker 1 grabs task
3. Worker 2 grabs task
4. Worker 1 starts task
5. Worker 2 starts task
6. Worker 2 sees that Worker 1 has started and plans to abort
7. Worker 1 finishes and marks task as done
8. Worker 2 finishes aborting and marks task as not done
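A minimal sketch of why step 8 can overwrite step 7 when the state check and the state write are not atomic (hypothetical code, not Airflow's implementation; the table and state names are only illustrative):

    # Hypothetical, simplified model of the sequence above; NOT Airflow's
    # actual code. Two workers read the same task_instance row, both decide
    # to act, and the aborting worker's terminal write lands last.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT)")
    db.execute("INSERT INTO task_instance VALUES ('my_task', 'queued')")

    def read_state(task_id):
        return db.execute(
            "SELECT state FROM task_instance WHERE task_id = ?", (task_id,)
        ).fetchone()[0]

    def write_state(task_id, state):
        db.execute(
            "UPDATE task_instance SET state = ? WHERE task_id = ?", (state, task_id)
        )

    # Steps 2-5: both workers check before either has written 'running',
    # so both believe they own the task.
    assert read_state("my_task") == "queued"   # Worker 1 checks
    assert read_state("my_task") == "queued"   # Worker 2 checks
    write_state("my_task", "running")          # both set running

    # Step 7: Worker 1 finishes and records success.
    write_state("my_task", "success")

    # Step 8: Worker 2 finishes aborting and writes a non-success state last,
    # overwriting Worker 1's result ("marks task as not done").
    write_state("my_task", "up_for_retry")

    print(read_state("my_task"))  # prints 'up_for_retry', not 'success'

Max's point below about using database transactions as the task instance sets itself to RUNNING is the usual guard against this: if the second worker had to take a row lock (for example SELECT ... FOR UPDATE) before deciding, it would see the first worker's state instead of racing it.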
On Fri, Jul 28, 2017 at 3:50 PM Marc Weil wrote:

> Hey Max,
>
> Thanks for the suggestions. I believe it was a retry (I'm using remote logging so I can only check after the task completes), but the UI never reported it as such. The latest_heartbeat column is in the jobs table, and interestingly I do see some running jobs that haven't heartbeated for ~22 minutes. They are LocalTaskJob instances with CeleryExecutor properly listed as the executor class. I can't really correlate these to a specific task instance, however, as there doesn't appear to be any key written to the jobs table (the dag_id column is all null, and there's no task_id column or anything).
>
> Any ideas on what could be making these tasks stop heartbeating regularly? That could explain why eventually (after an overnight period of time) everything is marked as finished in the Airflow UI: eventually these tasks do heartbeat again, but quite long after they are finished running.
>
> Thanks again!
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
>
> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
>
> > Are you sure there hasn't been a retry at that point? [One of] the expected behaviors is the one I described, where if a task finishes without reporting its success [or failure], it will stay marked as RUNNING but will fail to emit a heartbeat (which is a timestamp updated in the task_instance table as last_heartbeat or something). The scheduler monitors for RUNNING tasks without a heartbeat and eventually will handle the failure (send emails, call on_failure_callback, ...).
> >
> > Looking for the heartbeat in the DB might give you some clues as to what is going on.
> >
> > Also, there have been versions where we'd occasionally see double triggering and double firing, which can be confusing. Then you can have different processes reporting their status, and debugging those issues can be problematic. I think there's good prevention against that now, using database transactions as the task instance sets itself as RUNNING. I'm not sure if 1.8.0 is 100% clean in that regard.
> >
> > Max
> >
> > On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil wrote:
> >
> > > It happens mostly when the scheduler is catching up. More specifically, when I load a brand new DAG with a start date in the past. Usually I have it set to run 5 DAG runs at the same time, and up to 16 tasks at the same time.
> > >
> > > What I've also noticed is that the tasks will sit completed in reality but uncompleted in the Airflow DB for many hours, but if I just leave them all sitting there overnight they all tend to be marked complete the next morning. Perhaps this points to some sort of Celery timeout or connection retry interval?
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
> > >
> > > On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:
> > >
> > > > By the time "INFO - Task exited with return code 0" gets logged, the task should have been marked as successful by the subprocess. I have no specific intuition as to what the issue may be.
> > > >
> > > > I'm guessing at that point the job stops emitting heartbeats and eventually the scheduler will handle it as a failure?
> > > >
> > > > How often does that happen?
> > > >
> > > > Max
> > > >
> > > > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil wrote:
> > > >
> > > > > From what I can tell, it only affects CeleryExecutor. I've never seen this behavior with LocalExecutor before.
> > > > >
> > > > > Max, do you know anything about this type of failure mode?
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
> > > > >
> > > > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <thejonas@gmail.com> wrote:
> > > > >
> > > > > > We have the exact same problem. In our case, it's a bash operator starting a docker container. The container and the process it ran exit, but the 'docker run' command is still showing up in the process table, waiting for an event.
> > > > > >
> > > > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > > > >
> > > > > > _jonas
> > > > > >
> > > > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Has anyone seen the behavior when using CeleryExecutor where workers will finish their tasks ("INFO - Task exited with return code 0" shows in the logs) but are never marked as complete in the Airflow DB or UI? Effectively this causes tasks to hang even though they are complete, and the DAG will not continue.
> > > > > > >
> > > > > > > This is happening on 1.8.0. Anyone else seen this or perhaps have a workaround?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > --
> > > > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement | New Relic
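Max's suggestion to look for the heartbeat in the DB can be followed directly against the metadata database. A rough sketch, assuming a SQLAlchemy connection and the job table layout Marc describes (a state column plus latest_heartbeat); the connection string, table name, and column names are assumptions, not verified against 1.8.0, so adjust them to your schema:

    # Rough sketch: list jobs still marked as running whose heartbeat looks
    # stale. All names below are assumptions based on the thread; verify them
    # against your own Airflow metadata database before relying on this.
    from datetime import datetime, timedelta

    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")  # assumed DSN

    STALE_AFTER = timedelta(minutes=5)  # threshold well above the heartbeat interval

    with engine.connect() as conn:
        rows = conn.execute(
            text(
                "SELECT id, job_type, executor_class, latest_heartbeat "
                "FROM job "
                "WHERE state = 'running' AND latest_heartbeat < :cutoff"
            ),
            {"cutoff": datetime.utcnow() - STALE_AFTER},
        )
        for row in rows:
            print(row)

Rows that match for more than a few minutes are the "running but silent" jobs Max describes; in this thread they correspond to the LocalTaskJob entries Marc saw going ~22 minutes without a heartbeat.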