airflow-dev mailing list archives

From James Meickle <jmeic...@quantopian.com.INVALID>
Subject Re: Is `airflow backfill` disfunctional?
Date Mon, 04 Mar 2019 20:35:39 GMT
This is an old thread, but I wanted to bump it as I just had a really bad
experience using backfill. I'd been hesitant to even try backfills out
given what I'd read about them, so I've just relied on the UI to "Clear"
entire tasks. However, I wanted to give it a shot the "right" way. Issues I
ran into:

- The dry run flag didn't give good feedback about which DagRuns and task
instances would be affected (and is very easy to typo as "--dry-run")

- The terminal interface was uselessly verbose. It was scrolling fast
enough to be unreadable.

- The backfill exceeded safe concurrency limits for the cluster and
could've easily brought it down if I'd left it running.

- Tasks in the backfill were executed out of order despite the tasks having
`depends_on_past` set

- The backfill converted all existing DAGRuns to be backfill runs that the
scheduler later ignored, which is not how I would've expected this to work
(nor was it indicated in the dry run)

I ended up having to do manual recovery work in the database to turn the
"backfill" runs back into scheduler runs, and then switch to using `airflow
clear`. I'm a heavy Airflow user and this took me an hour; it would've been
much worse for anyone else on my team.
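[Editor's note] The manual recovery described above can be sketched roughly as follows. This is a hypothetical illustration only: an in-memory SQLite table stands in for the Airflow metadata database, and the `backfill_` / `scheduled__` run_id prefixes follow 1.x conventions; check your actual schema before attempting anything like this against a real metadata DB.

```python
# Hypothetical sketch: rename "backfill_" DagRuns back to
# scheduler-style run_ids so the scheduler stops ignoring them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, run_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        ("my_dag", "backfill_2019-03-01T00:00:00", "success"),
        ("my_dag", "scheduled__2019-03-02T00:00:00", "success"),
    ],
)

# Rewrite the run_id prefix for one DAG's backfill runs
# ("backfill_" is 9 characters, so the date starts at position 10).
conn.execute(
    """
    UPDATE dag_run
    SET run_id = 'scheduled__' || substr(run_id, 10)
    WHERE dag_id = 'my_dag' AND substr(run_id, 1, 9) = 'backfill_'
    """
)

rows = sorted(r[0] for r in conn.execute("SELECT run_id FROM dag_run"))
print(rows)
```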

I don't have any specific suggestions here other than to confirm that this
feature needs an overhaul if it's to be recommended to anyone.

On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <maximebeauchemin@gmail.com>
wrote:

> Ash I don't see how this could happen unless maybe the node doing the
> backfill is using another metadata database.
>
> In general we recommend that people run `--local` backfills, and have the
> default/sandbox template for `airflow.cfg` use a LocalExecutor with
> reasonable parallelism, to make that behavior the default.
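[Editor's note] A sandbox `airflow.cfg` along the lines Max describes might look like the fragment below. The keys are 1.x-era `[core]` settings; the values are illustrative, not recommendations from this thread.

```ini
# Illustrative sandbox defaults: local backfills via LocalExecutor
# with modest parallelism (values are examples only).
[core]
executor = LocalExecutor
parallelism = 8
dag_concurrency = 4
max_active_runs_per_dag = 3
```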
>
> Given the [not-so-great] state of backfill, I'm guessing many have been
> using the scheduler to do backfills. In that regard, it would be nice to
> have CLI commands to generate DagRuns or alter the state of existing ones.
>
> Max
>
> On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> ash_airflowlist@firemirror.com> wrote:
>
> > Somewhat related to this, but likely a different issue:
> >
> > I've just had a case where a long-running (7 hours) backfill task
> > ended up running twice somehow. We're using Celery, so this might be
> > related to some sort of Celery visibility timeout, but I haven't had a
> > chance to dig into it in detail - it's 5pm on a Friday :D
> >
> > Has anyone else noticed anything similar?
> >
> > -ash
> >
> >
> > > On 8 Jun 2018, at 01:22, Tao Feng <fengtao04@gmail.com> wrote:
> > >
> > > Thanks everyone for the feedback, especially on the background for
> > > backfill. After reading the discussion, I think it would be safest to
> > > add a flag to auto-rerun failed tasks for backfill, with the default
> > > being false. I have updated the PR accordingly.
> > >
> > > Thanks a lot,
> > > -Tao
> > >
> > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield
> > > <mark.whitfield@nytimes.com> wrote:
> > >
> > >> I've been doing some work setting up a large, collaborative Airflow
> > >> pipeline with a group that makes heavy use of backfills, and have been
> > >> encountering a lot of these issues myself.
> > >>
> > >> Other gripes:
> > >>
> > >> Backfills do not obey concurrency pool restrictions. We had been
> > >> making heavy use of SubDAGs and using concurrency pools to prevent
> > >> deadlocks (why does the SubDAG itself even need to occupy a
> > >> concurrency slot if none of its constituent tasks are running?), but
> > >> this quickly became untenable when using backfills and we were
> > >> forced to mostly abandon SubDAGs.
> > >>
> > >> Backfills do use DagRuns now, which is a big improvement. However,
> > >> it's a common use case for us to add new tasks to a DAG and backfill
> > >> to a date specific to that task. When we do this, the BackfillJob
> > >> will pick up previous backfill DagRuns and re-use them, which is
> > >> mostly nice because it keeps the Tree view neatly organized in the
> > >> UI. However, it does not reset the start time of the DagRun when it
> > >> does this. Combined with a DAG-level timeout, this means that the
> > >> backfill job will activate a DagRun, but then the run will
> > >> immediately time out (since it still thinks it's been running since
> > >> the previous backfill). This will cause tasks to deadlock
> > >> spuriously, making backfills extremely cumbersome to carry out.
> > >>
> > >> *Mark Whitfield*
> > >> Data Scientist
> > >> New York Times
> > >>
> > >>
> > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > >> maximebeauchemin@gmail.com>
> > >> wrote:
> > >>
> > >>> Thanks for the input, this is helpful.
> > >>>
> > >>> To add to the list, there's some complexity around concurrency
> > >>> management and multiple executors: I just hit this thing where
> > >>> backfill doesn't check DAG-level concurrency, fires up 32 tasks,
> > >>> and `airflow run` double-checks the DAG-level concurrency limit and
> > >>> exits. Right after, backfill reschedules right away, and so on,
> > >>> burning a bunch of CPU doing nothing. In this specific case it
> > >>> seems like `airflow run` should skip that specific check when in
> > >>> the context of a backfill.
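[Editor's note] The wasted-work loop Max describes can be sketched like this. It is a rough, hypothetical simplification (not Airflow's code): the per-task runner re-checks a DAG-level concurrency limit that the backfill already exceeded, so every launched task bails out and is rescheduled.

```python
# Rough sketch (hypothetical, not Airflow's implementation) of the loop:
# backfill launches tasks without checking the DAG-level concurrency
# limit, the per-task runner re-checks it and exits, and the backfill
# reschedules them, burning CPU without making progress.
DAG_CONCURRENCY_LIMIT = 16

def task_runner_would_run(running_count: int, in_backfill_context: bool) -> bool:
    # The fix suggested above: skip the DAG-level check when the task
    # was launched by a backfill, which manages its own slots.
    if in_backfill_context:
        return True
    return running_count < DAG_CONCURRENCY_LIMIT

# Backfill fires up 32 tasks at once, past the limit of 16.
launched = 32

# Without the fix, every task sees the limit exceeded, exits, and
# gets rescheduled.
rescheduled = sum(
    1 for _ in range(launched)
    if not task_runner_would_run(running_count=launched, in_backfill_context=False)
)
print(rescheduled)
```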
> > >>>
> > >>> Max
> > >>>
> > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bdbruin@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Thinking out loud here, because it is a while back that I did
> > >>>> work on backfills. There were some real issues with backfills:
> > >>>>
> > >>>> 1. Tasks were running in non-deterministic order, ending up in
> > >>>> regular deadlocks.
> > >>>> 2. Didn't create DagRuns, making behavior inconsistent. Max DAG
> > >>>> runs could not be enforced, the UI couldn't really display it, and
> > >>>> there were lots of minor other issues because of it.
> > >>>> 3. Behavior was different from the scheduler, while
> > >>>> SubDagOperators particularly make use of backfills at the moment.
> > >>>>
> > >>>> I think with 3 the behavior you are observing crept in. And given
> > >>>> 3 I would argue a consistent behavior between the scheduler and
> > >>>> the backfill mechanism is still paramount. Thus we should
> > >>>> explicitly clear tasks from failed if we want to rerun them. This
> > >>>> at least until we move the SubDagOperator out of backfill and into
> > >>>> the scheduler (which is actually not too hard). Also, we need
> > >>>> those command line options anyway.
> > >>>>
> > >>>> Bolke
> > >>>>
> > >>>> Sent from my iPad
> > >>>>
> > >>>>> On 6 Jun 2018, at 01:27, Scott Halgrim
> > >>>>> <scott.halgrim@zapier.com.INVALID> wrote:
> > >>>>>
> > >>>>> The request was for opposition, but I'd like to weigh in on the
> > >>>>> side of "it's a better behavior [to have failed tasks re-run when
> > >>>>> cleared in a backfill]".
> > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin
> > >>>>>> <maximebeauchemin@gmail.com>, wrote:
> > >>>>>> @Jeremiah Lowin <jlowin@gmail.com> & @Bolke de Bruin
> > >>>>>> <bdbruin@gmail.com> I think you may have some context on why
> > >>>>>> this may have changed at some point. I'm assuming that when
> > >>>>>> DagRun handling was added to the backfill logic, the behavior
> > >>>>>> just happened to change to what it is now.
> > >>>>>>
> > >>>>>> Any opposition to moving back towards re-running failed tasks
> > >>>>>> when starting a backfill? I think it's a better behavior, though
> > >>>>>> it's a change in behavior that we should mention in UPDATE.md.
> > >>>>>>
> > >>>>>> One of our goals is to make sure that a failed or killed
> > >>>>>> backfill can be restarted and just seamlessly pick up where it
> > >>>>>> left off.
> > >>>>>>
> > >>>>>> Max
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fengtao04@gmail.com>
> > >> wrote:
> > >>>>>>>
> > >>>>>>> After discussing with Max, we think it would be great
if `airflow
> > >>>> backfill`
> > >>>>>>> could be able to auto pick up and rerun those failed
tasks.
> > >>> Currently,
> > >>>> it
> > >>>>>>> will throw exceptions(
> > >>>>>>>
> > >>>>>>>
> > >>>>
> > >>> https://github.com/apache/incubator-airflow/blob/master/airf
> > >> low/jobs.py#L2489
> > >>>>>>> )
> > >>>>>>> without rerunning the failed tasks.
> > >>>>>>>
> > >>>>>>> But since it broke some of the previous assumptions
for backfill,
> > >> we
> > >>>> would
> > >>>>>>> like to get some feedback and see if anyone has any
concerns(pr
> > >> could
> > >>>> be
> > >>>>>>> found at https://github.com/apache/incu
> > >> bator-airflow/pull/3464/files
> > >>> ).
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> -Tao
> > >>>>>>>
> > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin
<
> > >>>>>>> maximebeauchemin@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> So I'm running a backfill for what feels like the
first time in
> > >>> years
> > >>>>>>> using
> > >>>>>>>> a simple `airflow backfill --local` commands.
> > >>>>>>>>
> > >>>>>>>> First I start getting a ton of `logging.info` of
each tasks
> that
> > >>>> cannot
> > >>>>>>> be
> > >>>>>>>> started just yet at every tick flooding my terminal
with the
> > >> keyword
> > >>>>>>>> `FAILED` in it, looking like a million of lines
like this one:
> > >>>>>>>>
> > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO
- Dependencies
> not
> > >>> met
> > >>>>>>> for
> > >>>>>>>> <TaskInstance: some_dag.some_task_id 2018-01-28
00:00:00
> > >>> [scheduled]>,
> > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger
rule
> > >> 'all_success'
> > >>> re
> > >>>>>>>> quires all upstream tasks to have succeeded, but
found 1
> > >>>> non-success(es).
> > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed':
0L,
> > >>>> 'upstream_failed':
> > >>>>>>>> 0L,
> > >>>>>>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other
> > >> _task_id']
> > >>>>>>>>
> > >>>>>>>> Good thing I triggered 1 month and not 2 years
like I actually
> > >> need,
> > >>>> just
> > >>>>>>>> the logs here would be "big data". Now I'm unclear
whether
> there's
> > >>>>>>> anything
> > >>>>>>>> actually running or if I did something wrong, so
I decide to
> kill
> > >>> the
> > >>>>>>>> process so I can set a smaller date range and get
a better
> picture
> > >>> of
> > >>>>>>>> what's up.
> > >>>>>>>>
> > >>>>>>>> I check my logging level, am I in DEBUG? Nope.
Just INFO. So I
> > >> take
> > >>> a
> > >>>>>>> note
> > >>>>>>>> that I'll need to find that log-flooding line and
demote it to
> > >> DEBUG
> > >>>> in a
> > >>>>>>>> quick PR, no biggy.
> > >>>>>>>>
> > >>>>>>>> Now I restart with just a single schedule, and
get an error `Dag
> > >>>>>>> {some_dag}
> > >>>>>>>> has reached maximum amount of 3 dag runs`. Hmmm,
I wish backfill
> > >>> could
> > >>>>>>> just
> > >>>>>>>> pickup where it left off. Maybe I need to run an
`airflow clear`
> > >>>> command
> > >>>>>>>> and restart? Ok, ran my clear command, same error
is showing up.
> > >>> Dead
> > >>>>>>> end.
> > >>>>>>>>
> > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns`
option?
> > >>>> Doesn't
> > >>>>>>>> look like it... Maybe `airflow backfill` has some
new switches
> to
> > >>>> pick up
> > >>>>>>>> where it left off? Can't find it. Am I supposed
to clear the DAG
> > >>> Runs
> > >>>>>>>> manually in the UI? This is a pre-production, in-development
> DAG,
> > >> so
> > >>>>>>> it's
> > >>>>>>>> not on the production web server. Am I supposed
to fire up my
> own
> > >>> web
> > >>>>>>>> server to go and manually handle the backfill-related
DAG Runs?
> > >>>> Cannot to
> > >>>>>>>> my staging MySQL and do manually clear some DAG
runs?
> > >>>>>>>>
> > >>>>>>>> So. Fire up a web server, navigate to my dag_id,
delete the DAG
> > >>> runs,
> > >>>> it
> > >>>>>>>> appears I can finally start over.
> > >>>>>>>>
> > >>>>>>>> Next thought was: "Alright looks like I need to
go Linus on the
> > >>>> mailing
> > >>>>>>>> list".
> > >>>>>>>>
> > >>>>>>>> What am I missing? I'm really hoping these issues
specific to
> > >> 1.8.2!
> > >>>>>>>>
> > >>>>>>>> Backfilling is core to Airflow and should work
very well. I want
> > >> to
> > >>>>>>> restate
> > >>>>>>>> some reqs for Airflow backfill:
> > >>>>>>>> * when failing / interrupted, it should seamlessly
be able to
> > >> pickup
> > >>>>>>> where
> > >>>>>>>> it left off
> > >>>>>>>> * terminal logging at the INFO level should be
a clear, human
> > >>>> consumable,
> > >>>>>>>> indicator of progress
> > >>>>>>>> * backfill-related operations (including restarts)
should be
> > >> doable
> > >>>>>>> through
> > >>>>>>>> CLI interactions, and not require web server interactions
as the
> > >>>> typical
> > >>>>>>>> sandbox (dev environment) shouldn't assume the
existence of a
> web
> > >>>> server
> > >>>>>>>>
> > >>>>>>>> Let's fix this.
> > >>>>>>>>
> > >>>>>>>> Max
> > >>>>>>>>
> > >>>>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>
