airflow-dev mailing list archives

From Lance Norskog <lance.nors...@gmail.com>
Subject Re: Restarting the scheduler regularly - still current advice?
Date Tue, 09 Aug 2016 19:44:45 GMT
We are on 1.6.2 and would love to upgrade to a modern version. We were
holding out for the first Apache release.

Also, we have cases where the various concurrent task limits are ignored
and we have 50 tasks scheduled at once. A DAG like this:

dag = DAG(
    dag_id='xxx',
    schedule_interval="0 "+str(hour)+" * * *",
    max_active_runs=1,
    concurrency=1
    )

and where all of the tasks use the same Pool. max_active_runs= is ignored,
concurrency= is ignored, the Pool is ignored.
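
For reference, here is a minimal sketch of the settings described above, written as plain Python so it runs without Airflow installed. The helper name `daily_cron`, the example hour, and the pool name `shared_pool` are assumptions for illustration only; they are not from the original DAG file.

```python
def daily_cron(hour):
    """Build a cron expression that fires once a day at minute 0 of `hour`."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in the range 0..23")
    return "0 {} * * *".format(hour)


# Keyword arguments mirroring the DAG definition quoted above.
dag_kwargs = {
    "dag_id": "xxx",
    "schedule_interval": daily_cron(3),  # hour 3 is an arbitrary example
    "max_active_runs": 1,  # intended: at most one active run of this DAG
    "concurrency": 1,      # intended: at most one running task in this DAG
}

# Every task is meant to share one pool; the pool name is hypothetical.
task_kwargs = {"pool": "shared_pool"}

# With Airflow installed, these would be used roughly as:
#   from airflow import DAG
#   dag = DAG(**dag_kwargs)
#   ...and each operator would be created with dag=dag, **task_kwargs
```

With all three limits in place, the expectation is one running task at a time, which is what makes 50 simultaneously scheduled tasks surprising.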


On Tue, Aug 9, 2016 at 12:08 PM, Bolke de Bruin <bdbruin@gmail.com> wrote:

> I disagree. Num_runs should NOT be used anymore, and I would really like to
> hear about ‘stuck’ schedulers on a release or on master, preferably with the
> Celery executor (LocalExecutor can sometimes look stuck but isn’t). Restarting
> should only be required for clearing up database connections, as we are not
> very good at that yet.
>
> - Bolke
>
> > On 9 Aug 2016, at 20:30, Lance Norskog <lance.norskog@gmail.com> wrote:
> >
> > Yes, it is still current advice.
> >
> > My experience is that after running for (let's say) days, the app develops
> > memory corruption. I've seen three different ways that memory corruption
> > shows up. The scheduler failure is just one of these three symptoms.
> >
> > The other two symptoms are
> > 1) the main page of the UI shows a different list of running DAGs than
> > what is really configured,
> > 2) a task contains some configuration data that should be in a neighboring
> > task, and fails.
> >
> > Frankly, I would configure all 5 daemons to restart periodically, not just
> > the scheduler daemon.
> >
> >
> > On Tue, Aug 9, 2016 at 8:50 AM, Andrew Phillips <andrewp@apache.org> wrote:
> >
> >> Hi all
> >>
> >> I just wanted to check to what extent the advice in [1] and [2], namely to
> >> restart the scheduler "every once in a while", is still considered accurate?
> >>
> >> "Restart your scheduler process to get a clean environment every once in a
> >> while. Use --num_runs N scheduler CLI option to make it stop after N runs
> >> and have some supervisor ensuring it is always running. See issue 698"
> >>
> >> "The scheduler should be restarted frequently
> >>
> >> In our experience, a long running scheduler process, at least with the
> >> CeleryExecutor, ends up not scheduling some tasks. We still don’t know the
> >> exact cause, unfortunately.
> >>
> >> Fortunately, airflow has a built-in workaround in the form of the
> >> --num_runs flag. It specifies a number of iterations for the scheduler to run
> >> of its loop before it quits. We’re running it with 10 iterations, Airbnb
> >> runs it with 5. Note that this will cause problems when using the
> >> LocalExecutor."
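
The "run N iterations, then let a supervisor restart it" pattern quoted above can be sketched roughly as follows. This is only an illustration, not how any particular installation does it: the function name `supervise`, the restart count, and the pause length are assumptions, and a real deployment would more likely rely on an existing process supervisor. The `--num_runs` flag is spelled as in the quoted advice.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts, pause_seconds=1.0):
    """Run `cmd` to completion `max_restarts` times, one run after another.

    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    for attempt in range(1, max_restarts + 1):
        # Blocks until the child process exits (for the scheduler, that is
        # after it has completed its configured number of loop iterations).
        result = subprocess.run(cmd)
        exit_codes.append(result.returncode)
        print("run %d exited with code %d" % (attempt, result.returncode),
              file=sys.stderr)
        time.sleep(pause_seconds)  # short pause before starting it again
    return exit_codes


# Command line as described in the quoted advice; 10 is the iteration count
# mentioned there. Starting it is left to the caller:
#   supervise(SCHEDULER_CMD, max_restarts=1000)
SCHEDULER_CMD = ["airflow", "scheduler", "--num_runs", "10"]
```

The loop is bounded by `max_restarts` so the sketch always terminates; a long-lived supervisor would loop indefinitely instead.
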
> >>
> >> Both documents are pretty recent, so I assume this is considered still
> >> relevant. Could you give some guidance on what kind of frequency is
> >> recommended here, or is that very dependent on the particular installation?
> >>
> >> Also, which of the current JIRA issues (if any) is the new version of
> >> "issue 698" as mentioned in the first quote? There seem to be quite a few
> >> issues relating to the scheduler getting stuck [3] - which one(s) should we
> >> follow and/or add information to, to best track progress on this topic?
> >>
> >> Thanks!
> >>
> >> ap
> >>
> >>
> >> [1] https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls
> >> [2] https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.ahcprdr9r
> >> [3] https://issues.apache.org/jira/browse/AIRFLOW-39?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20%22scheduler%22
> >>
> >
> >
> >
> > --
> > Lance Norskog
> > lance.norskog@gmail.com
> > Redwood City, CA
>
>


-- 
Lance Norskog
lance.norskog@gmail.com
Redwood City, CA
