airflow-dev mailing list archives

From Greg Neiheisel <g...@astronomer.io>
Subject Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks
Date Thu, 30 Aug 2018 19:21:34 GMT
Yep, that should work fine. Pgbouncer is pretty configurable, so you can
play around with different settings for your environment. You can set
limits on the number of connections you want to the actual database and
point your AIRFLOW__CORE__SQL_ALCHEMY_CONN at the pgbouncer service. In my
experience, you can get away with a pretty low number of actual connections
to postgres. Pgbouncer has some tools to observe the count of clients
(airflow processes), the number of actual connections to the database, and
the number of waiting clients. You should be able to tune your
max_connections to the point where you have little to no clients waiting,
while still using a dramatically lower number of actual connections to
postgres.
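
For reference, here's roughly what that wiring can look like. This is just
a sketch with placeholder names (airflow-postgres, airflow-pgbouncer, port
6543), not necessarily what the chart itself uses: a small pgbouncer.ini
that funnels many airflow clients onto a handful of server connections,
plus the Airflow connection string pointed at pgbouncer instead of postgres.

  ; pgbouncer.ini (sketch)
  [databases]
  airflow = host=airflow-postgres port=5432 dbname=airflow

  [pgbouncer]
  listen_addr = 0.0.0.0
  listen_port = 6543
  auth_type = md5
  auth_file = /etc/pgbouncer/userlist.txt
  ; transaction pooling keeps server connections low; session mode is the
  ; conservative default if anything relies on session-level state
  pool_mode = transaction
  ; how many airflow processes may connect to pgbouncer
  max_client_conn = 200
  ; actual connections held open against postgres
  default_pool_size = 10

  # point Airflow at the pgbouncer service rather than postgres directly
  AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:<password>@airflow-pgbouncer:6543/airflow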

That chart also deploys a sidecar alongside pgbouncer that exports its
metrics for Prometheus to scrape. Here's an example Grafana dashboard that
we use to keep an eye on things -
https://github.com/astronomerio/astronomer/blob/master/docker/vendor/grafana/include/pgbouncer-stats.json
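
Even without the Prometheus/Grafana pieces, the same numbers are available
from pgbouncer's built-in admin console (plain pgbouncer, nothing specific
to the chart): connect to the special "pgbouncer" database and ask for the
pool stats. Host/port/user below are the same placeholders as above.

  psql -h airflow-pgbouncer -p 6543 -U airflow pgbouncer

  pgbouncer=# SHOW POOLS;
  -- cl_active / cl_waiting: airflow client connections, active vs. queued
  -- sv_active / sv_idle: actual server connections held against postgres
  pgbouncer=# SHOW STATS;

If cl_waiting sits at or near zero while sv_active stays well below your
postgres max_connections, the pool size is in the right ballpark.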

On Thu, Aug 30, 2018 at 2:26 PM Eamon Keane <eamon.keane1@gmail.com> wrote:

> Interesting, Greg. Do you know if using pg_bouncer would allow you to have
> more than 100 running k8s executor tasks at one time if e.g. there is a
> 100-connection limit on a GCP instance?
>
> On Thu, Aug 30, 2018 at 6:39 PM Greg Neiheisel <greg@astronomer.io> wrote:
>
> > Good point Eamon, maxing connections out is definitely something to look
> > out for. We recently added pgbouncer to our helm charts to pool
> > connections to the database for all the different airflow processes.
> > Here's our chart for reference -
> >
> > https://github.com/astronomerio/helm.astronomer.io/tree/master/charts/airflow
> >
> > On Thu, Aug 30, 2018 at 1:17 PM Kyle Hamlin <hamlin.kn@gmail.com> wrote:
> >
> > > Thanks for your responses! Glad to hear that tasks can run
> > > independently if something happens.
> > >
> > > On Thu, Aug 30, 2018 at 1:13 PM Eamon Keane <eamon.keane1@gmail.com>
> > > wrote:
> > >
> > > > Adding to Greg's point, if you're using the k8s executor and for
> > > > some reason the k8s executor worker pod fails to launch within 120
> > > > seconds (e.g. pending due to scaling up a new node), this counts as
> > > > a task failure. Also, if the k8s executor pod has already launched a
> > > > pod operator but is killed (e.g. manually or due to a node upgrade),
> > > > the pod operator it launched is not killed and runs to completion,
> > > > so if using retries, you need to ensure idempotency. The worker pods
> > > > update the db per my understanding; with each requiring a separate
> > > > connection to the db, this can tax your connection budget (100-300
> > > > for small postgres instances on GCP or AWS).
> > > >
> > > > On Thu, Aug 30, 2018 at 6:04 PM Greg Neiheisel <greg@astronomer.io> wrote:
> > > >
> > > > > Hey Kyle, the task pods will continue to run even if you reboot
> > > > > the scheduler and webserver, and the status does get updated in
> > > > > the airflow db, which is great.
> > > > >
> > > > > I know the scheduler subscribes to the Kubernetes watch API to get
> > > > > an event stream of pods completing, and it keeps a checkpoint so
> > > > > it can resubscribe when it comes back up.
> > > > >
> > > > > I forget if the worker pods update the db or if the scheduler is
> > > > > doing that, but it should work out.
> > > > >
> > > > > On Thu, Aug 30, 2018, 9:54 AM Kyle Hamlin <hamlin.kn@gmail.com> wrote:
> > > > >
> > > > > > gentle bump
> > > > > >
> > > > > > On Wed, Aug 22, 2018 at 5:12 PM Kyle Hamlin <hamlin.kn@gmail.com> wrote:
> > > > > >
> > > > > > > I'm about to make the switch to Kubernetes with Airflow, but
> > > > > > > am wondering what happens when my CI/CD pipeline redeploys the
> > > > > > > webserver and scheduler and there are still long-running tasks
> > > > > > > (pods). My intuition is that since the database holds all
> > > > > > > state, the tasks are in charge of updating their own state,
> > > > > > > and the UI only renders what it sees in the database, this is
> > > > > > > not so much of a problem. To be sure, however, here are my
> > > > > > > questions:
> > > > > > >
> > > > > > > Will task pods continue to run?
> > > > > > > Can task pods continue to poll the external system they are
> > > > > > > running tasks on while being "headless"?
> > > > > > > Can the task pods change/update state in the database while
> > > > > > > being "headless"?
> > > > > > > Will the UI/Scheduler still be aware of the tasks (pods) once
> > > > > > > they are live again?
> > > > > > >
> > > > > > > Is there anything else that might cause issues when deploying
> > > > > > > while tasks (pods) are running that I'm not thinking of here?
> > > > > > >
> > > > > > > Kyle Hamlin
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Kyle Hamlin
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Kyle Hamlin
> > >
> >
> >
> > --
> > *Greg Neiheisel* / CTO Astronomer.io
> >
>


-- 
*Greg Neiheisel* / CTO Astronomer.io
