airflow-dev mailing list archives

From Jeremiah Lowin <jlo...@apache.org>
Subject Re: Airflow kubernetes executor
Date Thu, 13 Jul 2017 12:24:29 GMT
p.s. it looks like git-sync has received an "official" release since the
last time I looked at it: https://github.com/kubernetes/git-sync

On Thu, Jul 13, 2017 at 8:18 AM Jeremiah Lowin <jlowin@apache.org> wrote:

> Hi Gerard (and anyone else for whom this might be helpful),
>
> We've run Airflow on GCP for a few years. The structure has changed over
> time but at the moment we use the following basic outline:
>
> 1. Build a container that includes all Airflow and DAG dependencies and
> push it to Google Container Registry. If you need to add/update
> dependencies or update airflow.cfg, simply push a new image.
> 2. All DAGs are pushed to a git repo
> 3. Host the AirflowDB in Google Cloud SQL
> 4. Create a Kubernetes deployment that runs the following containers (a
> minimal manifest sketch follows below, after step 5):
> -- Airflow scheduler (using the dependencies image)
> -- Airflow webserver (using the dependencies image)
> -- Airflow maintenance (using the dependencies image) - this container
> does nothing (sleep infinity) but since it shares the same setup as the
> scheduler/webserver, it's an easy place to `exec` into the cluster to
> investigate any issues that might be crashing the main containers. We limit
> its CPU to minimize impact on cluster resources. Hacky but effective.
> -- cloud sql proxy (https://cloud.google.com/sql/docs/postgres/sql-proxy)
> - to connect to the Airflow DB
> -- git-sync (https://github.com/jlowin/git-sync)
>
> The last container (git-sync) is a small library I wrote to solve the
> issue of syncing DAGs. It's not perfect and ***I am NOT offering any
> support for it*** but it gets the job done. It's meant to be a sidecar
> container and does one thing: constantly fetch a git repo to a local
> folder. In your deployment, create an EmptyDir volume and mount it in all
> containers (except cloud sql). Git-sync should use that volume as its
> target, and scheduler/webserver should use the volume as the DAGs folder.
> That way, every 30 seconds, git-sync will fetch the git repo into that
> volume, and the Airflow containers will immediately see the latest files
> appear.
>
> 5. Create a Kubernetes service to expose the webserver UI
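>
> To make the outline concrete, here is a minimal sketch of what such a
> deployment manifest might look like. This is NOT our actual config: the
> project, image names, repo URL, and Cloud SQL instance string are all
> placeholders, the git-sync environment variables should be checked against
> the README of whichever git-sync version you use, and the apiVersions
> should match your cluster.
>
> apiVersion: apps/v1
> kind: Deployment
> metadata:
>   name: airflow
> spec:
>   replicas: 1
>   selector:
>     matchLabels:
>       app: airflow
>   template:
>     metadata:
>       labels:
>         app: airflow
>     spec:
>       volumes:
>       - name: dags
>         emptyDir: {}            # shared scratch space, populated by git-sync
>       containers:
>       - name: scheduler
>         image: gcr.io/my-project/airflow:latest    # the dependencies image
>         command: ["airflow", "scheduler"]
>         env:
>         - name: AIRFLOW__CORE__DAGS_FOLDER  # point Airflow at the synced repo
>           value: /dags                      # adjust if the repo lands in a subdir
>         volumeMounts:
>         - name: dags
>           mountPath: /dags
>       - name: webserver
>         image: gcr.io/my-project/airflow:latest
>         command: ["airflow", "webserver"]
>         ports:
>         - containerPort: 8080
>         env:
>         - name: AIRFLOW__CORE__DAGS_FOLDER
>           value: /dags
>         volumeMounts:
>         - name: dags
>           mountPath: /dags
>       - name: maintenance                   # idle debug container, see above
>         image: gcr.io/my-project/airflow:latest
>         command: ["sleep", "infinity"]
>         resources:
>           limits:
>             cpu: 100m                       # keep its footprint small
>         volumeMounts:
>         - name: dags
>           mountPath: /dags
>       - name: cloudsql-proxy
>         image: gcr.io/cloudsql-docker/gce-proxy:latest
>         command: ["/cloud_sql_proxy",
>                   "-instances=my-project:us-central1:airflow-db=tcp:5432"]
>       - name: git-sync
>         image: gcr.io/my-project/git-sync:latest   # the sidecar described above
>         env:
>         - name: GIT_SYNC_REPO               # env var names vary by git-sync version
>           value: https://github.com/my-org/dags.git  # placeholder repo
>         - name: GIT_SYNC_WAIT
>           value: "30"                       # the 30-second interval mentioned above
>         volumeMounts:
>         - name: dags
>           mountPath: /git                   # git-sync's default target directory
>
> Because the volume is an EmptyDir it's ephemeral: git-sync repopulates it
> whenever the pod restarts, so nothing needs to persist across restarts.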
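>
> And a sketch of the service from step 5 (again illustrative; you might
> prefer a ClusterIP behind an ingress instead of a LoadBalancer):
>
> apiVersion: v1
> kind: Service
> metadata:
>   name: airflow-web
> spec:
>   type: LoadBalancer
>   selector:
>     app: airflow                # matches the deployment's pod labels
>   ports:
>   - port: 80
>     targetPort: 8080            # the webserver's container port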
>
> Our actual implementation is considerably more complicated than this since
> we have extensive custom modules that are loaded via git-sync rather than
> being baked into the image, as well as a few other GCP service
> integrations, but this overview should point you in the right direction.
> Getting it running the first time requires a little elbow grease but once
> built, it's easy to automate the process.
>
> Best,
> Jeremiah
>
>
>
> On Thu, Jul 13, 2017 at 3:50 AM Gerard Toonstra <gtoonstra@gmail.com>
> wrote:
>
>> It would be really good if you'd share experiences on how to run this on
>> Kubernetes and ECS. I'm not aware of a good guide for either, but it's a
>> very useful and quick setup to start with, especially combining that with
>> Deployment Manager and (probably) CloudFormation.
>>
>> I'm talking to someone else who's looking at running on Kubernetes and
>> potentially open-sourcing a generic template for Kubernetes deployments.
>>
>>
>> Would it be possible to share your experiences?  What tech are you using
>> for specific issues?
>>
>> - how do you deploy and sync dags?  Are you using EFS?
>> - how do you build the container with airflow + executables?
>> - where do you send log files or log lines to?
>> - do you have High Availability, and how?
>>
>> Really looking forward to how that's done, so we can put this on the wiki.
>>
>> Especially since GCP is now also starting to embrace Airflow, it'd be good
>> to have a better understanding of how easy and quick it can be to deploy
>> Airflow on GCP:
>>
>>
>> https://cloud.google.com/blog/big-data/2017/07/how-to-aggregate-data-for-bigquery-using-apache-airflow
>>
>> Rgds,
>>
>> Gerard
>>
>>
>> On Wed, Jul 12, 2017 at 8:55 PM, Arthur Purvis <apurvis@lumoslabs.com>
>> wrote:
>>
>> > For what it's worth, we've been running airflow on ECS for a few years
>> > already.
>> >
>> > On Wed, Jul 12, 2017 at 12:21 PM, Grant Nicholas <
>> > grantnicholas2015@u.northwestern.edu> wrote:
>> >
>> > > Is having a static set of workers necessary? Launching a job on
>> > > Kubernetes from a cached docker image takes a few seconds at most,
>> > > which I think is an acceptable delay for a batch processing system
>> > > like airflow.
>> > >
>> > > Additionally, if you dynamically launch workers you can start
>> > > launching *any type* of worker, and you don't have to statically
>> > > allocate pools of worker types. E.g., a single DAG could use a Scala
>> > > docker image to do Spark calculations, a C++ docker image to use some
>> > > low-level numerical library, and a python docker image by default to
>> > > do any generic airflow stuff. You can also size workers according to
>> > > their usage: maybe the Spark driver program only needs a few GBs of
>> > > RAM but the C++ numerical library needs many hundreds.
>> > >
>> > > I agree there is a bit of extra book-keeping that needs to be done,
>> > > but the tradeoff is an important one to make explicitly.
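>> > >
>> > > As a purely hypothetical sketch (image name, command, and resource
>> > > numbers are all invented for illustration), such a per-task pod could
>> > > look like:
>> > >
>> > > apiVersion: v1
>> > > kind: Pod
>> > > metadata:
>> > >   name: dag-task-cpp-numerics       # one pod per task instance
>> > > spec:
>> > >   restartPolicy: Never              # run to completion, like a batch job
>> > >   containers:
>> > >   - name: worker
>> > >     image: gcr.io/my-project/cpp-numerics:latest  # task-specific image
>> > >     command: ["run-task", "--task-id", "heavy_compute"]  # hypothetical entrypoint
>> > >     resources:
>> > >       requests:
>> > >         memory: 200Gi               # sized per task, not per static pool
>> > >         cpu: "8"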
>> > >
>> >
>>
>
