airflow-dev mailing list archives

From Yacine Chantit <Yacine.Chan...@thetrainline.com>
Subject Re: Airflow - High Availability and Scale Up vs Scale Out
Date Mon, 11 Jun 2018 08:26:50 GMT
We are using AWS ECS to deploy Airflow, and we rely on it for high availability
and for scaling workers.

We have defined 3 ECS services: scheduler, webserver, and worker.

The scheduler and webserver each run in a single container.

The worker service can scale to as many containers as we want; we currently have
3 workers running within the worker service.
We use the ECS service scheduler to make sure there is always one Airflow scheduler
running; in fact, we start the Airflow scheduler with the run-duration param set to
10 minutes so that it exits and gets restarted continuously by ECS.
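The exit-and-restart trick above relies on a scheduler flag from Airflow 1.x (the version current when this was written; the flag was later removed in Airflow 2.0). As a command fragment, the container entrypoint might look like:

```shell
# Exit after 10 minutes (600 s); the ECS service scheduler then restarts the
# container, giving the Airflow scheduler a periodic clean restart.
airflow scheduler --run-duration 600
```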

We have also defined health check endpoints to check the health of all Airflow
processes. For instance, to check the health of the scheduler we run a system-health
DAG that spins up 3 dummy tasks which write some logs to S3 and fire an event to
New Relic. The scheduler healthcheck endpoint just checks that there is a task
instance log for the last dagrun, and we use the New Relic sys_health events to
define alerts. The healthcheck endpoints are used by ECS to verify that each of
the Airflow ECS services is healthy.
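A minimal sketch of the freshness check such a scheduler healthcheck might perform. The post shows no code, so all names here are hypothetical; the idea is just "is there a recent task-instance log for the last health-DAG run?":

```python
from datetime import datetime, timedelta

# Hypothetical freshness check: the endpoint looks up the timestamp of the
# newest task-instance log written by the system-health DAG and treats the
# scheduler as healthy only if that log is recent enough.
def scheduler_is_healthy(latest_log_time, now, max_age=timedelta(minutes=30)):
    """Return True if the last health-DAG task log is fresh enough."""
    if latest_log_time is None:  # the health DAG has never produced a log
        return False
    return now - latest_log_time <= max_age

# Example: a log written 5 minutes ago counts as healthy.
now = datetime(2018, 6, 11, 9, 0)
print(scheduler_is_healthy(now - timedelta(minutes=5), now))   # True
print(scheduler_is_healthy(now - timedelta(minutes=45), now))  # False
print(scheduler_is_healthy(None, now))                         # False
```

ECS would call this logic through an HTTP endpoint (for example a tiny web route returning 200 or 503) and replace the container when the check fails.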

We also deploy our DAGs inside the Docker image when it is built, so we have an
immutable image. It's not ideal to rebuild the image and redeploy the whole Airflow
cluster for a small DAG change, but it is simpler than having to deal with mounted
volumes. We put our logs on S3, so we don't mind killing containers so often.
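The "DAGs baked into the image" approach could look like this minimal Dockerfile sketch. The image base, paths, bucket name, and the S3 remote-logging settings are assumptions for illustration, not taken from the post:

```dockerfile
# Hypothetical immutable image: DAGs are copied in at build time, so every
# deploy ships a self-contained image and no volume mounts are needed.
FROM python:3.6-slim
RUN pip install "apache-airflow==1.10.0"
ENV AIRFLOW_HOME=/usr/local/airflow
# Bake the DAGs into the image; a small DAG change means a rebuild + redeploy.
COPY dags/ ${AIRFLOW_HOME}/dags/
# Task logs go to S3, so containers can be killed freely without losing logs.
ENV AIRFLOW__CORE__REMOTE_LOGGING=True \
    AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://my-bucket/airflow-logs
CMD ["airflow", "scheduler"]
```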

It's working fine so far, but we have only just started (we plan to migrate a few
hundred DAGs from another workflow tool) and have only a few DAGs on Airflow, so
I don't know if we will keep this setup once we have a few dozen DAG changes
every day.

Regards,
Yacine

On 10/06/2018, 21:04, "Ali Uz" <aliuz1@gmail.com> wrote:

    We also run one beefy box in AWS ECS with the scheduler and webserver
    running on the same container. However, we have run into issues with this
    approach, as the scheduler does fail at times and our DAGs get stuck until
    I manually restart the container.
    What approaches do you use to restart the scheduler automatically when
    it's stuck and/or has failed?

    - Ali

    On Sun, Jun 10, 2018 at 8:44 PM Bolke de Bruin <bdbruin@gmail.com> wrote:

    > If you are running on one big box, you most certainly want to put the
    > scheduler in its own cgroup and run the tasks with sudo in their own.
    > Otherwise your availability might suffer.
    >
    > B.
    >
    > Verstuurd vanaf mijn iPad
    >
    > > Op 10 jun. 2018 om 16:30 heeft Sam Sen <sxs@integrichain.com> het
    > volgende geschreven:
    > >
    > > Wouldn't you want immutable containers? In that case, baking the code
    > > into the container would be more ideal.
    > >
    > >> On Sun, Jun 10, 2018, 9:53 AM Arash Soheili <tonyarash@gmail.com>
    > wrote:
    > >>
    > >> We are just starting out, but our setup is 2 EC2 instances, with one
    > >> running the web server and scheduler and the other running multiple
    > >> workers. The database is an RDS instance which both are connected to,
    > >> as well as Redis on AWS ElastiCache for the Celery connection.
    > >>
    > >> All 4 services run in containers with systemd, and we use CodeDeploy
    > >> and sync up the code by mapping volumes from the local filesystem to
    > >> the container. We are not yet heavy users of Airflow, so I can't speak
    > >> to performance and scale just yet.
    > >>
    > >> In general I think an AMI with baked-in code can be brittle and hard
    > >> to maintain and update. Containers are the way to go, as you can bake
    > >> the code into the image if you want. We have chosen not to do that and
    > >> instead rely on volume mapping to update the latest code in the
    > >> container. This makes things easier in that you don't need to keep
    > >> creating new images.
    > >>
    > >> Arash
    > >>
    > >>> On Sat, Jun 9, 2018 at 9:47 AM Naik Kaxil <k.naik@reply.com> wrote:
    > >>>
    > >>> Let us know your findings after trying the beefy box approach.
    > >>>
    > >>> On 08/06/2018, 12:24, "Sam Sen" <sxs@integrichain.com> wrote:
    > >>>
    > >>>    We are facing this now. We have tried the CeleryExecutor and it
    > >>>    adds more moving parts. While we have not thrown out this idea,
    > >>>    we are going to give one big beefy box a try.
    > >>>
    > >>>    To handle the HA side of things, we are putting the server in an
    > >>>    auto-scaling group (we use AWS) with a min and max of 1 server.
    > >>>    We deploy from an AMI that has Airflow baked in, and we point the
    > >>>    DB config to an RDS instance using service discovery (Consul).
    > >>>
    > >>>    As for the DAG code, we can either bake it into the AMI as well
    > >>>    or install it on bootup. We haven't decided what to do for this,
    > >>>    but either way we realize it could take a few minutes to fully
    > >>>    recover in the event of a catastrophe.
    > >>>
    > >>>    The other option is to have a standby server if using Celery
    > >>>    isn't ideal. With that in mind, I have tried using HashiCorp
    > >>>    Nomad to handle the services. In my limited trial, it did what we
    > >>>    wanted, but we need more time to test.
    > >>>
    > >>>>    On Fri, Jun 8, 2018, 4:23 AM Naik Kaxil <k.naik@reply.com> wrote:
    > >>>>
    > >>>> Hi guys,
    > >>>>
    > >>>>
    > >>>>
    > >>>> I have 2 specific questions for those using Airflow in production:
    > >>>>
    > >>>>
    > >>>>
    > >>>>   1. How have you achieved high availability? What does the
    > >>>>      architecture look like? Do you replicate the master node as
    > >>>>      well?
    > >>>>   2. Scale up vs scale out?
    > >>>>      1. Which approach do you prefer: 1 beefy Airflow VM with
    > >>>>         worker, scheduler and webserver using the Local Executor,
    > >>>>         or a cluster with multiple workers using the Celery
    > >>>>         Executor?
    > >>>>
    > >>>>
    > >>>>
    > >>>> I think this thread should help others with similar questions as
    > >>>> well.
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>> Regards,
    > >>>>
    > >>>> Kaxil
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>> Kaxil Naik
    > >>>>
    > >>>> Data Reply
    > >>>> 2nd Floor, Nova South
    > >>>> 160 Victoria Street, Westminster
    > >>>> London SW1E 5LB - UK
    > >>>> phone: +44 (0)20 7730 6000 <+44%2020%207730%206000>
    > >>>> k.naik@reply.com
    > >>>> www.reply.com
    > >>>>
    > >>>>
    > >>>
    > >>>
    > >>
    >

