airflow-dev mailing list archives

From Kevin Yang <yrql...@gmail.com>
Subject Re: A Naive Multi-Scheduler Architecture Experiment of Airflow
Date Wed, 31 Oct 2018 20:00:05 GMT
Finally we're starting to talk about this seriously? Yeah! :D

For your approach, a few thoughts:

   1. Sharding by # of files may not yield the same load on each scheduler--it
   may even yield very different load, since we may have a framework DAG file
   that produces 500 DAGs and takes forever to parse (see the sketch after
   this list).
   2. I think Alex Guziel <https://github.com/saguziel> had previously
   talked about using Apache Helix to shard the scheduler. I haven't looked
   into it much, but it may be something you're interested in. I personally
   like that idea because we wouldn't need to reinvent the wheel for a lot
   of stuff (less code to maintain too ;) ).
   3. About the DB part, I should be contributing back some changes that
   can dramatically drop the DB CPU usage. Afterwards I think we should have
   plenty of headroom (assuming the traffic is ~4,000 DAG files and ~40k
   concurrently running task instances), so we should probably be fine here.
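
For point 1, here's a rough sketch of what I mean by a framework DAG file
(the names and numbers below are made up, just to make the point concrete):

```python
# framework_dag.py -- one file that produces hundreds of DAGs, so it costs
# far more to parse than a typical single-DAG file (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Imagine this list coming from a config store with ~500 entries.
TEAM_CONFIGS = [{"name": "team_%03d" % i, "schedule": "@daily"} for i in range(500)]

for cfg in TEAM_CONFIGS:
    dag = DAG(
        dag_id="framework_%s" % cfg["name"],
        schedule_interval=cfg["schedule"],
        start_date=datetime(2018, 1, 1),
    )
    DummyOperator(task_id="placeholder", dag=dag)
    # The DAG object must end up in the module namespace for the
    # scheduler to pick it up.
    globals()[dag.dag_id] = dag
```

A file like this will dominate the parse time of whatever shard it lands in,
while a shard full of small single-DAG files finishes its loop much faster.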

Also I'm kinda curious about your setup and want to understand why you need
to shard the scheduler, since the scheduler can now actually scale up pretty
high.

Thank you for initiating the discussion. I think it can turn out to be a very
valuable and critical one--many people have been thinking about/discussing
this, and I can't wait to hear the ideas :D

Cheers,
Kevin Y

On Wed, Oct 31, 2018 at 7:38 AM Deng Xiaodong <xd.deng.r@gmail.com> wrote:

> Hi Folks,
>
> Previously I initiated a discussion about the best practice for setting up
> Airflow, and a few folks agreed that the scheduler may become one of the
> bottleneck components (we can only run one scheduler instance, which can
> only scale vertically rather than horizontally, etc.). Especially when we
> have thousands of DAGs, the scheduling latency may be high.
>
> In our team, we have experimented with a naive multi-scheduler architecture.
> I would like to share it here, and also seek your input.
>
> **1. Background**
> - Inside DAG_Folder, we can have sub-folders.
> - When we start a scheduler instance, we can specify “--subdir” for it,
> which specifies the directory that the scheduler is going to
> “scan” (https://airflow.apache.org/cli.html#scheduler).
>
> **2. Our Naive Idea**
> Say we have 2,000 DAGs. If we run a single scheduler instance, one
> scheduling loop will traverse all 2,000 DAGs.
>
> Our idea is:
> Step-1: Create multiple sub-directories, say five, under DAG_Folder
> (subdir1, subdir2, …, subdir5)
> Step-2: Distribute the DAGs evenly into these sub-directories (400 DAGs in
> each; a rough sketch of such a distribution script is included below)
> Step-3: Start a scheduler instance on 5 different machines, running the
> command `airflow scheduler --subdir subdir<i>` on machine <i>.
>
> Hence eventually, each scheduler only needs to take care of 400 DAGs.
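>
> To make Step-2 concrete, here is a rough sketch of how the distribution
> could be scripted (the paths and shard count below are just placeholders):
>
> ```python
> # distribute_dags.py -- round-robin the DAG files into N sub-directories.
> # The folder path and number of shards are placeholders for illustration.
> import glob
> import os
> import shutil
>
> DAG_FOLDER = "/mnt/airflow/dags"  # the shared network-attached DAG_Folder
> NUM_SHARDS = 5
>
> for i, path in enumerate(sorted(glob.glob(os.path.join(DAG_FOLDER, "*.py")))):
>     subdir = os.path.join(DAG_FOLDER, "subdir%d" % (i % NUM_SHARDS + 1))
>     os.makedirs(subdir, exist_ok=True)
>     shutil.move(path, os.path.join(subdir, os.path.basename(path)))
> ```
>
> Any deterministic assignment would work; round-robin simply keeps the
> shard sizes equal.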
>
> **3. Test & Results**
> - We have done a test using 2,000 DAGs (3 tasks in each DAG).
> - DAGs are stored on network-attached storage (the same drive mounted
> to all nodes), so we don’t need to worry about DAG_Folder synchronization.
> - No conflicts were observed (each DAG file is only parsed & scheduled by
> one scheduler instance).
> - The scheduling speed improves almost linearly, demonstrating that we can
> scale the scheduler horizontally.
>
> **4. Highlight**
> - This naive idea doesn’t address scheduler availability.
> - As Kevin Yang shared earlier in another thread, the database may be
> another bottleneck when the load is high, but this is not considered here
> yet.
>
>
> Kindly share your thoughts on this naive idea. Thanks.
>
>
>
> Best regards,
> XD
>
