airflow-dev mailing list archives

From Deng Xiaodong <xd.den...@gmail.com>
Subject A Naive Multi-Scheduler Architecture Experiment of Airflow
Date Wed, 31 Oct 2018 14:38:25 GMT
Hi Folks,

Previously I initiated a discussion about best practices for setting up Airflow, and a few
folks agreed that the scheduler may become one of the bottleneck components (we can only run
one scheduler instance, so it can only scale vertically rather than horizontally, etc.).
Especially when we have thousands of DAGs, the scheduling latency may be high.

In our team, we have experimented with a naive multiple-scheduler architecture. I would like
to share it here and also seek your input.

*1. Background*
- Inside DAG_Folder, we can have sub-folders.
- When we start a scheduler instance, we can pass “--subdir” to it, which specifies the
directory that the scheduler is going to “scan” (https://airflow.apache.org/cli.html#scheduler).

*2. Our Naive Idea*
Say we have 2,000 DAGs. If we run one single scheduler instance, one scheduling loop will
traverse all 2K DAGs.

Our idea is:
Step-1: Create multiple sub-directories, say five, under DAG_Folder (subdir1, subdir2, …,
subdir5)
Step-2: Distribute the DAGs evenly into these sub-directories (400 DAGs in each)
Step-3: Start one scheduler instance on each of 5 different machines, using the command
`airflow scheduler --subdir subdir<i>` on machine <i>.

Hence eventually, each scheduler only needs to take care of 400 DAGs.
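
For illustration, the three steps above could be sketched in Python roughly as follows. This is only a minimal sketch, not what we actually ran: the helper names (`plan_partition`, `apply_partition`, `scheduler_command`) are hypothetical, the round-robin assignment is just one simple way to get an even split, and passing the full sub-directory path to `--subdir` is an assumption.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def plan_partition(dag_files: List[str], n_subdirs: int) -> Dict[str, List[str]]:
    """Assign DAG file names round-robin to subdir1..subdirN (illustrative names)."""
    if n_subdirs < 1:
        raise ValueError("n_subdirs must be at least 1")
    plan: Dict[str, List[str]] = {f"subdir{i}": [] for i in range(1, n_subdirs + 1)}
    # Sort first so the assignment is deterministic across runs.
    for index, name in enumerate(sorted(dag_files)):
        plan[f"subdir{index % n_subdirs + 1}"].append(name)
    return plan


def apply_partition(dag_folder: Path, plan: Dict[str, List[str]]) -> None:
    """Move each DAG file from the top of dag_folder into its assigned sub-directory."""
    for subdir, names in plan.items():
        target = dag_folder / subdir
        target.mkdir(parents=True, exist_ok=True)
        for name in names:
            shutil.move(str(dag_folder / name), str(target / name))


def scheduler_command(dag_folder: Path, subdir: str) -> List[str]:
    """Build the command to run on the machine responsible for `subdir`."""
    # Assumption: we pass the full path of the sub-directory to --subdir.
    return ["airflow", "scheduler", "--subdir", str(dag_folder / subdir)]
```

With 2,000 DAG files and `n_subdirs=5`, `plan_partition` puts 400 files into each sub-directory, and machine <i> would run the command returned by `scheduler_command(dag_folder, "subdir<i>")`.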

*3. Test & Results*
- We ran a test using 2,000 DAGs (3 tasks in each DAG).
- DAGs are stored on network-attached storage (the same drive mounted on all nodes), so
we do not need to worry about DAG_Folder synchronization.
- No conflicts were observed (each DAG file is parsed & scheduled by only one scheduler
instance).
- The scheduling speed improves almost linearly with the number of schedulers, which
demonstrates that we can scale the scheduler horizontally.

*4. Highlight*
- This naive idea doesn’t address scheduler availability.
- As Kelvin Yang shared earlier in another thread, the database may become another bottleneck
when the load is high, but we have not considered this here yet.


Kindly share your thoughts on this naive idea. Thanks.



Best regards,
XD




