airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel (Daniel Lamblin) [BDP - Seoul]" <lamb...@coupang.com>
Subject Re: A Naive Multi-Scheduler Architecture Experiment of Airflow
Date Fri, 09 Nov 2018 12:48:30 GMT
The missing points you brought up, yes that was one of the reasons it seemed like getting zookeeper
or a DB coordinated procedure involved to both count and number the schedulers and mark one
of them the lead. Locking each dag file for processing sounds easier, but we were seeing update
transactions fail already without adding more pressure on the DB. Locking is another thing
zk can handle. But adding zk seems like such deployment overhead that scheduler type like
executor type needs to become a modular option in the process of the change.

The ignore pattern method that was in use described earlier was basically adding an entry
to a top level .airflowignore file via a flag or env instead of making the file. Now that
was simple, with all the drawbacks mentioned already.

Get Outlook for Android<https://aka.ms/ghei36>

________________________________
From: Maxime Beauchemin <maximebeauchemin@gmail.com>
Sent: Friday, November 9, 2018 5:03:02 PM
To: dev@airflow.incubator.apache.org
Cc: dev@airflow.apache.org; yrqls21@gmail.com
Subject: Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

[CAUTION]: This email originated from outside of the organization. Do not click links or open
attachments unless you recognize the sender and know the content is safe.
[주의]: 본 이메일은 회사 외부에서 유입된 이메일입니다. 발신자의
신원과 이메일 내용이 안전한지 확인하기 전까지는 이메일에 포함된
링크를 클릭하거나 첨부파일을 열지 마십시오.


I mean at that point it's just as easy (or easier) to do things properly:
get the scheduler subprocesses to take a lock on the DAG it's about to
process, and release it when it's done. Add a lock timestamp and bit of
logic to expire locks (to self heal if the process ever crashed and failed
at releasing the lock). Of course make sure that the
confirm-its-not-locked-and-take-a-lock process is insulated in a database
transaction, and your'e mostly good. That sounds like a very easy thing to
do.

The only thing that's missing at that point to fully support
multi-schedulers is to centralize the logic that does the prioritization
and pushing to workers. That's a bit more complicated, it assumes a leader
(and leader election), and to change the logic of how individual
"DAG-evaluator processes" communicate what task instances are runnable to
that leader (over a message queue? over the database?).

Max

On Thu, Nov 8, 2018 at 10:02 AM Daniel (Daniel Lamblin) [BDP - Seoul] <
lamblin@coupang.com> wrote:

> Since you're discussing multi-scheduler trials,
> Based on v1.8 we have also tried something, based on passing in a regex to
> each scheduler; DAG file paths which match it are ignored. This required
> turning off some logic that deletes dag data for dags that are missing from
> the dagbag.
> It is pretty manual and not evenly distributed, but it allows some 5000+
> DAGs or so with 6 scheduler instances. That said there's some pain around
> maintaining such a setup, so we didn't opt for it (yet) in our v1.10 setup.
> The lack of cleaning up an old dag name is also not great (it can be done
> semi manually). Then there's the work in trying to redefine patterns for
> better mixes, testing that patterns don't all ignore the same file, nor
> that more than one scheduler includes the same file. I generally wouldn't
> suggest this approach.
>
> In considering to setup a similar modification to v1.10, we thought it
> would make sense to instead tell each scheduler which scheduler number it
> is, and how many total schedulers there are. Then each scheduler can use
> some hash (cityhash?) on the whole py file path, mod it by the scheduler
> count, and only parse it if it matches its scheduler number.
>
> This seemed like a good way to keep a fixed number of schedulers balancing
> new dag files, but we didn't do it (yet) because we started to think about
> getting fancier: what if a scheduler needs to be added? Can it be done
> without stopping the others and update the total count; or vice-versa for
> removing a scheduler. If one scheduler drops out can the others renumber
> themselves? If that could be solved, then the schedulers could be made into
> an autoscaling group… For this we thought about wrapping the whole
> scheduler instance's process up in some watchdog that might coordinate with
> something like zookeeper (or by using the existing airflow DB) but it got
> to be full of potential loopholes for the schedulers, like needing to be in
> sync about refilling the dagbag in concert with each other when there's a
> change in the total count, and problems when one drops off but is actually
> not really down for the count and pops back in having missed that the
> others decided changed their numbering, etc.
>
> I bring this up because the basic form of the ideas doesn't hinge on which
> folder a dag is in, which seems more likely to work nicely with team based
> hierarchies which also import reusable modules across DAG files.
> -Daniel
> P.S. yeah we did find there were times when schedulers exited because
> there was a db lock on task instances they were trying to update. So the DB
> needs to be managed by someone who knows how to scale it for that… or
> possibly the model needs to be made more conducive to minimally locking
> updates.
>
> On 10/31/18, 11:38 PM, "Deng Xiaodong" <xd.deng.r@gmail.com> wrote:
>
>     Hi Folks,
>
>     Previously I initiated a discussion about the best practice of Airflow
> setting-up, and it was agreed by a few folks that scheduler may become one
> of the bottleneck component (we can only run one scheduler instance, can
> only scale vertically rather than horizontally, etc.). Especially when we
> have thousands of DAGs, the scheduling latency may be high.
>
>     In our team, we have experimented a naive multiple-scheduler
> architecture. Would like to share here, and also seek inputs from you.
>
>     *1. Background*
>     - Inside DAG_Folder, we can have sub-folders.
>     - When we initiate scheduler instance, we can specify “--subdir” for
> it, which will specify the specific directory that the scheduler is going
> to “scan” (https://airflow.apache.org/cli.html#scheduler).
>
>     *2. Our Naive Idea*
>     Say we have 2,000 DAGs. If we run one single scheduler instance, one
> scheduling loop will traverse all 2K DAGs.
>
>     Our idea is:
>     Step-1: Create multiple sub-directories, say five, under DAG_Folder
> (subdir1, subdir2, …, subdir5)
>     Step-2: Distribute the DAGs evenly into these sub-directories (400
> DAGs in each)
>     Step-3: then we can start scheduler instance on 5 different machines,
> using command `airflow scheduler --subdir subdir<i>` on machine <i>.
>
>     Hence eventually, each scheduler only needs to take care of 400 DAGs.
>
>     *3. Test & Results*
>     - We have done a testing using 2,000 DAGs (3 tasks in each DAG).
>     - DAGs are stored using network attached storage (the same drive
> mounted to all nodes), so we don’t concern about the DAG_Folder
> synchronization.
>     - No conflict observed (each DAG file will only be parsed & scheduled
> by one scheduler instance).
>     - The scheduling speed improves almost linearly. Demonstrated that we
> can scale scheduler horizontally.
>
>     *4. Highlight*
>     - This naive idea doesn’t address scheduler availability.
>     - As Kelvin Yang shared earlier in another thread, the database may be
> another bottleneck when the load is high. But this is not considered here
> yet.
>
>
>     Kindly share your thoughts on this naive idea. Thanks.
>
>
>
>     Best regards,
>     XD
>
>
>
>
>
>
>
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message