airflow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bolke de Bruin <bdbr...@gmail.com>
Subject Speeding up the scheduler - request for comments
Date Fri, 03 Jun 2016 21:26:39 GMT
Hi,

I am looking at speeding up the scheduler. Currently loop times increase with the amount of
tasks in a dag. This is due to TaskInstance.are_depedencies_met executing several aggregation
functions on the database. These calls are expensive: between 0.05-0.15s per task and for
every scheduler loop this gets called twice. This call is where the scheduler spends around
90% of its time when evaluating dags and is the reason for people that have a large amount
of tasks per dag to so quite large loop times (north of 600s).

I see 2 options to optimize the loop without going to a multiprocessing approach which will
just put the problem down the line (ie. the db or when you don’t have enough cores anymore).

1. Cache the call to TI.are_dependencies_met by either caching in a something like memcache
or removing the need for the double call (update_state and process_dag both make the call
to TI.are_dependencies_met). This would more or less cut the time in half.

2. Notify the downstream tasks of a state change of a upstream task. This would remove the
need for the aggregation as the task would just ‘know’. It is a bit harder to implement
correctly as you need to make sure you keep being in a consistent state. Obviously you could
still run a integrity check once in a while. This option would make the aggregation event
based and significantly reduce the time spend here to around 1-5% of the current scheduler.
There is a slight overhead added at a state change of the TaskInstance (managed by the TaskInstance
itself).

What do you think? My preferred option is #2. Am i missing any other options? Are scheduler
loop times a concern at all?

Thanks
Bolke



Mime
View raw message