airflow-dev mailing list archives

From Nadeem Ahmed Nazeer <naz...@neon-lab.com>
Subject airflow scheduler slowness as tasks increase
Date Wed, 13 Jul 2016 08:43:11 GMT
Hi,

We are using Airflow to build a data pipeline that runs tasks on an
ephemeral Amazon EMR cluster. The oldest data we have is from 2014-05-26,
which we have set as the start date, with a schedule interval of 1 day.

Our DAG has an S3 copy task, a MapReduce task, and a bunch of Hive and
Impala load tasks, all run via PythonOperator. Our expectation is for
Airflow to run each of these tasks for each day from the start date to the
current date.

Just for numbers, approximately 800 DAG runs were created from the start
date to the current date (2016-07-13). All is well at the start of
execution, but as more and more tasks execute, the scheduling of tasks
slows down. It looks like the scheduler is spending a lot of time checking
states and doing other housekeeping tasks.
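
For a rough sense of scale, here is a minimal sketch of the date arithmetic
behind that figure. It uses only the Python standard library and is not
taken from the DAG file; the helper name daily_run_count is hypothetical,
and it assumes one run per calendar day with both end dates included.
Airflow's own count may differ slightly, since a run is normally created
only after its schedule interval has ended.

```python
from datetime import date


def daily_run_count(start, end):
    """Count calendar days from start to end, inclusive (hypothetical helper)."""
    return (end - start).days + 1


# Dates taken from the message: pipeline start date and the date of writing.
start_date = date(2014, 5, 26)
current_date = date(2016, 7, 13)

runs = daily_run_count(start_date, current_date)
print(runs)  # 780, in line with the "approximately 800" quoted above
```

Each of those runs carries every task in the DAG, so the number of task
instances the scheduler has to track is this count multiplied by the
number of tasks per run.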

One scheduler loop is taking roughly 240 to 300 seconds because of the huge
number of tasks. It has been running my DAGs for over 24 hours now with
little progress. I am starting the scheduler process so that it restarts
after every 5 runs, which is the default (airflow scheduler -n 5).
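
To put those loop times in perspective, a small sketch of the arithmetic,
assuming each of the 5 runs passed via -n corresponds to one scheduler
loop (an assumption, not something confirmed here). The function names are
hypothetical and the code is plain Python, independent of Airflow.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def loops_per_day(loop_seconds):
    """Whole scheduler loops that fit in 24 hours at the given loop time."""
    return SECONDS_PER_DAY // loop_seconds


def seconds_between_restarts(loop_seconds, num_runs=5):
    """Time until the scheduler exits, assuming one loop per run."""
    return loop_seconds * num_runs


# Loop durations observed in the message: 240 to 300 seconds.
for loop_seconds in (240, 300):
    print(
        loop_seconds,
        loops_per_day(loop_seconds),
        seconds_between_restarts(loop_seconds),
    )
```

Under that assumption the scheduler completes only about 288 to 360 loops
in 24 hours and restarts every 20 to 25 minutes.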

I did play around with different parallelism and config parameters without
much success. I am looking for some help getting the scheduler to schedule
tasks quickly and effectively. Please help.

Configs:
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 99999
celeryd_concurrency = 16
scheduler_heartbeat_sec = 5

Thanks,
Nadeem
