airflow-commits mailing list archives

From "David Vaughan (JIRA)" <>
Subject [jira] [Created] (AIRFLOW-1139) Scheduler runs very slowly when many DAGs in DAG directory
Date Fri, 21 Apr 2017 18:09:04 GMT
David Vaughan created AIRFLOW-1139:

             Summary: Scheduler runs very slowly when many DAGs in DAG directory
                 Key: AIRFLOW-1139
             Project: Apache Airflow
          Issue Type: Improvement
    Affects Versions: 1.8.0
         Environment: macOS Sierra, v10.12.2, MacBook Pro, 2.5 GHz Intel Core i7, 16 GB RAM
            Reporter: David Vaughan
            Priority: Minor

When we have several (10-15) DAGs in our DAG directory, and each of them is pretty large (~900
tasks on average), Airflow's periodic re-processing of the DAGs in our DAG directory takes
a long time and takes resources away from running DAGs.

We almost always have only one DAG actually running at any given time, and the rest are paused.
The one running DAG, however, crawls along noticeably slower than it does when we have only one
or two DAGs total in the DAG directory.

I think it would be nice to have an option to turn off re-processing of DAGs completely, after
the initial processing.

The way we use Airflow right now, we don't edit our existing DAGs frequently, so we have no
need for periodic refresh. We have experimented with the min_file_process_interval option
in airflow.cfg, but setting it to small numbers causes no noticeable change, and setting it
to very large numbers (to emulate not refreshing at all) actually causes the DAG to run much
slower than it already was.
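
For readers unfamiliar with the setting, here is a minimal sketch of how min_file_process_interval might appear in airflow.cfg and how its value can be checked. It assumes the option lives in the [scheduler] section and is given in seconds; the value 3600, the SAMPLE_CFG string, and the helper function name are made up for illustration and are not taken from the reporter's setup.

```python
import configparser

# Hypothetical airflow.cfg excerpt. The value is illustrative only,
# not a recommendation and not the reporter's actual configuration.
SAMPLE_CFG = """
[scheduler]
# Assumed meaning: minimum number of seconds before the same DAG file
# is processed again by the scheduler.
min_file_process_interval = 3600
"""


def read_min_file_process_interval(cfg_text):
    """Parse an airflow.cfg-style string and return the interval as an int."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getint("scheduler", "min_file_process_interval")


if __name__ == "__main__":
    print(read_min_file_process_interval(SAMPLE_CFG))
```

The same check could be pointed at a real airflow.cfg by reading the file's contents into a string first, which makes it easy to confirm which value the scheduler would pick up before experimenting with larger or smaller intervals.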

Is anybody else still experiencing this? Are there existing ways to avoid this problem? Here
are some links to people referencing, I believe, this same issue, but they're all from last

Thanks in advance for any discussion or help.

This message was sent by Atlassian JIRA
