airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-807) Scheduler is very slow when a .py file has many DAGs in it
Date Wed, 25 Jan 2017 21:59:26 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838667#comment-15838667
] 

ASF subversion and git services commented on AIRFLOW-807:
---------------------------------------------------------

Commit 6b2a3ca2ee4ee3415ef72ea1fa3fc694350e9efc in incubator-airflow's branch refs/heads/master
from [~criccomini]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=6b2a3ca ]

[AIRFLOW-807] Improve scheduler performance for large DAGs

MySQL's query optimizer selects the wrong index, this
has a significant impact on the performance of the
scheduler.

Closes #2021 from criccomini/AIRFLOW-807


> Scheduler is very slow when a .py file has many DAGs in it
> ----------------------------------------------------------
>
>                 Key: AIRFLOW-807
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-807
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.8.0b2
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>             Fix For: Airflow 1.8
>
>
> While running Airflow 1.8.0b2 in production, we noticed a significant performance issue
with one of our DAGs.
> The .py file (called db.py) generates a bunch of DAGs. This file was taking > 900
seconds for the scheduler to process, which was introducing significant delays in our data
pipeline.
> We enabled slow_query log for MySQL, and saw that this query was taking more than 10
seconds per DAG in the .py file:
> {code:sql}
> SELECT task_instance.task_id AS task_id, max(task_instance.execution_date) AS max_ti

> FROM task_instance 
> WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success' AND task_instance.task_id
IN ('t1', 't2') GROUP BY task_instance.task_id
> {code}
> This query is run inside jobs.py's manage_slas method. When running an explain, we can
see that MySQL is using the wrong index for it:
> {noformat}
> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
> | id | select_type | table         | type | possible_keys                           
          | key      | key_len | ref   | rows  | Extra                    |
> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
> |  1 | SIMPLE      | task_instance | ref  | PRIMARY,ti_dag_state,ti_pool,ti_state_lkp,ti_state
| ti_state | 63      | const | 81898 | Using where; Using index |
> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
> {noformat}
> It's using ti_state, but should be using ti_primary. We tried running ANALYZE/OPTIMIZE
on the {{task_instance}} table, but it didn't improve the query plan or performance time.
> Next, we added a hint to the SqlAlchemy query object, which improved the performance
by about 10x, dropping the db.py parsing down to 90 seconds.
> I then got another 2x boost by simply aborting the manage_slas method at the start if
the DAG has no tasks SLAs in it (none of our DAGs do). This dropped the db.py parse time to
45-50 seconds.
> This JIRA is to add a short circuit in manage_slas, and a hint for MySQL in the query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message