airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Josh Bacon (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AIRFLOW-57) Rename concurrency configuration variables to be more clear
Date Tue, 09 Jan 2018 19:05:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318964#comment-16318964
] 

Josh Bacon edited comment on AIRFLOW-57 at 1/9/18 7:04 PM:
-----------------------------------------------------------

FWIW and FYI, regarding +DAG_CONCURRENCY+ and the statement_ "per_worker would suggest that
it's a global setting for workers, but I think you can have workers with different values
set for this"_. My tests using celery workers have shown that individual workers +*cannot*+
override/change this concurrency "per-worker", the scheduler's configuration is always applied
instead. In my tests CELERYD_CONCURRENCY was increased on my workers to run more concurrent
tasks, but no tasks were sent from the scheduler to fill this concurrency (which was limited
by a lower DAG_CONCURRENCY).


was (Author: jbacon):
FWIW and FYI, regarding +DAG_CONCURRENCY+ and the statement_ "per_worker would suggest that
it's a global setting for workers, but I think you can have workers with different values
set for this"_. My tests using celery workers have shown that individual workers +*cannot*+
override/change this concurrency "per-worker", the scheduler's configuration is always applied
instead. In my tests AIRLFOW__CELERY__CELERYD_CONCURRENCY was increased on my workers to run
more concurrent tasks, but no tasks were sent from the scheduler to fill this concurrency
(which was limited by a lower AIRFLOW__CORE__DAG_CONCURRENCY).

> Rename concurrency configuration variables to be more clear
> -----------------------------------------------------------
>
>                 Key: AIRFLOW-57
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-57
>             Project: Apache Airflow
>          Issue Type: Improvement
>    Affects Versions: Airflow 1.7.0
>            Reporter: Bence Nagy
>            Priority: Minor
>              Labels: easyfix, newbie
>
> Currently the following config variables exists for controlling parallel execution limits:
> {code}
> # The amount of parallelism as a setting to the executor. This defines
> # the max number of task instances that should run simultaneously
> # on this airflow installation
> parallelism = 32
> # The number of task instances allowed to run concurrently by the scheduler
> dag_concurrency = 16
> # When not using pools, tasks are run in the "default pool",
> # whose size is guided by this config element
> non_pooled_task_slot_count = 128
> # The maximum number of active DAG runs per DAG
> max_active_runs_per_dag = 16
> {code}
> Let's go through these one by one:
> {{parallelism}}: not a very descriptive name, considering that all the above settings
are for parallelism. The description says it sets the maximum task instances for the airflow
installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd
have airflow installed on two hosts, so that should be two installations, but based on context
'per installation' here means 'per Airflow state database'. I'd name this {{max_active_tasks}}.
> {{dag_concurrency}}: Despite the name based on the comment this is actually the task
concurrency, and it's per worker. I'd name this {{max_active_tasks_for_worker}} ({{per_worker}}
would suggest that it's a global setting for workers, but I think you can have workers with
different values set for this).
> {{non_pooled_task_slot_count}}: This is a weird one. I'm going to pass on suggesting
a name for it because I just can't think of any reason this config variable should exist.
We already have a global task instance limit, and we have pools to limit access to certain
resources — in what case would someone want to limit access to everything other than certain
resources? So, yeah, skipping this one. In case this was needed only due to how pools are
implemented, I'd suggest setting the limit to {{sys.maxsize}} and just removing the config
variable.
> {{max_active_runs_per_dag}}: This one's kinda alright, but since it seems to be just
a default value for the matching {{DAG}} kwarg, it might be nice to reflect that in the name,
something like {{default_max_active_runs_for_dags}}
> So let's move on to the {{DAG}} kwargs:
> {{concurrency}}: Again, having a general name like this, coupled with the fact that concurrency
is used for something different elsewhere makes this pretty confusing. I'd call this {{max_active_tasks}}.
> {{max_active_runs}}: This one sounds alright to me.
> So. If people agree that this is something that should be fixed, I think it'd be nice
to get this in the 1.7.1 release, especially considering that it should be really easy to
make the change backwards compatible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message