airflow-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bence Nagy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-57) Rename concurrency configuration variables to be more clear
Date Fri, 06 May 2016 13:30:12 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-57?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Bence Nagy updated AIRFLOW-57:
------------------------------
    Description: 
Currently the following config variables exists for controlling parallel execution limits:

{code}
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
{code}

Let's go through these one by one:

{{parallelism}}: not a very descriptive name, considering that all the above settings are
for parallelism. The description says it sets the maximum task instances for the airflow installation,
which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow
installed on two hosts, so that should be two installations, but based on context 'per installation'
here means 'per Airflow state database'. I'd name this {{max_active_tasks}}.

{{dag_concurrency}}: Despite the name based on the comment this is actually the task concurrency,
and it's per worker. I'd name this {{max_active_tasks_for_worker}} ({{per_worker}} would suggest
that it's a global setting for workers, but I think you can have workers with different values
set for this).

{{non_pooled_task_slot_count}}: This is a weird one. I'm going to pass on suggesting a name
for it because I just can't think of any reason this config variable should exist. We already
have a global task instance limit, and we have pools to limit access to certain resources
— in what case would someone want to limit access to everything other than certain resources?
So, yeah, skipping this one. In case this was needed only due to how pools are implemented,
I'd suggest setting the limit to {{sys.maxsize}} and just removing the config variable.

{{max_active_runs_per_dag}}: This one's kinda alright, but since it seems to be just a default
value for the matching {{DAG}} kwarg, it might be nice to reflect that in the name, something
like {{default_max_active_runs_for_dags}}

So let's move on to the {{DAG}} kwargs:

{{concurrency}}: Again, having a general name like this, coupled with the fact that concurrency
is used for something different elsewhere makes this pretty confusing. I'd call this {{max_active_tasks}}.

{{max_active_runs}}: This one sounds alright to me.

So. If people agree that this is something that should be fixed, I think it'd be nice to get
this in the 1.7.1 release, especially considering that it should be really easy to make the
change backwards compatible.

  was:
Currently the following config variables exists for controlling parallel execution limits:

{code}
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
{code}

Let's go through these one by one:

{{parallelism}}: not a very descriptive name, considering that all the above settings are
for parallelism. The description says it sets the maximum task instances for the airflow installation,
which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow
installed on two hosts, so that should be two installations, but based on context 'per installation'
here means 'per Airflow state database'. I'd name this {{max_active_tasks}}.

{{dag_concurrency}}: Despite the name based on the comment this is actually the task concurrency,
and it's per worker. I'd name this {{max_active_tasks_for_worker}} ({{per_worker}} would suggest
that it's a global setting for workers, but I think you can have workers with different values
set for this).

{{non_pooled_task_slot_count}}: This is a weird one. I'm going to pass on suggesting a name
for it because I just can't think of any reason this config variable should exist. We already
have a global task instance limit, and we have pools to limit access to certain resources
— in what case would someone want to limit access to everything other than certain resources?
So, yeah, skipping this one.

{{max_active_runs_per_dag}}: This one's kinda alright, but since it seems to be just a default
value for the matching {{DAG}} kwarg, it might be nice to reflect that in the name, something
like {{default_max_active_runs_for_dags}}

So let's move on to the {{DAG}} kwargs:

{{concurrency}}: Again, having a general name like this, coupled with the fact that concurrency
is used for something different elsewhere makes this pretty confusing. I'd call this {{max_active_tasks}}.

{{max_active_runs}}: This one sounds alright to me.

So. If people agree that this is something that should be fixed, I think it'd be nice to get
this in the 1.7.1 release, especially considering that it should be really easy to make the
change backwards compatible.


> Rename concurrency configuration variables to be more clear
> -----------------------------------------------------------
>
>                 Key: AIRFLOW-57
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-57
>             Project: Apache Airflow
>          Issue Type: Improvement
>    Affects Versions: Airflow 1.7.0
>            Reporter: Bence Nagy
>            Priority: Minor
>              Labels: newbie
>
> Currently the following config variables exists for controlling parallel execution limits:
> {code}
> # The amount of parallelism as a setting to the executor. This defines
> # the max number of task instances that should run simultaneously
> # on this airflow installation
> parallelism = 32
> # The number of task instances allowed to run concurrently by the scheduler
> dag_concurrency = 16
> # When not using pools, tasks are run in the "default pool",
> # whose size is guided by this config element
> non_pooled_task_slot_count = 128
> # The maximum number of active DAG runs per DAG
> max_active_runs_per_dag = 16
> {code}
> Let's go through these one by one:
> {{parallelism}}: not a very descriptive name, considering that all the above settings
are for parallelism. The description says it sets the maximum task instances for the airflow
installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd
have airflow installed on two hosts, so that should be two installations, but based on context
'per installation' here means 'per Airflow state database'. I'd name this {{max_active_tasks}}.
> {{dag_concurrency}}: Despite the name based on the comment this is actually the task
concurrency, and it's per worker. I'd name this {{max_active_tasks_for_worker}} ({{per_worker}}
would suggest that it's a global setting for workers, but I think you can have workers with
different values set for this).
> {{non_pooled_task_slot_count}}: This is a weird one. I'm going to pass on suggesting
a name for it because I just can't think of any reason this config variable should exist.
We already have a global task instance limit, and we have pools to limit access to certain
resources — in what case would someone want to limit access to everything other than certain
resources? So, yeah, skipping this one. In case this was needed only due to how pools are
implemented, I'd suggest setting the limit to {{sys.maxsize}} and just removing the config
variable.
> {{max_active_runs_per_dag}}: This one's kinda alright, but since it seems to be just
a default value for the matching {{DAG}} kwarg, it might be nice to reflect that in the name,
something like {{default_max_active_runs_for_dags}}
> So let's move on to the {{DAG}} kwargs:
> {{concurrency}}: Again, having a general name like this, coupled with the fact that concurrency
is used for something different elsewhere makes this pretty confusing. I'd call this {{max_active_tasks}}.
> {{max_active_runs}}: This one sounds alright to me.
> So. If people agree that this is something that should be fixed, I think it'd be nice
to get this in the 1.7.1 release, especially considering that it should be really easy to
make the change backwards compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message