airflow-commits mailing list archives

From "Darren Weber (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency
Date Thu, 15 Aug 2019 01:57:00 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Darren Weber updated AIRFLOW-5218:
----------------------------------
    Description: 
The AWS Batch Operator attempts to use a boto3 feature that is not available, since the upstream
pull request has not been merged in years; see
 - [https://github.com/boto/botocore/pull/1307]
 - see also [https://github.com/broadinstitute/cromwell/issues/4303]

This is a curious case of premature optimization. In the meantime, the operator falls back to
an exponential backoff routine for the status checks on the batch job. Unfortunately, when the
concurrency of Airflow jobs is very high (hundreds of tasks), this fallback polling hits the
AWS Batch API too hard; the AWS API throttle raises an error, which fails the Airflow task
simply because the status is polled too frequently.

Check the output of the retry algorithm: within the first 10 retries, the status of an AWS
Batch job is checked about 10 times, at a rate of roughly 1 retry/sec. When an Airflow instance
is running tens or hundreds of concurrent batch jobs, this hits the API too frequently and
crashes the Airflow task (and it ties up a worker in busy work).
{code:python}
In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
 Out[4]: 
 [1.0,
 1.01,
 1.04,
 1.09,
 1.1600000000000001,
 1.25,
 1.36,
 1.4900000000000002,
 1.6400000000000001,
 1.81,
 2.0,
 2.21,
 2.4400000000000004,
 2.6900000000000004,
 2.9600000000000004,
 3.25,
 3.5600000000000005,
 3.8900000000000006,
 4.24,
 4.61]{code}
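
To make that concrete (this is only a sketch, not the operator's actual code), the fallback
polling amounts to roughly the following loop around boto3's {{describe_jobs}} call:
{code:python}
import time

import boto3


def poll_job_status(job_id, max_retries=20):
    """Illustrative sketch of the fallback polling, not the operator's code.

    With the delay schedule above (roughly 1 to 4.6 seconds per retry), 20
    polls finish in well under a minute, so hundreds of concurrent tasks
    add up to many DescribeJobs calls per second in aggregate.
    """
    client = boto3.client("batch")
    for retries in range(max_retries):
        response = client.describe_jobs(jobs=[job_id])
        status = response["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            return status
        # quadratic backoff: ~1 sec early on, still only ~4.6 sec by retry 19
        time.sleep(1 + pow(retries * 0.1, 2))
    raise RuntimeError("Status polling exhausted for job %s" % job_id)
{code}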
Possible solutions are to introduce an initial sleep (say 60 sec?) right after issuing the
request, so that the batch job has some time to spin up. The job progresses through several
phases before it reaches the RUNNING state, and polling tuned to each phase of that sequence
might help. Since batch jobs tend to be long-running jobs (rather than near-real-time jobs),
it might also help to issue less frequent polls while the job is in the RUNNING state; something
on the order of tens of seconds might be reasonable for batch jobs? Maybe the class could expose
a parameter for the rate of polling (or a callable)?
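
As a rough sketch of that last idea (the {{initial_delay}}, {{poll_interval}}, and {{poll_fn}}
names here are hypothetical, not existing operator arguments), the polling could look
something like:
{code:python}
import time

import boto3


def wait_for_job(job_id, initial_delay=60, poll_interval=30, poll_fn=None):
    """Hypothetical polling scheme for long-running AWS Batch jobs.

    initial_delay: sleep right after submission so the job can spin up
        through the SUBMITTED/PENDING/RUNNABLE/STARTING phases.
    poll_interval: seconds between status checks once polling starts.
    poll_fn: optional callable(retries, status) returning the next delay,
        so callers can supply their own polling rate.
    """
    client = boto3.client("batch")
    time.sleep(initial_delay)
    retries = 0
    while True:
        status = client.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(poll_fn(retries, status) if poll_fn else poll_interval)
        retries += 1
{code}
With a {{poll_interval}} on the order of 30-60 seconds for jobs in the RUNNING state, an
Airflow instance running hundreds of concurrent batch jobs would make a few API calls per
minute per task instead of roughly one per second.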


> AWS Batch Operator - status polling too often, esp. for high concurrency
> ------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5218
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: aws, contrib
>    Affects Versions: 1.10.4
>            Reporter: Darren Weber
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
