airflow-commits mailing list archives

From "ASF subversion and git services (Jira)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-5889) AWS Batch Operator - API request limits should not fail a task
Date Thu, 12 Dec 2019 11:32:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994557#comment-16994557 ]

ASF subversion and git services commented on AIRFLOW-5889:
----------------------------------------------------------

Commit 479ee639219b1f3454b98c14811dfcdf7c4b4693 in airflow's branch refs/heads/master from Darren Weber
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=479ee63 ]

[AIRFLOW-5889] Make polling for AWS Batch job status more resilient (#6765)

- errors in polling for job status should not fail
  the airflow task when the polling hits an API throttle
  limit; polling should detect those cases and retry a
  few times to get the job status, only failing the task
  when the job description cannot be retrieved
- added typing for the BatchProtocol method return
  types, based on the botocore.client.Batch types
- applied trivial format consistency using black, i.e.
  $ black -t py36 -l 96 {files}
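
The retry behavior described in the commit message above can be sketched roughly as follows, assuming a plain boto3 Batch client; the helper name get_job_description and the retry bound are illustrative, not the actual code merged in #6765:

{code:python}
import random
import time

import boto3
from botocore.exceptions import ClientError

MAX_STATUS_RETRIES = 5  # illustrative bound, not the value used in the PR


def get_job_description(batch_client, job_id: str) -> dict:
    """Poll DescribeJobs, retrying a few times when the API throttles us."""
    for attempt in range(1, MAX_STATUS_RETRIES + 1):
        try:
            response = batch_client.describe_jobs(jobs=[job_id])
            return response["jobs"][0]
        except ClientError as err:
            error_code = err.response.get("Error", {}).get("Code", "")
            if error_code != "TooManyRequestsException":
                raise  # a real error, not a throttle: let it fail the task
            if attempt == MAX_STATUS_RETRIES:
                raise  # only fail when the job description cannot be retrieved
            # pause with jitter before polling again
            time.sleep(random.uniform(1, 10) * attempt)


batch_client = boto3.client("batch")
print(get_job_description(batch_client, "example-job-id")["status"])
{code}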

> AWS Batch Operator - API request limits should not fail a task
> --------------------------------------------------------------
>
>                 Key: AIRFLOW-5889
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5889
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: aws, contrib
>    Affects Versions: 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6
>            Reporter: Darren Weber
>            Assignee: Darren Weber
>            Priority: Major
>              Labels: AWS, aws-batch
>             Fix For: 1.10.7
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, the fallback is the exponential-backoff routine for the status checks on the batch job.
> Unfortunately, when the concurrency of Airflow tasks is very high (hundreds of tasks), this fallback polling hits the AWS Batch API hard enough that the API throttle throws an error, which fails the Airflow task simply because the status is polled too frequently.
> Airflow then issues a retry of the task even though the original batch job is still running, resulting in duplicate batch jobs.
> An exception thrown for an AWS API throttle limit should not fail the task; it should only pause the polling for job status and retry the poll.
> This is an example of an API throttle exception:
> {code:java}
> An error occurred (TooManyRequestsException) when calling the DescribeJobs operation
> (reached max retries: 4): Too Many Requests
> {code}
> This exception should be handled while waiting for a job to complete; it must not result in a job retry.
> Reduced polling rates help (https://issues.apache.org/jira/browse/AIRFLOW-5218), but additional exception handling in the polling function is required.
> Within that exception handling, a random pause before retrying the poll could help avoid hitting the API throttle limits. Maybe the class could expose a parameter for the rate of polling (or a callable)?
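> As a hypothetical sketch only (the function and parameter names below are invented here and are not part of any merged change), exposing the polling rate as a callable could look like this:
> {code:python}
> import random
> import time
> from typing import Callable, Optional
>
>
> def poll_until_terminal(
>     get_status: Callable[[], str],
>     delay: Optional[Callable[[int], float]] = None,
>     max_polls: int = 100,
> ) -> str:
>     """Poll a status callable until the job reaches a terminal state.
>
>     `delay` receives the attempt number and returns seconds to sleep;
>     the default adds random jitter to ease pressure on the API throttle.
>     """
>     delay = delay or (lambda attempt: random.uniform(5, 30))
>     for attempt in range(1, max_polls + 1):
>         status = get_status()
>         if status in ("SUCCEEDED", "FAILED"):
>             return status
>         time.sleep(delay(attempt))
>     raise TimeoutError("job did not reach a terminal state")
> {code}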
> Another consideration is the possible use of something like the sensor-poke approach, with rescheduling, so that the polling process does not occupy a worker for the full duration of a batch job, e.g.
> - [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
> If a rescheduling approach is adopted, similar API throttle considerations apply.
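> A minimal sketch of that sensor-style alternative, assuming BaseSensorOperator with mode="reschedule" (the sensor class below is illustrative, not an existing Airflow operator):
> {code:python}
> import boto3
> from airflow.sensors.base_sensor_operator import BaseSensorOperator
> from airflow.utils.decorators import apply_defaults
>
>
> class BatchJobSensor(BaseSensorOperator):
>     """Illustrative sensor that frees its worker slot between pokes."""
>
>     @apply_defaults
>     def __init__(self, job_id, **kwargs):
>         super().__init__(mode="reschedule", poke_interval=60, **kwargs)
>         self.job_id = job_id
>
>     def poke(self, context):
>         client = boto3.client("batch")
>         jobs = client.describe_jobs(jobs=[self.job_id])["jobs"]
>         status = jobs[0]["status"] if jobs else None
>         if status == "FAILED":
>             raise RuntimeError("AWS Batch job {} failed".format(self.job_id))
>         return status == "SUCCEEDED"
> {code}
> Because each poke is a separate short-lived check, the same throttle-aware retry around describe_jobs would still be needed inside poke.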



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
