airflow-commits mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (AIRFLOW-2222) GoogleCloudStorageHook.copy fails for large files between locations
Date Sun, 02 Sep 2018 18:09:02 GMT


Apache Spark commented on AIRFLOW-2222:

User 'berislavlopac' has created a pull request for this issue:

> GoogleCloudStorageHook.copy fails for large files between locations
> -------------------------------------------------------------------
>                 Key: AIRFLOW-2222
>                 URL:
>             Project: Apache Airflow
>          Issue Type: Bug
>            Reporter: Berislav Lopac
>            Assignee: Berislav Lopac
>            Priority: Major
>             Fix For: 1.10.0, 2.0.0
> When copying large files (confirmed for around 3GB) between buckets in different projects,
the operation fails and the Google API returns error [413—Payload Too Large|].
The documentation for the error says:
> {quote}The Cloud Storage JSON API supports up to 5 TB objects.
> This error may, alternatively, arise if copying objects between locations and/or storage
classes can not complete within 30 seconds. In this case, use the [Rewrite|]
method instead.{quote}
> The reason seems to be that {{GoogleCloudStorageHook.copy}} uses the API {{copy}} method.
> h3. Proposed Solution
> There are two potential solutions:
> # Implement {{GoogleCloudStorageHook.rewrite}} method which can be called from operators
and other objects to ensure successful execution. This method is more flexible but requires
changes both in the {{GoogleCloudStorageHook}} class and any other classes that use it for
copying files to ensure that they explicitly call {{rewrite}} when needed.
> # Modify {{GoogleCloudStorageHook.copy}} to determine when to use {{rewrite}} instead
of {{copy}} underneath. This requires updating only the {{GoogleCloudStorageHook}} class,
but the logic might not cover all the edge cases and could be difficult to implement.
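
The issue itself contains no code, so the following is only a minimal sketch of the token loop that either option would need. It is not the change made in the pull request. It assumes `service` behaves like a googleapiclient Cloud Storage client, whose `objects().rewrite(...)` call returns a dict with `done` and, while unfinished, a `rewriteToken`. The names `rewrite_object` and `FakeStorageService` are hypothetical; the fake exists only so the sketch runs without credentials.

```python
def rewrite_object(service, source_bucket, source_object,
                   destination_bucket, destination_object):
    """Copy one object by calling objects.rewrite until it reports done.

    Assumption: `service` mimics a googleapiclient Cloud Storage client,
    so service.objects().rewrite(**kwargs).execute() returns a dict.
    """
    token = None
    while True:
        kwargs = dict(
            sourceBucket=source_bucket,
            sourceObject=source_object,
            destinationBucket=destination_bucket,
            destinationObject=destination_object,
            body={},
        )
        # Only follow-up calls carry the token returned by the previous call.
        if token is not None:
            kwargs["rewriteToken"] = token
        response = service.objects().rewrite(**kwargs).execute()
        if response.get("done"):
            return response
        token = response["rewriteToken"]


class FakeStorageService:
    """Hypothetical in-memory stand-in that finishes after N calls."""

    def __init__(self, calls_needed):
        self.calls_needed = calls_needed
        self.seen_tokens = []   # tokens received, in call order
        self._kwargs = None

    def objects(self):
        return self

    def rewrite(self, **kwargs):
        self._kwargs = kwargs
        return self

    def execute(self):
        token = self._kwargs.get("rewriteToken")
        self.seen_tokens.append(token)
        step = 1 if token is None else int(token) + 1
        if step >= self.calls_needed:
            return {"done": True}
        return {"done": False, "rewriteToken": str(step)}


fake = FakeStorageService(calls_needed=3)
result = rewrite_object(fake, "src-bucket", "big.bin", "dst-bucket", "big.bin")
```

Under the first option, a loop like this would be exposed as its own hook method. Under the second, {{copy}} could delegate to it internally, at the cost of deciding when delegation is appropriate.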

This message was sent by Atlassian JIRA
