airflow-commits mailing list archives

From "Barry Hart (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AIRFLOW-15) Remove GCloud from Airflow
Date Wed, 06 Dec 2017 18:42:00 GMT

    [ https://issues.apache.org/jira/browse/AIRFLOW-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280644#comment-16280644
] 

Barry Hart commented on AIRFLOW-15:
-----------------------------------

I understand. "Wait and see" might be a reasonable strategy (i.e. until Google clarifies
the message). The specific reason for my comment is that one of our DAGs transfers a very
large number of files to and from Google Storage. With this number of files, we almost always
see some transient 5XX errors from the Google side, so we see some value in the google-cloud-python library,
which has retry logic built in, both [generally|https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/api_core/google/api_core/retry.py]
and [specifically for Google Storage|https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/storage/google/cloud/storage/blob.py#L84-L91].

Although Airflow has its own retry support, I see that as intended for coarse-grained
retries (i.e. when one task does a few things). When one task is transferring thousands of
files, it seems useful to also retry within the task (per file).
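
To make the per-file idea concrete, here is a minimal sketch of what retrying inside a task could look like. It uses only the standard library and is not the google-cloud-python implementation; {{TransientError}}, {{retry_call}}, {{transfer_all}} and the {{transfer_one}} callable are hypothetical names invented for illustration, and the backoff values are arbitrary.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable 5XX response from the storage service."""


def retry_call(func, *args, attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call func(*args), retrying on TransientError with exponential backoff plus jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


def transfer_all(paths, transfer_one, **retry_kwargs):
    """Transfer each path with its own retry budget; return the paths that still failed."""
    failed = []
    for path in paths:
        try:
            retry_call(transfer_one, path, **retry_kwargs)
        except TransientError:
            failed.append(path)
    return failed
```

With something like this, a single flaky file only costs a few extra calls, and the task can still raise at the end if {{transfer_all}} returns a non-empty list, so that Airflow's task-level retry remains the fallback.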

Let me know what you think. It may be worth creating a ticket about retries to perhaps get
input from other users. For now, we can use the google-cloud-python library directly from
our DAGs.

> Remove GCloud from Airflow
> --------------------------
>
>                 Key: AIRFLOW-15
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-15
>             Project: Apache Airflow
>          Issue Type: Task
>          Components: gcp
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>              Labels: gcp
>
> After speaking with Google, there was some concern about using the [gcloud-python|https://github.com/GoogleCloudPlatform/gcloud-python] library for Airflow. There are several concerns:
> # It's not clear (even to people at Google) what this library is, who owns it, etc.
> # It does not support all services (the way [google-api-python-client|https://github.com/google/google-api-python-client] does).
> # There are compatibility issues between google-api-python-client and gcloud-python.
> We currently support both libraries, depending on which package you install: {{airflow[gcp_api]}} or {{airflow[gcloud]}}. This ticket is to remove the {{airflow[gcloud]}} package, and all associated code.
> The main associated code, afaik, is the use of the {{gcloud}} library in the Google Cloud Storage hooks/operators, specifically for Google Cloud Storage Airflow logging.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
