airflow-commits mailing list archives

From "Kengo Seki (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AIRFLOW-2382) Fix wrong description for delimiter
Date Thu, 26 Apr 2018 16:31:00 GMT
Kengo Seki created AIRFLOW-2382:
-----------------------------------

             Summary: Fix wrong description for delimiter
                 Key: AIRFLOW-2382
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2382
             Project: Apache Airflow
          Issue Type: Bug
          Components: aws, operators
            Reporter: Kengo Seki


The documentation for S3ListOperator says:

{code}
:param delimiter: The delimiter by which you want to filter the objects.
    For e.g to lists the CSV files from in a directory in S3 you would use
    delimiter='.csv'.
{code}

{code}
**Example**:
    The following operator would list all the CSV files from the S3
    ``customers/2018/04/`` key in the ``data`` bucket. ::

        s3_file = S3ListOperator(
            task_id='list_3s_files',
            bucket='data',
            prefix='customers/2018/04/',
            delimiter='.csv',
            aws_conn_id='aws_customers_conn'
        )
{code}

but it actually behaves the opposite way:

{code}
In [1]: from airflow.contrib.operators.s3_list_operator import S3ListOperator

In [2]: S3ListOperator(task_id='t', bucket='bkt0', prefix='', aws_conn_id='s3').execute(None)
[2018-04-26 10:34:27,001] {connectionpool.py:735} INFO - Starting new HTTPS connection (1):
bkt0.s3.amazonaws.com
[2018-04-26 10:34:27,711] {connectionpool.py:735} INFO - Starting new HTTPS connection (1):
bkt0.s3-ap-northeast-1.amazonaws.com
[2018-04-26 10:34:27,801] {connectionpool.py:735} INFO - Starting new HTTPS connection (1):
bkt0.s3.ap-northeast-1.amazonaws.com
Out[2]: ['0.csv', '1.txt', '2.jpg', '3.exe']

In [3]: S3ListOperator(task_id='t', bucket='bkt0', prefix='', aws_conn_id='s3', delimiter='.csv').execute(None)
[2018-04-26 10:34:39,722] {connectionpool.py:735} INFO - Starting new HTTPS connection (1):
bkt0.s3.amazonaws.com
[2018-04-26 10:34:40,483] {connectionpool.py:735} INFO - Starting new HTTPS connection (1):
bkt0.s3-ap-northeast-1.amazonaws.com
[2018-04-26 10:34:40,569] {connectionpool.py:735} INFO - Starting new HTTPS connection (1):
bkt0.s3.ap-northeast-1.amazonaws.com
Out[3]: ['1.txt', '2.jpg', '3.exe']
{code}

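This is because the 'delimiter' parameter represents the path hierarchy (so '/' is
typically used), not the file extension. Keys that contain the delimiter after the prefix
are grouped into common prefixes instead of being returned as keys, which is why '0.csv'
disappears from Out[3]. S3ToGoogleCloudStorageOperator has the same problem.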
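
A minimal sketch of the grouping behaviour, written as a pure-Python emulation so it runs without AWS access. The helper name `list_keys` and its return shape are hypothetical (they are not the Airflow hook or boto3 API); it only assumes the S3 rule that a key containing the delimiter after the prefix is rolled up into a common prefix, and that the operator returns only the remaining keys.

```python
def list_keys(keys, prefix="", delimiter=""):
    """Emulate S3 delimiter grouping (hypothetical helper, not a real API).

    Returns (contents, common_prefixes). Only `contents` corresponds to
    what the operator in this report returned.
    """
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Find the first occurrence of the delimiter after the prefix.
        idx = rest.find(delimiter) if delimiter else -1
        if idx >= 0:
            # The key is rolled up into a common prefix, not listed as a key.
            rolled = prefix + rest[:idx + len(delimiter)]
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
        else:
            contents.append(key)
    return contents, common_prefixes


# Same bucket contents as in the session above.
bucket = ['0.csv', '1.txt', '2.jpg', '3.exe']
no_delimiter = list_keys(bucket)
csv_delimiter = list_keys(bucket, delimiter='.csv')

# Listing one "directory" level with '/' and filtering by extension afterwards.
# These keys are made up for illustration.
nested = [
    'customers/2018/04/a.csv',
    'customers/2018/04/b.txt',
    'customers/2018/04/old/c.csv',
]
level_keys, sub_dirs = list_keys(nested, prefix='customers/2018/04/', delimiter='/')
csv_keys = [k for k in level_keys if k.endswith('.csv')]
```

With delimiter='.csv', '0.csv' ends up in the common prefixes rather than in the returned keys, matching Out[3]. To get only the CSV files, one option is to list with delimiter='/' (or no delimiter) and filter the result by suffix, as the last lines do.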



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
