airflow-commits mailing list archives

From "Kengo Seki (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (AIRFLOW-2382) Fix wrong description for delimiter
Date Thu, 26 Apr 2018 17:36:00 GMT

     [ https://issues.apache.org/jira/browse/AIRFLOW-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kengo Seki reassigned AIRFLOW-2382:
-----------------------------------

    Assignee: Kengo Seki

> Fix wrong description for delimiter
> -----------------------------------
>
>                 Key: AIRFLOW-2382
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2382
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: aws, operators
>            Reporter: Kengo Seki
>            Assignee: Kengo Seki
>            Priority: Major
>
> The document for S3ListOperator says:
> {code}
> :param delimiter: The delimiter by which you want to filter the objects.
>     For e.g to lists the CSV files from in a directory in S3 you would use
>     delimiter='.csv'.
> {code}
> {code}
> **Example**:
>     The following operator would list all the CSV files from the S3
>     ``customers/2018/04/`` key in the ``data`` bucket. ::
>         s3_file = S3ListOperator(
>             task_id='list_3s_files',
>             bucket='data',
>             prefix='customers/2018/04/',
>             delimiter='.csv',
>             aws_conn_id='aws_customers_conn'
>         )
> {code}
> but the operator actually does the opposite, excluding keys that contain the delimiter:
> {code}
> In [1]: from airflow.contrib.operators.s3_list_operator import S3ListOperator
> In [2]: S3ListOperator(task_id='t', bucket='bkt0', prefix='', aws_conn_id='s3').execute(None)
> [2018-04-26 10:34:27,001] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.amazonaws.com
> [2018-04-26 10:34:27,711] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3-ap-northeast-1.amazonaws.com
> [2018-04-26 10:34:27,801] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.ap-northeast-1.amazonaws.com
> Out[2]: ['0.csv', '1.txt', '2.jpg', '3.exe']
> In [3]: S3ListOperator(task_id='t', bucket='bkt0', prefix='', aws_conn_id='s3', delimiter='.csv').execute(None)
> [2018-04-26 10:34:39,722] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.amazonaws.com
> [2018-04-26 10:34:40,483] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3-ap-northeast-1.amazonaws.com
> [2018-04-26 10:34:40,569] {connectionpool.py:735} INFO - Starting new HTTPS connection (1): bkt0.s3.ap-northeast-1.amazonaws.com
> Out[3]: ['1.txt', '2.jpg', '3.exe']
> {code}
> This is because the 'delimiter' parameter represents the path hierarchy (so '/' is
> typically used), not a file extension. S3ToGoogleCloudStorageOperator has the same problem.
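
The behavior reported above can be reproduced without an AWS connection. The sketch below is only a rough pure-Python approximation of how an S3 listing treats a delimiter: any key that contains the delimiter after the prefix is rolled up into a "common prefix" instead of being returned as an object key. The helper name list_keys is hypothetical and is not part of Airflow or boto3. The last line shows one possible way to select CSV files, by filtering on the key suffix on the client side.

```python
def list_keys(keys, prefix="", delimiter=""):
    """Approximate an S3 listing: return (object keys, common prefixes).

    Hypothetical helper for illustration only; not an Airflow or boto3 API.
    """
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # Position of the first delimiter after the prefix, or -1 if none.
        idx = rest.find(delimiter) if delimiter else -1
        if idx == -1:
            # No delimiter in the remainder: the key is listed as an object.
            contents.append(key)
        else:
            # Otherwise the key is rolled up into a common prefix that ends
            # with the first occurrence of the delimiter.
            rolled = prefix + rest[:idx + len(delimiter)]
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
    return contents, common_prefixes


# Same bucket contents as in the report above.
keys = ["0.csv", "1.txt", "2.jpg", "3.exe"]

no_delim = list_keys(keys)
with_delim = list_keys(keys, delimiter=".csv")

# Typical hierarchical use: '/' separates "directories".
nested = list_keys(["a/b.csv", "a/c/d.csv", "e.txt"], prefix="a/", delimiter="/")

# One way to pick files by extension: filter the key suffix client-side.
csv_only = [k for k in keys if k.endswith(".csv")]
```

With delimiter='.csv', the key '0.csv' lands in the common prefixes rather than in the object keys, which matches Out[3] in the report.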



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
