beam-commits mailing list archives

From "Vilhelm von Ehrenheim (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BEAM-2338) Directory filepattern wildcard broken in python SDK
Date Mon, 22 May 2017 12:21:04 GMT

     [ https://issues.apache.org/jira/browse/BEAM-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vilhelm von Ehrenheim updated BEAM-2338:
----------------------------------------
    Description: 
Validation of file patterns containing a wildcard (`*`) in directories does not always work.

Some kinds of patterns generate an error from here during validation:
https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168

I've tried a few different FileSystems match commands, and the results confuse me a bit.

Full path works:
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
{noformat}

A glob star on the directory does not:
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[]
{noformat}
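
As a point of comparison, here is a minimal check using Python's standard `fnmatch` module. Treating `fnmatch` as a stand-in for the intended glob semantics is an assumption made for illustration; it is not necessarily what the Beam SDK does internally. Under that assumption, the directory-star pattern above would be expected to match the file:

```python
from fnmatch import fnmatchcase

# Path and pattern copied from the report above.
path = ("gs://gcp-public-data-landsat/LC08/PRE/044/034/"
        "LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF")
pattern = ("gs://gcp-public-data-landsat/LC08/PRE/044/034/"
           "*/LC80440342016259LGN00_B1.TIF")

# In fnmatch, `*` matches any run of characters, including `/`,
# so a star in a directory position matches the directory name.
expected_match = fnmatchcase(path, pattern)
print(expected_match)  # True
```

So the empty result above looks like a problem in how candidates are gathered rather than in the pattern itself.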

Adding a star at the file level too, so that we search for any TIF file, works (although we
match a different file, but that is fine):
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
{noformat}

OK, here comes the even stranger case.
Looking for the same file we just found, but with a star on the directory, we find it!
{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
{noformat}

Also, looking at the first case again, we do get a match if the star is placed late enough in
the pattern to make the directory unique:

{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
{noformat}

but not if the star is placed earlier in the directory name:

{noformat}
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[]
{noformat}

My guess is that some folders are somehow dropped from the list of matched directories, which
is a bit concerning.
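
To make that guess concrete, here is a minimal sketch of one hypothetical way directories could appear to be dropped: a matcher that lists objects by the fixed prefix before the first wildcard but only examines the first page of that listing. This is not taken from the Beam source and the actual cause is not confirmed here. The function names, the page size, and the object names (other than the two scene names from the examples above) are made up for illustration.

```python
from fnmatch import fnmatchcase

BASE = "LC08/PRE/044/034/"

# Hypothetical bucket contents: only the 2013 and 2016 scene names come from
# the report; the other scenes and the band list are invented.
SCENES = ["LC80440342013106LGN01", "LC80440342014092LGN00",
          "LC80440342015143LGN00", "LC80440342016259LGN00"]
OBJECTS = sorted(f"{BASE}{s}/{s}_B{b}.TIF" for s in SCENES for b in (1, 2, 3))

PAGE_SIZE = 4  # hypothetical page size, kept tiny so the effect is visible


def fixed_prefix(pattern):
    """Return the part of the pattern before the first wildcard character."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            return pattern[:i]
    return pattern


def buggy_glob(pattern, limit=None):
    """Assumed failure mode: match only the first page of the prefix listing."""
    prefix = fixed_prefix(pattern)
    listing = [name for name in OBJECTS if name.startswith(prefix)]
    first_page = listing[:PAGE_SIZE]
    return [n for n in first_page if fnmatchcase(n, pattern)][:limit]


def full_glob(pattern, limit=None):
    """Reference behaviour: match against the complete prefix listing."""
    prefix = fixed_prefix(pattern)
    listing = [name for name in OBJECTS if name.startswith(prefix)]
    return [n for n in listing if fnmatchcase(n, pattern)][:limit]


target = BASE + "LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF"
wide = BASE + "*/LC80440342016259LGN00_B1.TIF"
narrow = BASE + "LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF"

print(buggy_glob(wide, limit=1))    # broad prefix: target is beyond page one
print(buggy_glob(narrow, limit=1))  # narrow prefix: target is on page one
print(full_glob(wide, limit=1))     # scanning every page finds the target
```

In this toy model the broad prefixes (`.../034/` and `.../034/LC8044034201`) miss the 2016 file, while the full path and the narrow prefix find it, and the 2013 file is found because it sorts first. That is the same pattern of results as in the examples above, but it only shows that such a mechanism would be consistent with the symptoms, not that it is the cause.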

  was:
Validation of file patterns containing a wildcard (`*`) in directories does not work if the
filename is specified fully.

Some kinds of patterns generate an error from here:
https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168

My guess is that some folders are dropped from the list of matched directories or something.



> Directory filepattern wildcard broken in python SDK
> ---------------------------------------------------
>
>                 Key: BEAM-2338
>                 URL: https://issues.apache.org/jira/browse/BEAM-2338
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>    Affects Versions: 2.0.0
>            Reporter: Vilhelm von Ehrenheim
>            Assignee: Frances Perry
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
