beam-commits mailing list archives

From Guillermo Rodríguez Cano (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)
Date Thu, 27 Jul 2017 10:35:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103045#comment-16103045 ]

Guillermo Rodríguez Cano edited comment on BEAM-2490 at 7/27/17 10:34 AM:
--------------------------------------------------------------------------

Hello again, and a quick update,

* OS: Mac OS X Sierra 10.12.6
* Apache Beam: 2.2.0dev (aka HEAD at master branch as of 8 hours ago...)
* Python: 2.7.13 
* Runner: DirectRunner (so far given the "results")

I ran essentially the same experiment as at the end of June (described here: https://issues.apache.org/jira/browse/BEAM-2490?focusedCommentId=16063224),
this time with the HEAD of the master branch of the Apache Beam repository,
and unfortunately the outcome is the same so far: no results.

My laptop ran this all night, and after 8 hours the 'job' (8 gzipped JSON files of 200-300 MB
each, compressed) has still not finished and has produced no output. I also ran the same
experiment with only one file in the subdirectory where I use the glob operator. It is still
running, and although I got some output, I don't think it is acceptable that it takes more than
3 hours to process just one file.
Since these tests haven't finished, I couldn't test on Dataflow yet. Besides, I still haven't
figured out how to package the HEAD, or a tag for that matter, of Beam for Dataflow. No matter
what I try, I always get something along these lines: {{Could not find a version that satisfies
the requirement apache-beam==2.1.0 (from versions: 0.6.0, 2.0.0)}}. Suggestions?
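
On the packaging question, one approach that may work is to build a source tarball from the checkout (for example with `python setup.py sdist` in `sdks/python`) and hand it to the pipeline through the `--sdk_location` option, so that the workers do not try to fetch a released version from PyPI. The following is only a sketch under those assumptions: the helper name `dataflow_args`, the `dist` directory, the tarball naming pattern, and the project and bucket values are all hypothetical.

```python
import glob
import os


def dataflow_args(dist_dir, project, bucket):
    """Build pipeline arguments that point Dataflow at a locally built SDK.

    Assumes dist_dir contains a tarball produced by `python setup.py sdist`;
    the file name pattern below is an assumption about how it is named.
    """
    tarballs = sorted(glob.glob(os.path.join(dist_dir, "apache[-_]beam-*.tar.gz")))
    if not tarballs:
        raise ValueError("no SDK tarball found in %s" % dist_dir)
    return [
        "--runner=DataflowRunner",
        "--project=%s" % project,
        "--temp_location=gs://%s/tmp" % bucket,
        # Lexicographically last match; assumed to be the build you want.
        "--sdk_location=%s" % tarballs[-1],
    ]
```

The returned list is meant to be passed on as the pipeline's command-line arguments, for instance via `PipelineOptions`.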

So unfortunately I can't confirm that this issue is really resolved. I don't think it
is related to https://issues.apache.org/jira/browse/BEAM-2497; it looks more related to https://issues.apache.org/jira/browse/BEAM-2531.
I suspect all files are read (hence the glob operator likely works), but because of
the poor decompression performance we can't know that for sure.
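
One way to separate the two suspicions (glob matching versus decompression speed) would be to count the lines of each matched file outside of Beam, using only the standard library, and compare that with what the pipeline emits. The sketch below assumes the files are available locally; the function name `count_gzip_lines` and the pattern in the usage section are placeholders, not part of Beam.

```python
import glob
import gzip
import time


def count_gzip_lines(pattern):
    """Return {path: (line_count, seconds)} for every file matching pattern."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        start = time.time()
        lines = 0
        # Stream the file so memory use stays flat even for large inputs.
        with gzip.open(path, "rb") as handle:
            for _ in handle:
                lines += 1
        results[path] = (lines, time.time() - start)
    return results


if __name__ == "__main__":
    # Hypothetical local pattern; replace with the real subdirectory.
    for path, (lines, seconds) in count_gzip_lines("data/*.json.gz").items():
        print("%s: %d lines in %.1f s" % (path, lines, seconds))
```

If the per-file counts add up to more lines than the pipeline output contains, the problem is in reading; if this plain loop finishes in minutes, the slowness is more likely in how the SDK decompresses.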



> ReadFromText function is not taking all data with glob operator (*) 
> --------------------------------------------------------------------
>
>                 Key: BEAM-2490
>                 URL: https://issues.apache.org/jira/browse/BEAM-2490
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 2.0.0
>         Environment: Usage with Google Cloud Platform: Dataflow runner
>            Reporter: Olivier NGUYEN QUOC
>            Assignee: Chamikara Jayalath
>             Fix For: Not applicable
>
>
> I run a very simple pipeline:
> * Read my files from Google Cloud Storage
> * Split on the '\n' character
> * Write the result to Google Cloud Storage
> I have 8 files that match the pattern:
> * my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB)
> * my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB)
> * my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB)
> * my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB)
> * my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB)
> * my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB)
> * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB)
> * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB)
> This code should take them all:
> {code:python}
> beam.io.ReadFromText(
>       "gs://XXXX_folder1/my_files_20160901*.csv.gz",
>       skip_header_lines=1,
>       compression_type=beam.io.filesystem.CompressionTypes.GZIP
>       )
> {code}
> It runs without errors, but the pipeline output is only a single 288.62 MB file (instead of
> a 1.5 GB file).
> The whole pipeline code:
> {code:python}
> data = (
>     p
>     | 'ReadMyFiles' >> beam.io.ReadFromText(
>         "gs://XXXX_folder1/my_files_20160901*.csv.gz",
>         skip_header_lines=1,
>         compression_type=beam.io.filesystem.CompressionTypes.GZIP)
>     | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n'))
> )
> output = (
>     data
>     | "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv', num_shards=1)
> )
> {code}
> Dataflow tells me that the estimated size of the output after the ReadFromText step
> is only 602.29 MB, which corresponds neither to the size of any single input file nor to the
> total size of the files matching the pattern.
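
The sizes quoted in the report can be cross-checked with a few lines of Python. This is just arithmetic on the numbers listed above; the names `SIZES_MB` and `matches_any_file` are made up for the illustration.

```python
# Compressed sizes in MB, copied from the file list in the report.
SIZES_MB = [229.25, 184.1, 171.73, 151.34, 129.69, 151.7, 346.46, 222.57]

total_mb = sum(SIZES_MB)


def matches_any_file(size_mb, tolerance=0.01):
    """True if size_mb equals one of the listed file sizes within tolerance."""
    return any(abs(size_mb - s) <= tolerance for s in SIZES_MB)


print("total: %.2f MB" % total_mb)
print("288.62 MB matches a file: %s" % matches_any_file(288.62))
print("602.29 MB matches a file: %s" % matches_any_file(602.29))
```

The listed files add up to about 1587 MB, in line with the 1.5 GB mentioned in the report, and neither 288.62 MB nor 602.29 MB equals the size of any single listed file.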



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
