beam-commits mailing list archives

From "Frances Perry (JIRA)" <>
Subject [jira] [Commented] (BEAM-434) When examples write output to file it creates many output files instead of one
Date Tue, 12 Jul 2016 18:17:20 GMT


Frances Perry commented on BEAM-434:

Leaving the sharding unconstrained, so that the runner can choose a bundling that performs
well, is pretty key to the model. So I think it's pretty important to introduce users
to this idea in the examples.

The direct runner should be careful to create a small (but variable) number of files to show
that the default is *not* one or a fixed number. I'd prefer we fix this in a way that is *not*
specific to TextIO.Write -- the same thing will happen in many other places.

Can we wait for Thomas to return from vacation tomorrow and get his opinion?

> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>                 Key: BEAM-434
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
> When using `TextIO.Write.to("/path/to/output")` without any restriction on the number
> of shards, it might generate many output files (depending on your input). For WordCount, for
> example, you'll get as many output files as there are unique words in your input.
> Since I think examples are expected to run in a friendly manner, so that users can "see" what
> they do, rather than to optimize for performance in some way, I suggest using `withoutSharding()`
> when writing the example output to a file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample
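
To make the file-count behaviour discussed above concrete, here is a minimal plain-Java sketch. It does not use the Beam SDK: the class `ShardingSketch`, its methods, and the hash-based record assignment are hypothetical stand-ins for whatever a runner actually does, and the `-SSSSS-of-NNNNN` file naming is assumed for illustration. It only shows that the number of output files follows the number of shards, and that pinning the shard count to one yields a single file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration only; nothing here is Beam API. */
final class ShardingSketch {

    /** Builds one file name per shard, in an assumed "-SSSSS-of-NNNNN" style. */
    static List<String> shardFileNames(String prefix, int numShards) {
        List<String> names = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            names.add(String.format("%s-%05d-of-%05d", prefix, shard, numShards));
        }
        return names;
    }

    /**
     * Simulates a sharded write: each record lands in exactly one file.
     * Hashing is an assumption made for this sketch; a real runner is free
     * to pick shard count and assignment however it likes.
     */
    static Map<String, List<String>> writeSharded(
            String prefix, List<String> records, int numShards) {
        List<String> names = shardFileNames(prefix, numShards);
        Map<String, List<String>> files = new TreeMap<>();
        for (String name : names) {
            files.put(name, new ArrayList<>());
        }
        for (String record : records) {
            int shard = Math.floorMod(record.hashCode(), numShards);
            files.get(names.get(shard)).add(record);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> counts = Arrays.asList("the: 3", "quick: 1", "brown: 1", "fox: 2");
        // Shard count left to the "runner" (3 here): three output files.
        System.out.println(writeSharded("/path/to/output", counts, 3).keySet());
        // Shard count pinned to 1: a single output file holding every record.
        System.out.println(writeSharded("/path/to/output", counts, 1).keySet());
    }
}
```

The trade-off raised in the comment is visible even in this toy: fixing the shard count makes the output easy to inspect, but it takes the choice away from the runner.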

This message was sent by Atlassian JIRA