beam-commits mailing list archives

From "Amit Sela (JIRA)" <>
Subject [jira] [Commented] (BEAM-434) When examples write output to file it creates many output files instead of one
Date Tue, 12 Jul 2016 06:11:11 GMT


Amit Sela commented on BEAM-434:

I sort of prefer 2, but with the user able to pass the numShards configuration (which may need
a better name).
As I mentioned in the PR, if we want the examples to give a simple result while
keeping in the user's mind the fact that multiple shards are a thing to consider, we could
add a --numShards option to the examples code with a default of 1 (or 3).
If we want users to know about multiple output shards, why should we keep the examples
"pure"?

How about adding an option named "--numOutputShards" with a default value of 1 (or 3, I could live
with 3 :) ) and documenting it in the examples README? That gives a better experience in terms
of "seeing" the output while keeping multiple shards "on the table". As a bonus, the
Travis CI tests could still run with as many shards as we want (I wanted the examples to
be easy enough, but I definitely didn't want that for Travis!).
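
To make the effect of such an option concrete, here is a minimal plain-Java sketch of what a user would see on disk for different values. It is not Beam code: the class and method names are hypothetical, and the "-SSSSS-of-NNNNN" file suffix is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; none of these names are Beam APIs.
class ShardOptionSketch {
    static final String FLAG = "--numOutputShards=";

    // Returns the value of --numOutputShards=N, or 1 when the flag is absent.
    static int parseNumOutputShards(String[] args) {
        int numShards = 1; // default proposed in the comment above
        for (String arg : args) {
            if (arg.startsWith(FLAG)) {
                numShards = Integer.parseInt(arg.substring(FLAG.length()));
            }
        }
        if (numShards < 1) {
            throw new IllegalArgumentException("numOutputShards must be >= 1");
        }
        return numShards;
    }

    // Lists the file names a run would leave behind, assuming a
    // "-SSSSS-of-NNNNN" suffix appended to the output prefix.
    static List<String> outputFiles(String prefix, int numShards) {
        List<String> names = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            names.add(String.format(Locale.ROOT, "%s-%05d-of-%05d", prefix, shard, numShards));
        }
        return names;
    }

    public static void main(String[] args) {
        int numShards = parseNumOutputShards(args);
        for (String name : outputFiles("/path/to/output", numShards)) {
            System.out.println(name);
        }
    }
}
```

Run without arguments, this prints a single name, /path/to/output-00000-of-00001, so the shard suffix stays visible even in the simple case. Passing --numOutputShards=3 prints three names, which is the kind of setting the CI tests could keep using.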


> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>                 Key: BEAM-434
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
> When using `"/path/to/output")` without any restrictions on the number
> of shards, it might generate many output files (depending on your input). For WordCount, for
> example, you'll get as many output files as unique words in your input.
> Since I think examples are expected to execute in a friendly manner, to let you "see" what they
> do rather than to optimize for performance, I suggest using `withoutSharding()` when
> writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample
