beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ismaël Mejía (JIRA) <>
Subject [jira] [Commented] (BEAM-434) When examples write output to file it creates many output files instead of one
Date Tue, 12 Jul 2016 14:53:20 GMT


Ismaël Mejía commented on BEAM-434:

The examples should be simpler, and in particular the 'hello world = wordcount' one, so the
less options the better.

If the user wants to define the sharding he could do so, once he progresses on Beam he will
know about such stuff, I think the important decision is what should be the default behavior
if numSharding is not defined for a Text.Write in the DirectRunner (or other runners) ?

Of course we can go the way of restricting this only in DirectRunner (#3) because distributed/parallel
writes are less of an issue there. However from a user point of view it is really confusing
that other runners also implement this differently, so I don't know if it makes sense to standarize
this, e.g. Spark produces more files if the output is big (for the reasons Amit explained),
but Flink produces only one (at least in my tests).

> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>                 Key: BEAM-434
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
> When using `"/path/to/output")` without any restrictions on the number
of shards, it might generate many output files (depending on your input), for WordCount for
example, you'll get as many output files as unique words in your input.
> Since I think examples are expected to execute in a friendly manner to "see" what it
does and not optimize for performance in some way, I suggest to use `withoutSharding()` when
writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample

This message was sent by Atlassian JIRA

View raw message