beam-commits mailing list archives

From "Kenneth Knowles (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-434) Limit the number of output files a beam-examples execution writes
Date Tue, 12 Jul 2016 20:39:20 GMT

    [ https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373644#comment-15373644 ]

Kenneth Knowles commented on BEAM-434:
--------------------------------------

OK, I'm pretty convinced I was wrong, by the argument that users are going to copy/paste/modify
the example and assume each piece of it is important and should be retained in their own code.

I _do_ think it is very important that users know that the runner controls the number of bundles
and shards, and that if you want a particular number you have to hardcode it. But I
want users to know that this is a special case with real downsides. My thinking had been that
making it explicit in the example would make it clear that there are very few shards only
because we hardcoded the number. But it would also imply that this is something one should do by
default, the opposite of the desired message.

So now I favor a variant of [~dhalperi@google.com]'s option 3, which is an implementation
detail of "the direct runner should - via whatever means - limit the number of output shards
of Write (not just text, but probably most or all) to a simple human readable number".

But I think having a fixed number in the absence of code fixing that number would also set
the wrong expectation. Thus I think it is very important to follow [~frances]'s idea to make
the number variable. I'd suggest a range of 3 to 7. Somehow two shards just doesn't seem "sharded"
enough for me. Using the usual override approach, as proposed, is probably the easiest implementation
technique. That last point is best decided by [~tgroh].
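
To make the suggestion concrete, here is a minimal sketch of the selection logic only, in plain Java with no Beam dependency. The class and method names are hypothetical, and treating a non-positive value as "the user did not fix a shard count" is an assumption for illustration; how the direct runner would actually apply this through its overrides is not shown.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch, not direct runner code: pick a small, variable
// shard count when the user has not fixed one.
final class ShardCountSketch {
    // Range suggested in the comment above: 3 to 7 shards, inclusive.
    static final int MIN_SHARDS = 3;
    static final int MAX_SHARDS = 7;

    // Returns a random shard count in [MIN_SHARDS, MAX_SHARDS].
    static int pickShardCount() {
        // nextInt's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextInt(MIN_SHARDS, MAX_SHARDS + 1);
    }

    // Assumption: a value <= 0 means the user left the count unspecified,
    // so the runner chooses. A positive value is an explicit user choice
    // and is returned unchanged.
    static int effectiveShardCount(int userRequested) {
        return userRequested > 0 ? userRequested : pickShardCount();
    }

    public static void main(String[] args) {
        System.out.println("runner-chosen shards: " + effectiveShardCount(0));
        System.out.println("user-fixed shards: " + effectiveShardCount(1));
    }
}
```

Because the runner-chosen count varies from run to run, nobody can come to rely on a particular number of output files without writing that number into their own pipeline.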

> Limit the number of output files a beam-examples execution writes
> -----------------------------------------------------------------
>
>                 Key: BEAM-434
>                 URL: https://issues.apache.org/jira/browse/BEAM-434
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restrictions on the number
> of shards, it might generate many output files (depending on your input). For WordCount, for
> example, you'll get as many output files as unique words in your input.
> Since I think examples are expected to execute in a friendly manner, to "see" what they
> do rather than to optimize for performance, I suggest using `withoutSharding()` when
> writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
