beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-3247) Sample.any memory constraint
Date Mon, 27 Nov 2017 21:43:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267617#comment-16267617 ]

ASF GitHub Bot commented on BEAM-3247:
--------------------------------------

jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
URL: https://github.com/apache/beam/pull/4175#discussion_r153332162
 
 

 ##########
 File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java
 ##########
 @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) {
   }
 
   /**
-   * A {@link DoFn} that returns up to limit elements from the side input PCollection.
+   * A {@link DoFn} that outputs up to limit elements.
    */
-  private static class SampleAnyDoFn<T> extends DoFn<Void, T> {
-    long limit;
-    final PCollectionView<Iterable<T>> iterableView;
+  private static class SampleAnyDoFn<T> extends DoFn<T, T> {
 
 Review comment:
   Not sure why you say that: views are also per-window, so I think it shouldn't matter whether
the collection is bounded or unbounded. (Though, of course, it will behave weirdly in case
of multiple trigger firings; see also https://issues.apache.org/jira/browse/BEAM-2305, maybe
similar issues apply here too.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Sample.any memory constraint
> ----------------------------
>
>                 Key: BEAM-3247
>                 URL: https://issues.apache.org/jira/browse/BEAM-3247
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>    Affects Versions: 2.1.0
>            Reporter: Neville Li
>            Assignee: Neville Li
>            Priority: Minor
>
> Right now {{Sample.any}} converts the collection to an iterable view and takes the first n
> elements in a side input. This may require materializing the entire collection to disk and is
> potentially inefficient.
> https://github.com/apache/beam/blob/v2.1.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java#L74
> It can be fixed by applying a truncating `DoFn` first, then a combine into `List<T>`
> that limits the list size, and finally flattening the list.
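
The three-stage fix proposed in the description above can be sketched roughly as follows. This is a plain-Java illustration only, with no Beam dependencies; it is not the code from pull request #4175. The class name `SampleAnySketch`, its method names, and the modelling of bundles as in-memory lists are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; bundles are modelled as in-memory lists, not PCollections. */
public class SampleAnySketch {

  /** Stage 1 (truncating DoFn): each bundle independently keeps at most `limit` elements. */
  static <T> List<T> truncate(List<T> bundle, int limit) {
    return new ArrayList<>(bundle.subList(0, Math.min(limit, bundle.size())));
  }

  /** Stage 2 (combine): merge truncated bundles into one list that never exceeds `limit`. */
  static <T> List<T> combine(List<List<T>> truncatedBundles, int limit) {
    List<T> merged = new ArrayList<>();
    for (List<T> bundle : truncatedBundles) {
      for (T element : bundle) {
        if (merged.size() >= limit) {
          return merged; // stop early: the size bound is reached
        }
        merged.add(element);
      }
    }
    return merged;
  }

  /** Runs stages 1 and 2; stage 3 (flatten) corresponds to emitting the list's elements. */
  static <T> List<T> sampleAny(List<List<T>> bundles, int limit) {
    List<List<T>> truncated = new ArrayList<>();
    for (List<T> bundle : bundles) {
      truncated.add(truncate(bundle, limit));
    }
    return combine(truncated, limit);
  }

  public static void main(String[] args) {
    List<List<Integer>> bundles =
        Arrays.asList(Arrays.asList(1, 2, 3, 4), Arrays.asList(5, 6), Arrays.asList(7));
    System.out.println(sampleAny(bundles, 3)); // prints [1, 2, 3]
  }
}
```

The point of truncating first is that no stage ever holds more than `limit` elements per bundle, so nothing requires materializing the whole collection.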



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
