flink-issues mailing list archives

From tillrohrmann <...@git.apache.org>
Subject [GitHub] flink pull request: [FLINK-1901] [core] Create sample operator for...
Date Fri, 31 Jul 2015 16:25:01 GMT
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/949#issuecomment-126740651
  
    Thanks for your contribution @ChengXiangLi. The code is really well tested and well structured.
Great work :-)
    
    I have only a few minor comments. There is, however, one thing I'm not sure about. With
the current implementation, all parallel tasks of the sampling operator get the same
random generator/seed value, so every parallel task generates the same sequence of random numbers.
I think this can have a negative influence on the sampling. What we could do is use `RichMapPartitionFunction`
instead of `MapPartitionFunction`. With the rich function, we have two options: either we use the
subtask index, given by `getRuntimeContext().getIndexOfThisSubtask()`, to modify the initial
seed, or we create the random number generator in the `open` method, seeded from the local clock
(this method is executed on the TaskManager). Assuming that the clocks are not perfectly
synchronized and that the individual tasks are not instantiated at exactly the same time, the
second option could give us less correlated random number sequences. What do you think?
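
    To make the seed concern concrete, here is a minimal sketch in plain Java. It deliberately
does not use the Flink API: the loop variable `subtask` stands in for the value that
`getRuntimeContext().getIndexOfThisSubtask()` would return, and `SubtaskSeeds`, `seedForSubtask`
and `draw` are hypothetical names made up for this illustration. The mixing constant is only
one possible choice, not something taken from the pull request.

```java
import java.util.Arrays;
import java.util.Random;

final class SubtaskSeeds {

    // Hypothetical helper: derive a per-subtask seed from the shared base seed.
    // The odd 64-bit constant just spreads neighbouring indices far apart.
    static long seedForSubtask(long baseSeed, int subtaskIndex) {
        return baseSeed ^ ((subtaskIndex + 1) * 0x9E3779B97F4A7C15L);
    }

    // Draw n uniform doubles from a generator created with the given seed.
    static double[] draw(long seed, int n) {
        Random rnd = new Random(seed);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rnd.nextDouble();
        }
        return out;
    }

    public static void main(String[] args) {
        long baseSeed = 42L;   // assumed user-supplied seed
        int parallelism = 3;   // assumed number of parallel subtasks

        for (int subtask = 0; subtask < parallelism; subtask++) {
            // Shared seed: every subtask sees the same sequence.
            double[] shared = draw(baseSeed, 3);
            // Seed mixed with the subtask index: sequences differ per subtask.
            double[] perTask = draw(seedForSubtask(baseSeed, subtask), 3);
            System.out.println("subtask " + subtask
                    + " shared=" + Arrays.toString(shared)
                    + " perTask=" + Arrays.toString(perTask));
        }
    }
}
```

    In the operator itself, the same derivation could happen in `open`, where the runtime
context is available. Deriving the seed from the index keeps a run reproducible for a fixed
seed and parallelism, which a clock-based seed would not.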


