flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sachin Goel <sachingoel0...@gmail.com>
Subject Re: Distribute DataSet to subset of nodes
Date Sun, 13 Sep 2015 19:42:54 GMT
Hi Stefan
Just a clarification : The output corresponding to an element based on the
whole data will be a union of the outputs based on the two halves. Is this
what you're trying to achieve? [It appears  like that since every flatMap
task will independently produce outputs.]

In that case, one solution could be to simply have two flatMap operations
based on parts of the *broadcast* data set, and take a union.

On Sep 13, 2015 7:04 PM, "Stefan Bunk" <stefan.bunk@googlemail.com> wrote:

> Hi!
> Following problem: I have 10 nodes on which I want to execute a flatMap
> operator on a DataSet. In the open method of the operator, some data is
> read from disk and preprocessed, which is necessary for the operator.
> Problem is, the data does not fit in memory on one node, however, half of
> the data does.
> So in five out of ten nodes, I stored one half of the data to be read in
> the open method, and the other half on the other five nodes.
> Now my question: How can I distribute my DataSet, so that each element is
> sent once to a node with the first half of my data and once to a node with
> the other half?
> I looked at implementing a custom partitioner, however my problems were:
> (i) I have no mapping from the number I am supposed to return to the nodes
> to the data. How do I know, that index 5 contains one half of the data, and
> index 6 the other half?
> (ii) I do not know the current index. Obviously, I want to send my DataSet
> element only once over the network.
> Best,
> Stefan

View raw message