flink-user mailing list archives

From Stefan Bunk <stefan.b...@googlemail.com>
Subject Re: Distribute DataSet to subset of nodes
Date Tue, 15 Sep 2015 21:15:24 GMT
Hi Fabian,

I think we might have a misunderstanding here. I have already copied the
first file to five nodes, and the second file to five other nodes, outside
of Flink. In the open() method of the operator, I just read that file via
normal Java means. I do not see why this is tricky or how HDFS would help
here.
Then, I have a normal Flink DataSet, which I want to run through the
operator (using the previously read data in the flatMap implementation). As
I run the program several times, I do not want to broadcast the data every
time, but rather just copy it to the nodes once and be done with it.
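
For illustration, the pattern described above could be sketched roughly as
follows. This is a standalone stand-in rather than actual Flink code: the
class name LocalFileFilter, the one-key-per-line file format, and the
filtering logic are assumptions. In a real job the two methods would
correspond to open(Configuration) and flatMap(value, Collector) of a
RichFlatMapFunction.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a RichFlatMapFunction<String, String>.
class LocalFileFilter {
    // Path to the pre-copied file; it must exist on every node running this task.
    private final Path localFile;
    // Loaded once in open() and reused for every record in flatMap().
    private Set<String> keys;

    LocalFileFilter(Path localFile) {
        this.localFile = localFile;
    }

    // Counterpart of open(Configuration): called once per parallel task instance.
    void open() throws IOException {
        keys = new HashSet<>(Files.readAllLines(localFile));
    }

    // Counterpart of flatMap(value, Collector): emits the value if it is a known key.
    void flatMap(String value, Consumer<String> out) {
        if (keys.contains(value)) {
            out.accept(value);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the file that was copied to the node outside of Flink.
        Path file = Files.createTempFile("small-data", ".txt");
        Files.write(file, Arrays.asList("a", "b"));

        LocalFileFilter filter = new LocalFileFilter(file);
        filter.open();

        List<String> result = new ArrayList<>();
        for (String record : Arrays.asList("a", "x", "b")) {
            filter.flatMap(record, result::add);
        }
        System.out.println(result);
        Files.delete(file);
    }
}
```

The sketch only works if every task instance finds the file at the given
local path, which is exactly why the choice of nodes matters.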

Can you answer my question from above? If the setParallelism method works
and selects five nodes for the first flatMap and five _other_ nodes for the
second flatMap, then that would be fine for me if there is no other easy
solution.

Thanks for your help!
Best
Stefan


On 14 September 2015 at 22:28, Fabian Hueske <fhueske@gmail.com> wrote:

> Hi Stefan,
>
> forcing the scheduling of tasks to certain nodes and reading files from
> the local file system in a multi-node setup is actually quite tricky and
> requires a bit of understanding of the internals.
> It is possible and I can help you with that, but I would recommend using a
> shared filesystem such as HDFS if that is possible.
>
> Best, Fabian
>
> 2015-09-14 19:16 GMT+02:00 Stefan Bunk <stefan.bunk@googlemail.com>:
>
>> Hi,
>>
>> actually, I am distributing my data before the program starts, without
>> using broadcast sets.
>>
>> However, the approach should still work, under one condition:
>>
>>> DataSet mapped1 =
>>> data.flatMap(yourMap).withBroadcastSet(smallData1,"data").setParallelism(5);
>>> DataSet mapped2 =
>>> data.flatMap(yourMap).withBroadcastSet(smallData2,"data").setParallelism(5);
>>>
>> Is it guaranteed that this selects a disjoint set of nodes, i.e. five
>> nodes for mapped1 and five other nodes for mapped2?
>>
>> Is there any way of selecting the five nodes concretely? Currently, I
>> have stored the first half of the data on nodes 1-5 and the second half on
>> nodes 6-10. With this approach, I guess, nodes are selected randomly so I
>> would have to copy both halves to all of the nodes.
>>
>> Best,
>> Stefan
>>
>>
>
