flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ravinder Kaur <neetu0...@gmail.com>
Subject Re: Sink Parallelism
Date Tue, 19 Apr 2016 15:04:44 GMT
Hello Chesnay,

Thank you for the reply. According to this
if I set -p = 2 then sink will also have 2 Sink subtaks and the final
result will end up in 2 stream partitions or say 2 chunks and combining
them will be the global result of the WordCount of input Dataset. And when
I say I have 2 taskmanagers with one taskslot each these 2 chunks are saved
on 2 machines in the end.

I have attached an image of my understanding by working out an example
WordCount with -p = 4. ​​Could you also explain how the communication among
taskmanagers happen while redistributing streams and how tuples with same
key end up in one taskmanager? Basically the implementation of groupBy on
multiple taskmanagers.

Ravinder Kaur

On Tue, Apr 19, 2016 at 4:01 PM, Chesnay Schepler <chesnay@apache.org>

> The picture you reference does not really show how dataflows are connected.
> For a better picture, visit this link:
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/concepts/concepts.html#parallel-dataflows
> Let me know if this doesn't answer your question.
> On 19.04.2016 14:22, Ravinder Kaur wrote:
>> Hello All,
>> Considering the following streaming dataflow of the example WordCount, I
>> want to understand how Sink is parallelised.
>> Source --> flatMap --> groupBy(), sum() --> Sink
>> If I set the paralellism at runtime using -p, as shown here
>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
>> I want to understand how Sink is done parallelly and how the global
>> result is distributed.
>> As far as I understood groupBy(0) is applied to the tuples<String,
>> Integer> emitted from the flatMap funtion, which groupes by the String
>> value and sum(1) aggregates the Integer value getting the count.
>> That means streams will be redistributed so that tuples grouped by the
>> same String value be sent to one taskmanager and the Sink step should be
>> writing the results to the specified path. When Sink step is also
>> parallelised then each taskmanager should emit a chunk. These chunks put
>> together must be the global result.
>> But when I see the pictorial representation it seems that each task slot
>> will run a copy of the streaming dataflow and will be performing the
>> operations on the chunk of data it gets and outputs the result. But if this
>> is the case the global result would have duplicates of strings and would be
>> wrong.
>> Could one of you kindly clarify what exactly happens?
>> Kind Regards,
>> Ravinder Kaur

View raw message