flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chesnay Schepler <ches...@apache.org>
Subject Re: Sink Parallelism
Date Tue, 19 Apr 2016 14:01:48 GMT
The picture you reference does not really show how dataflows are connected.
For a better picture, visit this link: 

Let me know if this doesn't answer your question.

On 19.04.2016 14:22, Ravinder Kaur wrote:
> Hello All,
> Considering the following streaming dataflow of the example WordCount, 
> I want to understand how Sink is parallelised.
> Source --> flatMap --> groupBy(), sum() --> Sink
> If I set the paralellism at runtime using -p, as shown here 
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
> I want to understand how Sink is done parallelly and how the global 
> result is distributed.
> As far as I understood groupBy(0) is applied to the tuples<String, 
> Integer> emitted from the flatMap funtion, which groupes by the String 
> value and sum(1) aggregates the Integer value getting the count.
> That means streams will be redistributed so that tuples grouped by the 
> same String value be sent to one taskmanager and the Sink step should 
> be writing the results to the specified path. When Sink step is also 
> parallelised then each taskmanager should emit a chunk. These chunks 
> put together must be the global result.
> But when I see the pictorial representation it seems that each task 
> slot will run a copy of the streaming dataflow and will be performing 
> the operations on the chunk of data it gets and outputs the result. 
> But if this is the case the global result would have duplicates of 
> strings and would be wrong.
> Could one of you kindly clarify what exactly happens?
> Kind Regards,
> Ravinder Kaur

View raw message