flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karthik Deivasigamani <karthi...@gmail.com>
Subject stream partitioning to avoid network overhead
Date Fri, 11 Aug 2017 02:54:48 GMT
Hi,

   I have a use case where we read messages from a Kafka topic and invoke a
webservice. The web-service call can take a take couple of seconds and then
gives us back on avg 800KB of data. This data is set to another operator
which does the parsing and then it gets sent to sink which saves the
processed data in a NoSQL db. The app looks like this :

[image: Inline image 1]
Since my payload from the web service is large a lot of data is transferred
over the network and this is becoming a bottle neck.

Lets say *I have 6 slots per node and I would like to have 1 slot for
source, 3 slots for web service calls, 2 for parser and 1 for my sink*.
This way all the processing can happen locally and there is no network
overhead. I have tried *stream.forward() *but it requires that the down
stream operator has the same number of parallelism as the one emitting
data. Next I tried *stream.rescale()* and that does not schedule the task
as I would expect it given the parallelism's on the operators are all
multiple of each other (my flink cluster has enough empty slots and
capacity).


[image: Inline image 2]


Is there a way to schedule my task's in a fashion where there is no data
transfer over the network. I was able to do this in apache storm by using
localOrShuffle grouping. Not sure how to acehive the same in flink. Any
pointers would be really helpful.

For now I have solved this problem by having the same parallelism on the
web-service operator, parser, sink which causes flink to chain these
operator together and execute them in the same thread.But ideally I would
like to have more instances of the slow operator and less instances of my
fast operator.

~
Karthik

Mime
View raw message