flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Theo Diefenthal <theo.diefent...@scoop-software.de>
Subject RE: Keyed stream, parallelism, load balancing and ensuring that the same key go to the same Task Manager and task slot
Date Thu, 19 Dec 2019 17:59:43 GMT
Hi Krzysztof,

You can just key your stream by transaction id. If you have lots of
different transaction ids, you can expect the load to be evenly
distributed. All events with the same key (==transaction id) will be
processed by the same task slot.

If you only have a few kafka partitions, you could key by transaction id
as early as possible in order to fully utilize your cluster. Remember,
however, that each keyby will cause a network shuffle, so it's probably
not worth it to fist key by transaction id, then by traded, and afterwards
again by transaction id. 

Best regards

-----Original Message-----
From: KristoffSC <krzysiek.chmielewski@gmail.com> 
Sent: Dienstag, 17. Dezember 2019 23:35
To: user@flink.apache.org
Subject: Keyed stream, parallelism, load balancing and ensuring that the
same key go to the same Task Manager and task slot

Hi community,
I'm trying to build a PoC pipeline for my project and I have few questions
regarding load balancing between task managers and ensuring that keyed
stream events for the same key will go to the same Task Manager (hence the
same task slot).

Lets assume that we have 3 task managers, 3 task slot each. So it gives us
9 task slots in total.
The source is a Kafka topic with N partitions. Events are "linked" with
each other by transactionId (long) field. So they can be keyed by this
Events for particular transactionId can be spanned across many partitions
(we don't have control over this).

The pipeline is:
1. Kafka Source -> produces RawEvents (map operator).
2. Enrichment with AsuncFuntion(simple DB/cache call) produces
EnrichedEvents with map operator.
3. Key EnrichedEvents by tradeId, buffer events for some time, sort them
by sequenceNumber (Window aggregation) and emit a new event based on
N sorted EnrichedEvents produces one TransactionEvent for this
4. Sink TransactionEvents

1. Have high task slot utilization (Low number of idle/un-addressed task
2. EnrichedEvents for the same transactionId should go to the same
TaskSlot (hence the same TaskManager).

How this can be achieved?
How parallelism value for each operator should be set?

Probably I can already key the original RawEvents on transactionId.


Sent from:

View raw message