flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chesnay Schepler <ches...@apache.org>
Subject Re: Decouple Kafka partitions and Flink parallelism for ordered streams
Date Wed, 11 Oct 2017 14:31:46 GMT
It is correct that keyBy and partition operations will distribute 
messages over the network
as they distribute the data across all subtasks. For this use-case we 
only want to consider
subtasks that are subsequent to our operator, like a local keyBy.

I don't think there is an obvious way to implement it, but I'm currently 
theory-crafting a bit
and will get back to you.

On 11.10.2017 14:52, Sanne de Roever wrote:
> Hi,
>
> Currently we need 75 Kafka partitions per topic and a parallelism of 
> 75 to meet required performance, increasing the partitions and 
> parallelism gives diminished returns
>
> Currently the performance is approx. 1500 msg/s per core, having one 
> pipeline (source, map, sink) deployed as one instance per core.
>
> The Kafka source performance is not an issue. The map is very heavy 
> (deserialization, validation) on rather complex Avro messages. Object 
> reuse is enabled.
>
> Ideally we would like to decouple Flink processing parallelism from 
> Kafka partitions in a following manner:
>
>   * Pick a source parallelism
>   * Per source, be able to pick a parallelism for the following map
>   * In such a way that some message key determines which -local- map
>     instance gets a message from a certain visitor
>   * So that messages with the same visitor key get processed by the
>     same map and in order for that visitor
>   * Output the result to Kafka
>
> AFAIK keyBy, partitionCustom will distribute messages over the network 
> and rescale has no affinity for message identity.
>
> Am I missing something obvious?
>
> Cheers,
>
> Sanne
>
>
>
>


Mime
View raw message