kafka-jira mailing list archives

From "Guozhang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-6037) Make Sub-topology Parallelism Tunable
Date Tue, 10 Oct 2017 00:47:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guozhang Wang updated KAFKA-6037:
---------------------------------
    Description: 
Today the downstream sub-topology's parallelism (aka the number of tasks) is purely dependent
on the upstream sub-topology's parallelism, which ultimately depends on the source topic's
number of partitions. However, this does not work well in dynamic scaling scenarios.

Imagine you have a simple aggregation application. It would have two sub-topologies cut
by the repartition topic: the first sub-topology would be very light, as it reads from the input
topics and writes to the repartition topic based on the agg-key; the second sub-topology would
do the actual work with the agg state store, etc., and hence is computationally heavy. Right now
the first and second sub-topologies will always have the same number of tasks, because the
repartition topic's num.partitions is defined to be the same as the source topic's
num.partitions, so to scale up we have to increase the number of input topic partitions.
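
To make the coupling concrete, here is a minimal sketch in plain Java. It is not Kafka Streams
code: the class InheritedParallelism, the method taskCounts, and the "light"/"heavy" labels are
hypothetical names used only to model the behavior described above, where the repartition
topic inherits the source topic's partition count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the current behavior; not part of the Kafka Streams API.
final class InheritedParallelism {

    // Returns the task count of each sub-topology for a given source partition count.
    static Map<String, Integer> taskCounts(int sourcePartitions) {
        if (sourcePartitions <= 0) {
            throw new IllegalArgumentException("sourcePartitions must be positive");
        }
        // The repartition topic copies the source topic's partition count,
        // so the downstream sub-topology cannot be sized independently.
        int repartitionPartitions = sourcePartitions;

        Map<String, Integer> tasks = new LinkedHashMap<>();
        tasks.put("light", sourcePartitions);      // reads input, writes by agg-key
        tasks.put("heavy", repartitionPartitions); // aggregation with state store
        return tasks;
    }

    public static void main(String[] args) {
        // The heavy sub-topology only gains tasks when the input topic gains partitions.
        System.out.println(taskCounts(4));
        System.out.println(taskCounts(8));
    }
}
```

In this model the only input is the source partition count, which is exactly the limitation:
there is no separate knob for the heavy sub-topology.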

One way to improve on that is to use a large default number of partitions for repartition
topics and also allow users to override it (either through DSL code or through config). With
this, different sub-topologies could have different numbers of tasks, i.e. parallelism units.
In addition, users may also use this config to "hint" the DSL translator NOT to create the
repartition topics (i.e. to not break up the sub-topologies) when they have knowledge of the
data format.
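
The proposal can be sketched in plain Java as follows. Everything here is an assumption made
for illustration: the class TunableParallelism, the method downstreamTasks, the
skipRepartition flag, and the default of 64 are hypothetical, since the issue only says "a
default large number" and does not define an API or config name.

```java
import java.util.OptionalInt;

// Hypothetical model of the proposed behavior; not part of the Kafka Streams API.
final class TunableParallelism {

    // Assumed value; the issue does not specify what the large default would be.
    static final int DEFAULT_REPARTITION_PARTITIONS = 64;

    // Returns the number of tasks that execute the downstream (heavy) work.
    static int downstreamTasks(int upstreamTasks, OptionalInt override, boolean skipRepartition) {
        if (upstreamTasks <= 0) {
            throw new IllegalArgumentException("upstreamTasks must be positive");
        }
        if (skipRepartition) {
            // Hint: no repartition topic is created, the sub-topologies are not
            // split, so the downstream work runs inside the upstream tasks.
            return upstreamTasks;
        }
        // A user override wins; otherwise the large default applies.
        int partitions = override.orElse(DEFAULT_REPARTITION_PARTITIONS);
        if (partitions <= 0) {
            throw new IllegalArgumentException("override must be positive");
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(downstreamTasks(4, OptionalInt.empty(), false));
        System.out.println(downstreamTasks(4, OptionalInt.of(16), false));
        System.out.println(downstreamTasks(4, OptionalInt.empty(), true));
    }
}
```

With these assumptions, a 4-partition input topic no longer caps the heavy sub-topology at 4
tasks unless the user explicitly asks to keep the sub-topologies merged.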

> Make Sub-topology Parallelism Tunable
> -------------------------------------
>
>                 Key: KAFKA-6037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6037
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Guozhang Wang
>              Labels: needs-kip



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
