kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthias J. Sax" <matth...@confluent.io>
Subject Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams
Date Wed, 03 Jul 2019 04:00:00 GMT
Levani,

Thanks for picking up this KIP! And thanks for summarizing everything.
Even if some points may have been discussed already (can't really
remember), it's helpful to get a good summary to refresh the discussion.

I think your reasoning makes sense. With regard to the distinction
between operators that manage topics and operators that use user-created
topics: Following this argument, it might indicate that leaving
`through()` as-is (as an operator that uses use-defined topics) and
introducing a new `repartition()` operator (an operator that manages
topics itself) might be good. Otherwise, there is one operator
`through()` that sometimes manages topics but sometimes not; a different
name, ie, new operator would make the distinction clearer.

About adding `numOfPartitions` to `Grouped`. I am wondering if the same
argument as for `Produced` does apply and adding it is semantically
questionable? Might be good to get opinions of others on this, too. I am
not sure myself what solution I prefer atm.

So far, KS uses configuration objects that allow to configure a certain
"entity" like a consumer, producer, store. If we assume that a topic is
a similar entity, I am wonder if we should have a
`Topic#withNumberOfPartitions()` class and method instead of a plain
integer? This would allow us to add other configs, like replication
factor, retention-time etc, easily, without the need to change the "main
API".

Just want to give some ideas. Not sure if I like them myself. :)


-Matthias



On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> Actually, giving it more though - maybe enhancing Produced with num of partitions configuration
is not the best approach. Let me explain why:
> 
> 1) If we enhance Produced class with this configuration, this will also affect KStream#to
operation. Since KStream#to is the final sink of the topology, for me, it seems to be reasonable
assumption that user needs to manually create sink topic in advance. And in that case, having
num of partitions configuration doesn’t make much sense. 
> 
> 2) Looking at Produced class, based on API contract, seems like Produced is designed
to be something that is explicitly for producer (key serializer, value serializer, partitioner
those all are producer specific configurations) and num of partitions is topic level configuration.
And I don’t think mixing topic and producer level configurations together in one class is
the good approach.
> 
> 3) Looking at KStream interface, seems like Produced parameter is for operations that
work with non-internal (e.g topics created and managed internally by Kafka Streams) topics
and I think we should leave it as it is in that case.
> 
> Taking all this things into account, I think we should distinguish between DSL operations,
where Kafka Streams should create and manage internal topics (KStream#groupBy) vs topics that
should be created in advance (e.g KStream#to).
> 
> To sum it up, I think adding numPartitions configuration in Produced will result in mixing
topic and producer level configuration in one class and it’s gonna break existing API semantics.
> 
> Regarding making topic name optional in KStream#through - I think underline idea is very
useful and giving users possibility to specify num of partitions there is even more useful
:) Considering arguments against adding num of partitions in Produced class, I see two options
here:
> 1) Add following method overloads
> 	* through() - topic will be auto-generated and num of partitions will be taken from
source topic
>  	* through(final int numOfPartitions) - topic will be auto generated with specified
num of partitions
> 	* through(final int numOfPartitions, final Produced<K, V> produced) - topic will
be with generated with specified num of partitions and configuration taken from produced parameter.
> 2) Leave KStream#through as it is and introduce new method - KStream#repartition (I think
Matthias suggested this in one of the threads)
> 
> Considering all mentioned above I propose the following plan:
> 
> Option A:
> 1) Leave Produced as it is
> 2) Add num of partitions configuration to Grouped class (as mentioned in the KIP)
> 3) Add following method overloads to KStream#through
> 	* through() - topic will be auto-generated and num of partitions will be taken from
source topic
>  	* through(final int numOfPartitions) - topic will be auto generated with specified
num of partitions
> 	* through(final int numOfPartitions, final Produced<K, V> produced) - topic will
be with generated with specified num of partitions and configuration taken from produced parameter.
> 
> Option B:
> 1) Leave Produced as it is
> 2) Add num of partitions configuration to Grouped class (as mentioned in the KIP)
> 3) Add new operator KStream#repartition for creating and managing internal repartition
topics
> 
> P.S. I’m sorry if all of this was already discussed in the mailing list, but I kinda
got with all the threads that were about this KIP :(
> 
> Kind regards,
> Levani
> 
>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <levani.codes@gmail.com> wrote:
>>
>> Hello,
>>
>> I would like to resurrect discussion around KIP-221. Going through the discussion
thread, there’s seems to agreement around usefulness of this feature. 
>> Regarding the implementation, as far as I understood, the most optimal solution for
me seems the following:
>>
>> 1) Add two method overloads to KStream#through method (essentially making topic name
optional)
>> 2) Enhance Produced class with numOfPartitions configuration field.
>>
>> Those two changes will allow DSL users to control parallelism and trigger re-partition
without doing stateful operations.
>>
>> I will update KIP with interface changes around KStream#through if this changes sound
sensible.
>>
>> Kind regards,
>> Levani
> 
> 


Mime
View raw message