kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Roesler <j...@confluent.io>
Subject Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams
Date Wed, 17 Jul 2019 22:20:05 GMT
Yes, I believe that's what I had in mind. Again, not totally sure it
makes sense, but I believe something similar is the rationale for
having the partitioner option in Produced.

Thanks,
-John

On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
<levani.codes@gmail.com> wrote:
>
> Hey John,
>
> Oh that’s interesting use-case.
> Do I understand this correctly, in your example I would first issue repartition(Repartitioned)
with proper partitioner that essentially would be the same as the topic I want to join with
and then do the KStream#join with DSL?
>
> Regards,
> Levani
>
> > On Jul 17, 2019, at 11:11 PM, John Roesler <john@confluent.io> wrote:
> >
> > Hey, all, just to chime in,
> >
> > I think it might be useful to have an option to specify the
> > partitioner. The case I have in mind is that some data may get
> > repartitioned and then joined with an input topic. If the right-side
> > input topic uses a custom partitioning strategy, then the
> > repartitioned stream also needs to be partitioned with the same
> > strategy.
> >
> > Does that make sense, or did I maybe miss something important?
> >
> > Thanks,
> > -John
> >
> > On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
> > <levani.codes@gmail.com> wrote:
> >>
> >> Yes, I was thinking about it as well. To be honest I’m not sure about it yet.
> >> As Kafka Streams DSL user, I don’t really think I would need control over
partitioner for internal topics.
> >> As a user, I would assume that Kafka Streams knows best how to partition data
for internal topics.
> >> In this KIP I wrote that Produced should be used only for topics that are created
by user In advance.
> >> In those cases maybe it make sense to have possibility to specify the partitioner.
> >> I don’t have clear answer on that yet, but I guess specifying the partitioner
can be added as well if there’s agreement on this.
> >>
> >> Regards,
> >> Levani
> >>
> >>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <sophie@confluent.io>
wrote:
> >>>
> >>> Thanks for clearing that up. I agree that Repartitioned would be a useful
> >>> addition. I'm wondering if it might also need to have
> >>> a withStreamPartitioner method/field, similar to Produced? I'm not sure
how
> >>> widely this feature is really used, but seems it should be available for
> >>> repartition topics.
> >>>
> >>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <levani.codes@gmail.com>
> >>> wrote:
> >>>
> >>>> Hey Sophie,
> >>>>
> >>>> In both cases KStream#repartition and KStream#repartition(Repartitioned)
> >>>> topic will be created and managed by Kafka Streams.
> >>>> Idea of Repartitioned is to give user more control over the topic such
as
> >>>> num of partitions.
> >>>> I feel like Repartitioned parameter is something that is missing in
> >>>> current DSL design.
> >>>> Essentially giving user control over parallelism by configuring num
of
> >>>> partitions for internal topics.
> >>>>
> >>>> Hope this answers your question.
> >>>>
> >>>> Regards,
> >>>> Levani
> >>>>
> >>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <sophie@confluent.io>
> >>>> wrote:
> >>>>>
> >>>>> Hey Levani,
> >>>>>
> >>>>> Thanks for the KIP! Can you clarify one thing for me -- for the
> >>>>> KStream#repartition signature taking a Repartitioned, will the topic
be
> >>>>> auto-created by Streams (which seems to be the case for the signature
> >>>>> without a Repartitioned) or does it have to be pre-created? The
wording
> >>>> in
> >>>>> the KIP makes it seem like one version of the method will auto-create
> >>>>> topics while the other will not.
> >>>>>
> >>>>> Cheers,
> >>>>> Sophie
> >>>>>
> >>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
> >>>> levani.codes@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> One more bump about KIP-221 (
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>> <
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>> )
> >>>>>> so it doesn’t get lost in mailing list :)
> >>>>>> Would love to hear communities opinions/concerns about this
KIP.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Levani
> >>>>>>
> >>>>>>
> >>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <levani.codes@gmail.com
> >>>>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Kind reminder about this KIP:
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>> <
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Levani
> >>>>>>>
> >>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
> >>>> levani.codes@gmail.com
> >>>>>> <mailto:levani.codes@gmail.com>> wrote:
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> In order to move this KIP forward, I’ve updated confluence
page with
> >>>>>> the new proposal
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>> <
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221:+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
> >>>>>>>
> >>>>>>>> I’ve also filled “Rejected Alternatives” section.
> >>>>>>>>
> >>>>>>>> Looking forward to discuss this KIP :)
> >>>>>>>>
> >>>>>>>> King regards,
> >>>>>>>> Levani
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
> >>>> levani.codes@gmail.com
> >>>>>> <mailto:levani.codes@gmail.com>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hello Matthias,
> >>>>>>>>>
> >>>>>>>>> Thanks for the feedback and ideas.
> >>>>>>>>> I like the idea of introducing dedicated `Topic`
class for topic
> >>>>>> configuration for internal operators like `groupedBy`.
> >>>>>>>>> Would be great to hear others opinion about this
as well.
> >>>>>>>>>
> >>>>>>>>> Kind regards,
> >>>>>>>>> Levani
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax
<matthias@confluent.io
> >>>>>> <mailto:matthias@confluent.io>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Levani,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for picking up this KIP! And thanks for
summarizing
> >>>> everything.
> >>>>>>>>>> Even if some points may have been discussed
already (can't really
> >>>>>>>>>> remember), it's helpful to get a good summary
to refresh the
> >>>>>> discussion.
> >>>>>>>>>>
> >>>>>>>>>> I think your reasoning makes sense. With regard
to the distinction
> >>>>>>>>>> between operators that manage topics and operators
that use
> >>>>>> user-created
> >>>>>>>>>> topics: Following this argument, it might indicate
that leaving
> >>>>>>>>>> `through()` as-is (as an operator that uses
use-defined topics) and
> >>>>>>>>>> introducing a new `repartition()` operator (an
operator that manages
> >>>>>>>>>> topics itself) might be good. Otherwise, there
is one operator
> >>>>>>>>>> `through()` that sometimes manages topics but
sometimes not; a
> >>>>>> different
> >>>>>>>>>> name, ie, new operator would make the distinction
clearer.
> >>>>>>>>>>
> >>>>>>>>>> About adding `numOfPartitions` to `Grouped`.
I am wondering if the
> >>>>>> same
> >>>>>>>>>> argument as for `Produced` does apply and adding
it is semantically
> >>>>>>>>>> questionable? Might be good to get opinions
of others on this, too.
> >>>> I
> >>>>>> am
> >>>>>>>>>> not sure myself what solution I prefer atm.
> >>>>>>>>>>
> >>>>>>>>>> So far, KS uses configuration objects that allow
to configure a
> >>>>>> certain
> >>>>>>>>>> "entity" like a consumer, producer, store. If
we assume that a topic
> >>>>>> is
> >>>>>>>>>> a similar entity, I am wonder if we should have
a
> >>>>>>>>>> `Topic#withNumberOfPartitions()` class and method
instead of a plain
> >>>>>>>>>> integer? This would allow us to add other configs,
like replication
> >>>>>>>>>> factor, retention-time etc, easily, without
the need to change the
> >>>>>> "main
> >>>>>>>>>> API".
> >>>>>>>>>>
> >>>>>>>>>> Just want to give some ideas. Not sure if I
like them myself. :)
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> -Matthias
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
> >>>>>>>>>>> Actually, giving it more though - maybe
enhancing Produced with num
> >>>>>> of partitions configuration is not the best approach. Let me
explain
> >>>> why:
> >>>>>>>>>>>
> >>>>>>>>>>> 1) If we enhance Produced class with this
configuration, this will
> >>>>>> also affect KStream#to operation. Since KStream#to is the final
sink of
> >>>> the
> >>>>>> topology, for me, it seems to be reasonable assumption that
user needs
> >>>> to
> >>>>>> manually create sink topic in advance. And in that case, having
num of
> >>>>>> partitions configuration doesn’t make much sense.
> >>>>>>>>>>>
> >>>>>>>>>>> 2) Looking at Produced class, based on API
contract, seems like
> >>>>>> Produced is designed to be something that is explicitly for
producer
> >>>> (key
> >>>>>> serializer, value serializer, partitioner those all are producer
> >>>> specific
> >>>>>> configurations) and num of partitions is topic level configuration.
And
> >>>> I
> >>>>>> don’t think mixing topic and producer level configurations
together in
> >>>> one
> >>>>>> class is the good approach.
> >>>>>>>>>>>
> >>>>>>>>>>> 3) Looking at KStream interface, seems like
Produced parameter is
> >>>>>> for operations that work with non-internal (e.g topics created
and
> >>>> managed
> >>>>>> internally by Kafka Streams) topics and I think we should leave
it as
> >>>> it is
> >>>>>> in that case.
> >>>>>>>>>>>
> >>>>>>>>>>> Taking all this things into account, I think
we should distinguish
> >>>>>> between DSL operations, where Kafka Streams should create and
manage
> >>>>>> internal topics (KStream#groupBy) vs topics that should be created
in
> >>>>>> advance (e.g KStream#to).
> >>>>>>>>>>>
> >>>>>>>>>>> To sum it up, I think adding numPartitions
configuration in
> >>>> Produced
> >>>>>> will result in mixing topic and producer level configuration
in one
> >>>> class
> >>>>>> and it’s gonna break existing API semantics.
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding making topic name optional in
KStream#through - I think
> >>>>>> underline idea is very useful and giving users possibility to
specify
> >>>> num
> >>>>>> of partitions there is even more useful :) Considering arguments
against
> >>>>>> adding num of partitions in Produced class, I see two options
here:
> >>>>>>>>>>> 1) Add following method overloads
> >>>>>>>>>>> * through() - topic will be auto-generated
and num of partitions
> >>>>>> will be taken from source topic
> >>>>>>>>>>> * through(final int numOfPartitions) - topic
will be auto
> >>>>>> generated with specified num of partitions
> >>>>>>>>>>> * through(final int numOfPartitions, final
Produced<K, V>
> >>>>>> produced) - topic will be with generated with specified num
of
> >>>> partitions
> >>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>> 2) Leave KStream#through as it is and introduce
new method -
> >>>>>> KStream#repartition (I think Matthias suggested this in one
of the
> >>>> threads)
> >>>>>>>>>>>
> >>>>>>>>>>> Considering all mentioned above I propose
the following plan:
> >>>>>>>>>>>
> >>>>>>>>>>> Option A:
> >>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>> 2) Add num of partitions configuration to
Grouped class (as
> >>>>>> mentioned in the KIP)
> >>>>>>>>>>> 3) Add following method overloads to KStream#through
> >>>>>>>>>>> * through() - topic will be auto-generated
and num of partitions
> >>>>>> will be taken from source topic
> >>>>>>>>>>> * through(final int numOfPartitions) - topic
will be auto
> >>>>>> generated with specified num of partitions
> >>>>>>>>>>> * through(final int numOfPartitions, final
Produced<K, V>
> >>>>>> produced) - topic will be with generated with specified num
of
> >>>> partitions
> >>>>>> and configuration taken from produced parameter.
> >>>>>>>>>>>
> >>>>>>>>>>> Option B:
> >>>>>>>>>>> 1) Leave Produced as it is
> >>>>>>>>>>> 2) Add num of partitions configuration to
Grouped class (as
> >>>>>> mentioned in the KIP)
> >>>>>>>>>>> 3) Add new operator KStream#repartition
for creating and managing
> >>>>>> internal repartition topics
> >>>>>>>>>>>
> >>>>>>>>>>> P.S. I’m sorry if all of this was already
discussed in the mailing
> >>>>>> list, but I kinda got with all the threads that were about this
KIP :(
> >>>>>>>>>>>
> >>>>>>>>>>> Kind regards,
> >>>>>>>>>>> Levani
> >>>>>>>>>>>
> >>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze
<
> >>>>>> levani.codes@gmail.com <mailto:levani.codes@gmail.com>>
wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I would like to resurrect discussion
around KIP-221. Going through
> >>>>>> the discussion thread, there’s seems to agreement around usefulness
of
> >>>> this
> >>>>>> feature.
> >>>>>>>>>>>> Regarding the implementation, as far
as I understood, the most
> >>>>>> optimal solution for me seems the following:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) Add two method overloads to KStream#through
method (essentially
> >>>>>> making topic name optional)
> >>>>>>>>>>>> 2) Enhance Produced class with numOfPartitions
configuration
> >>>> field.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Those two changes will allow DSL users
to control parallelism and
> >>>>>> trigger re-partition without doing stateful operations.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I will update KIP with interface changes
around KStream#through if
> >>>>>> this changes sound sensible.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Kind regards,
> >>>>>>>>>>>> Levani
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
>

Mime
View raw message